2021
DOI: 10.1609/aaai.v35i3.16328

Dual-level Collaborative Transformer for Image Captioning

Abstract: Descriptive region features extracted by object detection networks have played an important role in the recent advancements of image captioning. However, they are still criticized for the lack of contextual information and fine-grained details, which in contrast are the merits of traditional grid features. In this paper, we introduce a novel Dual-Level Collaborative Transformer (DLCT) network to realize the complementary advantages of the two features. Concretely, in DLCT, these two features are first processe…
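To make the abstract's core idea concrete, below is a minimal PyTorch sketch of cross-attention between region features and grid features, so each stream absorbs the complementary information of the other. This is an illustration under assumptions, not the authors' released implementation: the module name `DualLevelCrossAttention`, the dimensions, and the residual/norm layout are hypothetical choices standing in for the paper's actual design.

```python
# Minimal sketch (NOT the authors' code) of the dual-level idea in DLCT:
# region features and grid features attend to each other so that regions
# gain context/fine detail and grids gain object-level semantics.
import torch
import torch.nn as nn

class DualLevelCrossAttention(nn.Module):
    """Hypothetical cross-attention block between region and grid features."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Regions query the grid features ...
        self.region_from_grid = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # ... and grid features query the regions.
        self.grid_from_region = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_r = nn.LayerNorm(d_model)
        self.norm_g = nn.LayerNorm(d_model)

    def forward(self, regions: torch.Tensor, grids: torch.Tensor):
        # regions: (B, N_r, d) from a detector; grids: (B, N_g, d) from a CNN.
        r2g, _ = self.region_from_grid(regions, grids, grids)
        g2r, _ = self.grid_from_region(grids, regions, regions)
        regions = self.norm_r(regions + r2g)  # residual + norm, Transformer-style
        grids = self.norm_g(grids + g2r)
        return regions, grids

# Usage: fuse 50 detected regions with a 7x7 grid of CNN features.
regions = torch.randn(2, 50, 512)
grids = torch.randn(2, 49, 512)
fused_r, fused_g = DualLevelCrossAttention()(regions, grids)
print(fused_r.shape, fused_g.shape)  # (2, 50, 512) (2, 49, 512)
```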

Cited by 215 publications (79 citation statements)
References 23 publications
“…Earlier Stage (two-stage model): RSTNet [52] with ResNeXt-101, CIDEr 133.3; RSTNet [52] with ResNeXt-152, 135.6; DLCT [25] with ResNeXt-101, 133.8…”
Section: Methods / Backbone / CIDEr
confidence: 99%
“…To represent the visual input, CNN-based solutions have been proposed for extracting global features [18,36] or grids of features [26,48], further improved through object detectors [2,27] that yield region-based feature representations, and through self-attention. As for the language model, earlier works implemented it as a recurrent neural network [15,18,20,36], while more recent approaches employ Transformer-based fully-attentive models [5,8,28,56]. The success of this latter strategy has also encouraged multi-modal early-fusion strategies [14,22,54], which proved the effectiveness of building a semantic representation of the image by also exploiting the text at the early stages of the captioning pipeline.…”
Section: Related Work
confidence: 99%
“…To this end, image representation plays a key role, making this aspect of great interest to the community working on image captioning and, in general, on tasks connecting vision and language. For years, image captioning approaches have relied on visual representations based on detected visual entities [2,27], among which relations have been modeled via graphs [49,51] or attention mechanisms [6,8,28,31].…”
Section: Introduction
confidence: 99%
“…A CNN-based decoder has also been explored in Aneja et al (2018), showing on-par performance while being easier to train (e.g., better training efficiency, less susceptibility to vanishing gradients) compared with the prominent LSTM design. Of late, Transformer-based decoders (Herdade et al, 2019; Li et al, 2019b; Cornia et al, 2020; Luo et al, 2021b) have become the most popular design choice. • Multimodal fusion.…”
Section: Similar Trends in Captioning and Retrieval Models
confidence: 99%