2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021
DOI: 10.1109/cvpr46437.2021.01521

RSTNet: Captioning with Adaptive Attention on Visual and Non-Visual Words

Cited by 186 publications (126 citation statements)
References 23 publications
“…Adding the mesh-like connectivity to the decoder further improves the results to 140.6 CIDEr points. This represents an increase of 5.0 CIDEr points with respect to the current state of the art when training on the COCO dataset exclusively [44]. Further, in Fig.…”
Section: Comparison With the State Of The Art (mentioning)
confidence: 79%
“…Language model. Despite RNN-based language models have been the standard strategy for generating the caption, convolutional language models [40] and fully-attentive language models [14], [41], [42], [43], [44] based on the Transformer paradigm [45] have been explored for image captioning, also motivated by the success of these approaches on Natural Language Processing tasks such as machine translation and language understandings [45], [46], [47]. Moreover, the introduction of Transformer-based language models has brought to the development of effective variants or modifications of the self-attention operator [7], [11], [12], [13], [48], [49], [8] and has enabled vision-and-language early-fusion [19], [22], [50], based on BERT-like architectures [46].…”
Section: Related Work (mentioning)
confidence: 99%
“…Cornia et al [3] proposed the Meshed-Memory Transformer model, which included a multi-layer encoder for region features and a multi-layer decoder that generated output sentences; A mesh-like structure was also proposed to connect encoding and decoding layers to exploit both low-level and high-level contributions. The exploration of self-attention mechanism in the Image Captioning problem are still trendy up to now; many studies improve the performance on this problem via this direction [4,5,20]. On the other hand, some studies realized that just embedding visual contents was not enough; then, they attempted to combine some semantic features such as name entities or attributes of the relationship.…”
Section: Previous Approaches (mentioning)
confidence: 99%
“…This is an extremely challenging task because traditional captioning models were not adapted to the Text-based Image Captioning problem when now the hypothesis caption should be conditioned in scene texts. However, previous models just utilized only visual entities [2,3,4] or global semantic of the image [5], they completely ignored scene texts.…”
Section: Introduction (mentioning)
confidence: 99%