2022
DOI: 10.1609/aaai.v36i1.19940

Attention-Aligned Transformer for Image Captioning

Abstract: Recently, attention-based image captioning models, which are expected to ground the correct image regions for proper word generation, have achieved remarkable performance. However, some researchers have pointed out the "deviated focus" problem of existing attention mechanisms in determining the effective and influential image features. In this paper, we present A2 - an attention-aligned Transformer for image captioning, which guides attention learning in a perturbation-based self-supervised manner, without any annotation …
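The abstract describes the perturbation-based self-supervision only at a high level. As a rough illustration (not the authors' implementation), the Python sketch below shows one plausible way to derive a pseudo attention target by perturbing region features and measuring the drop in a caption-likelihood score, then aligning predicted attention to that target; the `score_fn` stand-in, the zero-masking perturbation, and the KL alignment loss are all assumptions introduced here for illustration.

```python
import torch
import torch.nn.functional as F

def perturbation_attention_target(score_fn, features):
    """Estimate each region's influence by zeroing it out and measuring the
    drop in a caption-likelihood score, then normalize the drops into a
    pseudo attention target. `score_fn` is a hypothetical stand-in for the
    captioner's log-likelihood of the ground-truth caption."""
    with torch.no_grad():
        base = score_fn(features)
        drops = []
        for i in range(features.size(0)):        # features: (num_regions, dim)
            perturbed = features.clone()
            perturbed[i] = 0.0                   # perturb (mask) region i
            drops.append(base - score_fn(perturbed))
        drops = torch.stack(drops).clamp(min=0.0)
    return drops / (drops.sum() + 1e-8)

def attention_alignment_loss(pred_attention, target_attention):
    """KL divergence pulling the predicted attention distribution toward the
    perturbation-derived target (one of several plausible alignment losses)."""
    return F.kl_div(pred_attention.clamp_min(1e-8).log(),
                    target_attention, reduction="sum")

# Toy usage with a dummy scoring function (a real captioner would go here).
feats = torch.randn(5, 8)                        # 5 regions, 8-dim features
score = lambda f: f.sum()                        # placeholder "likelihood"
target = perturbation_attention_target(score, feats)
pred = torch.softmax(torch.randn(5), dim=0)
loss = attention_alignment_loss(pred, target)
```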

Cited by 25 publications (8 citation statements) | References 38 publications
“…(2022), ViTCAP, an image captioning model based on a pure vision Transformer, is proposed, in which a grid representation is used without extracting regional features. In the study of Fei (2022), an attention-aligned Transformer for image captioning, called A2, is proposed, which guides attention learning in a perturbation-based, self-supervised way without any annotation overhead. In the study of Liu et al.…”
Section: Related Work
Mentioning confidence: 99%
“…Image Captioning. In recent years, a large number of neural systems have been proposed for the image captioning task [3,9,16,22,24,40,53,58]. The state-of-the-art approaches depend on the encoder-decoder framework to translate the image into a descriptive sentence.…”
Section: Related Work
Mentioning confidence: 99%
“…Image captioning, which aims to generate textual descriptions of input images, is a critical task in multimedia analysis (Stefanini et al 2021). Previous works in this area are mostly based on an encoder-decoder paradigm (Vinyals et al 2015; Xu et al 2015; Rennie et al 2017; Anderson et al 2018; Huang et al 2019; Cornia et al 2020; Pan et al 2020; Fei 2022; Li et al 2022; Yang, Liu, and Wang 2022), where a convolutional-neural-network-based image encoder first processes an input image into visual representations, and then a recurrent-neural-network- or Transformer-based language decoder produces a corresponding caption based on these extracted features. The generation process usually relies on a chain-rule factorization and is performed in an autoregressive manner, i.e., word by word from left to right.…”
Section: Introduction
Mentioning confidence: 99%
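As a point of reference for the encoder-decoder, autoregressive pipeline described in this excerpt, here is a minimal, self-contained Python/PyTorch sketch; the class name, feature dimensions, and greedy decoding loop are illustrative placeholders rather than any cited system's implementation.

```python
import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    """Minimal encoder-decoder captioner sketch: visual features are projected,
    then a Transformer decoder emits the caption word by word, left to right."""
    def __init__(self, vocab_size=1000, dim=256):
        super().__init__()
        self.visual_proj = nn.Linear(2048, dim)       # CNN region features -> model dim
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, regions, tokens):
        memory = self.visual_proj(regions)            # (B, R, dim) visual memory
        tgt = self.embed(tokens)                      # (B, T, dim) caption prefix
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        out = self.decoder(tgt, memory, tgt_mask=mask)  # causal (autoregressive) mask
        return self.head(out)                         # next-word logits

# Greedy autoregressive decoding: chain-rule factorization, one word at a time.
model = TinyCaptioner()
regions = torch.randn(1, 36, 2048)                    # 36 region features (placeholder)
tokens = torch.tensor([[1]])                          # BOS token id (placeholder)
for _ in range(10):
    logits = model(regions, tokens)
    next_tok = logits[:, -1].argmax(-1, keepdim=True)
    tokens = torch.cat([tokens, next_tok], dim=1)
```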