2021
DOI: 10.1609/aaai.v35i2.16219

Partially Non-Autoregressive Image Captioning

Abstract: Current state-of-the-art image captioning systems usually generate descriptions autoregressively, i.e., every forward step conditions on the given image and the previously produced words. This sequential property causes an unavoidable decoding latency. Non-autoregressive image captioning, on the other hand, predicts the entire sentence simultaneously and accelerates inference significantly. However, it removes the word dependencies within a caption and commonly suffers from repetition or omission issues. To make…
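The autoregressive/non-autoregressive contrast described in the abstract can be sketched with a toy decoder. This is an illustrative placeholder, not the paper's actual model: `predict_word` stands in for a real decoder step, and the vocabulary is invented.

```python
# Toy sketch of autoregressive vs. non-autoregressive caption decoding.
# `predict_word` is a deterministic dummy standing in for a trained decoder.

VOCAB = ["a", "dog", "runs", "on", "grass"]

def predict_word(image, prefix, position):
    """Dummy decoder step: picks a word by position, for illustration only."""
    return VOCAB[position % len(VOCAB)]

def decode_autoregressive(image, length=5):
    # Each step conditions on all previously produced words, so the
    # `length` decoder calls must run sequentially (the latency source).
    caption = []
    for t in range(length):
        caption.append(predict_word(image, tuple(caption), t))
    return caption

def decode_non_autoregressive(image, length=5):
    # Every position is predicted from the image alone; the calls are
    # independent and could run in parallel, but no word sees its neighbors,
    # which is why repetition/omission issues arise in practice.
    return [predict_word(image, (), t) for t in range(length)]
```

With a real model the two decoders would generally disagree; here the dummy predictor makes them coincide, which keeps the sketch deterministic.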

Cited by 17 publications (4 citation statements)
References 30 publications
“…In [74], an auto-parsing network (APN) was proposed, which contains a probabilistic graphical model (PGM)-constrained self-attention to boost transformer-based captioning. A partially non-autoregressive model was introduced in [75], which retains the accuracy of autoregressive models while enjoying the speedup of non-autoregressive models in image captioning. RSTNet was recently proposed in [76], which leverages grid-augmented features and an adaptive attention mechanism to model visual and non-visual words when captioning images.…”
Section: B. Image Captioning
confidence: 99%
“…In recent years, Transformer-based architectures [9,14,17,26,32,40,59] have been introduced to replace conventional RNNs, achieving new state-of-the-art performance. On the other hand, many mask-based non-autoregressive decoding methods have been studied for inference acceleration with a global perspective [13,15,17,18,20]. However, to the best of our knowledge, improving the original language decoding with supervised future information from the NAIC decoder has never been studied in image captioning, which motivates our exploration in this paper.…”
Section: Related Work
confidence: 99%
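The mask-based non-autoregressive decoding cited above can be illustrated with a minimal Mask-Predict-style refinement loop. This is a simplified sketch under stated assumptions: `fill_masks` is a dummy stand-in for a real decoder forward pass, and the linear re-masking schedule is one common but not universal choice.

```python
import random

MASK = "[MASK]"

def fill_masks(tokens):
    """Dummy predictor: fills every masked slot and returns a per-position
    confidence (stand-in for a real NAIC decoder's softmax probabilities)."""
    out, conf = [], []
    for i, tok in enumerate(tokens):
        if tok == MASK:
            out.append(f"w{i}")          # placeholder word for position i
            conf.append(random.random()) # fake confidence score
        else:
            out.append(tok)
            conf.append(1.0)             # kept tokens are fully trusted
    return out, conf

def mask_predict(length=6, iterations=3):
    # Start fully masked; each iteration predicts all positions in parallel,
    # then re-masks the least confident ones, fewer with every pass.
    tokens = [MASK] * length
    for it in range(iterations):
        tokens, conf = fill_masks(tokens)
        n_mask = length * (iterations - 1 - it) // iterations
        worst = sorted(range(length), key=lambda i: conf[i])[:n_mask]
        for i in worst:
            tokens[i] = MASK
    return tokens
```

The final iteration re-masks zero positions, so the returned caption is always complete; repeated parallel passes are what trade a little latency back for quality.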
“…image captioning (NAIC) models are proposed to improve decoding speed by predicting every word in parallel (Gu et al. 2018; Fei 2019, 2021b). This advantage comes at a cost in performance, since modeling the next word is harder when it is not conditioned on sufficient context.…”
Section: Introduction
confidence: 99%
“…(Yan et al. 2021) evenly splits captions into word groups and generates the groups synchronously. In addition to parallel generation, a range of semi-autoregressive models (Wang, Zhang, and Chen 2018; Ghazvininejad, Levy, and Zettlemoyer 2020; Stern et al. 2019; Gu, Wang, and Zhao 2019; Fei 2021b; Fei et al. 2022b,a; Zhou et al. 2021) focus on non-monotonic sequence generation with limited forms of autoregressiveness, i.e., tree-like traversal, mainly based on the insertion operation. However, all these image captioning methods treat every word in a sentence equally and ignore the generation completeness between them, which does not match how humans actually compose sentences.…”
Section: Introduction
confidence: 99%
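The group-wise decoding that the last statement attributes to (Yan et al. 2021) can be sketched as follows. The split and the per-step word emission are illustrative assumptions, not the authors' exact algorithm; the word strings are placeholders for model predictions.

```python
def split_into_groups(length, k):
    """Evenly split `length` positions into `k` contiguous groups
    (earlier groups absorb any remainder)."""
    base, rem = divmod(length, k)
    groups, start = [], 0
    for g in range(k):
        size = base + (1 if g < rem else 0)
        groups.append(list(range(start, start + size)))
        start += size
    return groups

def decode_groupwise(length, k):
    # At step t, every group emits its t-th word simultaneously,
    # conditioning only on earlier words of its own group: autoregressive
    # within a group, parallel across groups.
    groups = split_into_groups(length, k)
    caption = [None] * length
    steps = max(len(g) for g in groups)
    for t in range(steps):
        for g in groups:
            if t < len(g):
                pos = g[t]
                caption[pos] = f"w{pos}"  # stand-in for a model prediction
    return caption, steps
```

The point of the scheme is latency: a 7-word caption decoded in 3 groups needs only 3 sequential steps instead of 7.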