Proceedings of the 29th ACM International Conference on Multimedia 2021
DOI: 10.1145/3474085.3475179

Semi-Autoregressive Image Captioning

Abstract: Current state-of-the-art approaches for image captioning typically adopt an autoregressive manner, i.e., generating descriptions word by word, which suffers from slow decoding and becomes a bottleneck in real-time applications. Non-autoregressive image captioning with continuous iterative refinement, which eliminates the sequential dependence in sentence generation, can achieve performance comparable to its autoregressive counterparts with considerable acceleration. Nevertheless, based on a well-desi…
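To make the trade-off concrete, here is a minimal PyTorch-style sketch of the two decoding regimes the abstract contrasts. `decoder` is a hypothetical captioning model call (not this paper's API) that maps image features and a partial token sequence to per-position vocabulary logits.

```python
import torch

def autoregressive_decode(decoder, image_feats, bos_id, eos_id, max_len=20):
    """Baseline regime: one token per forward pass, strictly sequential."""
    tokens = [bos_id]
    for _ in range(max_len):
        logits = decoder(image_feats, torch.tensor([tokens]))  # (1, t, V)
        next_id = int(logits[0, -1].argmax())
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens

def nonautoregressive_decode(decoder, image_feats, mask_id, length=20, iters=5):
    """Mask-predict-style regime: predict every position in parallel, then
    iteratively re-mask and re-predict the least confident positions."""
    tokens = torch.full((1, length), mask_id)
    for step in range(iters):
        logits = decoder(image_feats, tokens)        # (1, L, V)
        probs, preds = logits.softmax(-1).max(-1)    # per-slot confidence
        tokens = preds
        # Re-mask a shrinking fraction of low-confidence slots each round.
        n_mask = int(length * (1 - (step + 1) / iters))
        if n_mask == 0:
            break
        worst = probs[0].argsort()[:n_mask]
        tokens[0, worst] = mask_id
    return tokens[0].tolist()
```

The sequential loop needs `max_len` decoder calls, while the refinement loop needs only `iters` calls regardless of caption length, which is where the acceleration described in the abstract comes from.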

Cited by 15 publications (2 citation statements)
References 42 publications
“…The attention mechanism was first introduced to augment vanilla recurrent networks (Bahdanau, Cho, and Bengio 2014; Luong, Pham, and Manning 2015), and Transformer-based models (Li et al 2019; Fei 2019; Cornia et al 2020; Pan et al 2020; Fei 2021; Yan et al 2021; Ji et al 2021) were later proposed to replace the conventional RNN, achieving new state-of-the-art performance. However, to the best of our knowledge, improving the attention distribution with self-supervised mask perturbation has never been studied in the image captioning task, which motivates our exploration in this paper.…”
Section: Related Work
confidence: 99%
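Since the quoted passage centers on attention over recurrent or Transformer features, a minimal sketch of scaled dot-product attention may help; the function name and shapes are illustrative assumptions, and the `mask` argument marks one place where a mask-perturbation scheme like the one mentioned could intervene on the attention distribution.

```python
import torch

def attention(query, key, value, mask=None):
    """query: (B, Tq, d); key, value: (B, Tk, d) -> (B, Tq, d)."""
    d = query.size(-1)
    scores = query @ key.transpose(-2, -1) / d ** 0.5   # (B, Tq, Tk)
    if mask is not None:
        # Zeroed-out entries are excluded from the distribution; perturbing
        # this mask alters the attention weights, as the passage suggests.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = scores.softmax(dim=-1)                    # attention distribution
    return weights @ value
```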
“…Besides, (Fei 2020) introduces latent variables to eliminate the modality gap and develops a more powerful probabilistic framework to simulate more complicated distributions. (Yan et al 2021) splits captions evenly into word groups and produces the groups synchronously. In addition to parallel generation, a range of semi-autoregressive models (Wang, Zhang, and Chen 2018; Ghazvininejad, Levy, and Zettlemoyer 2020; Stern et al 2019; Gu, Wang, and Zhao 2019; Fei 2021b; Fei et al 2022b,a; Zhou et al 2021) focus on non-monotonic sequence generation with limited forms of autoregressiveness, i.e., tree-like traversal, mainly based on the insertion operation.…”
Section: Introduction
confidence: 99%
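As a rough illustration of the group-wise semi-autoregressive scheme the passage attributes to (Yan et al 2021), the sketch below emits a caption in `k` sequential steps, producing a whole group of tokens in parallel at each step. The `decoder` interface, including its hypothetical `n_future` argument, is an assumption for illustration rather than the cited model's actual API.

```python
import torch

def semi_autoregressive_decode(decoder, image_feats, bos_id, length=20, k=4):
    """k decoder calls instead of `length`: roughly a length/k speedup."""
    group = length // k                   # tokens emitted per step
    tokens = torch.tensor([[bos_id]])
    for _ in range(k):
        # Assumed interface: given the prefix, return logits for the next
        # `group` positions at once, shape (1, group, V).
        logits = decoder(image_feats, tokens, n_future=group)
        new = logits.argmax(-1)           # (1, group) greedy picks
        tokens = torch.cat([tokens, new], dim=1)
    return tokens[0, 1:].tolist()         # drop the BOS token
```

Each group is conditioned on all previously generated groups, so the method keeps left-to-right structure across groups while removing it within a group.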