Proceedings of the 30th ACM International Conference on Multimedia 2022
DOI: 10.1145/3503161.3547840
Efficient Modeling of Future Context for Image Captioning

Abstract: Existing approaches to image captioning usually generate the sentence word-by-word from left to right, conditioned only on local context, i.e., the given image and the previously generated words. Many studies aim to exploit global information during decoding, e.g., via iterative refinement. However, how to effectively and efficiently incorporate future context remains under-explored. To address this issue, inspired by the fact that Non-Autoregressive Image Captioning (NAIC) c…

Cited by 6 publications (2 citation statements)
References 48 publications
“…This gives the resulting sentences more detail in their descriptions of the scene than traditional approaches. More recently, Fei [67] proposed a model that generates descriptions that effectively exploit the global context of the scene without incurring additional inference cost. The model is trained with two sets: one contains the description labels, and the other includes the description of the general context of the image.…”
Section: Review and Discussion
Citation type: mentioning
Confidence: 99%
“…Recently, the growing interest in multimodal research (Fei 2022; Li et al. 2022a; Chen et al. 2022; Jing et al. 2020; Ma et al. 2022, 2023; Ji et al. 2022; Huang et al. 2023; Zhao et al. 2023; Wu et al. 2023) at the intersection of computer vision and natural language processing has driven the development of systems that can understand and describe the world as humans do. Panoptic Narrative Grounding (PNG) (González et al. 2021) is an emerging visually-grounded language understanding task that aims to locate and segment all instances of objects and regions in an image corresponding to a given text description using binary pixel masks.…”
Section: Introduction
Citation type: mentioning
Confidence: 99%