2023
DOI: 10.1109/tpami.2022.3148210
From Show to Tell: A Survey on Deep Learning-Based Image Captioning

Cited by 178 publications (70 citation statements)
References 184 publications
“…Although the solution to the problem of automated image description generation yields acceptable results, there is still room for improvement because certain photos are not well described. Metric-oriented focal mechanism (MFM) [23]. The survey presented by [24] provides an in-depth analysis of image captioning methods, covering image encoding and text generation, as well as training methodologies, datasets, and assessment measures. The study concludes that many issues remain unresolved, including accuracy, robustness, and generalization.…”
Section: Image Captioning Methodologies
confidence: 99%
“…Image captioning requires both a deep understanding of the visual content of an image (its objects, attributes, and relationships) and the capability of a language model to generate syntactically and semantically correct descriptions. In particular, the language model is asked to generate a sentence conditioned on the image representation, whose role is key to obtaining satisfactory results [41].…”
Section: Related Work
confidence: 99%
“…Online evaluation. Finally, we also report the performance of our method on the online COCO test server. In this case, we also employ an ensemble of four models trained with the mesh-like connectivity.…”
Section: Comparison With The State Of The Art
confidence: 99%
“…Image captioning has received considerable interest in the last few years, as describing images in natural language is a fundamental step in modeling the interconnections between the visual and textual modalities [1]. Early works on this research line used convolutional neural networks [2], [3], [4], [5], usually pre-trained on classification tasks, or object detectors [6], [7], [8] as visual feature extractors, and recurrent neural networks as autoregressive language models [2], [3], [6], [9], [10].…”
Section: Introduction
confidence: 99%
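The last two statements describe the classic encoder-decoder captioning scheme: a pre-trained visual feature extractor produces an image representation, and an autoregressive language model generates the caption conditioned on it. A minimal toy sketch of that conditioning is below; all names, the tiny vocabulary, and the random weights are illustrative assumptions, not part of the surveyed systems (which use pre-trained CNNs/detectors and trained RNNs).

```python
import numpy as np

# Toy illustration of encoder-decoder captioning: the image feature sets the
# initial hidden state of a simple RNN, which then decodes tokens greedily.
# Weights are random stand-ins for a trained model.
rng = np.random.default_rng(0)
VOCAB = ["<bos>", "<eos>", "a", "dog", "runs"]   # assumed toy vocabulary
V, D = len(VOCAB), 8                             # vocab size, hidden size

W_img = rng.normal(0, 0.1, (D, 16))   # projects a 16-dim image feature
W_emb = rng.normal(0, 0.1, (V, D))    # token embeddings
W_hh  = rng.normal(0, 0.1, (D, D))    # recurrent weights
W_out = rng.normal(0, 0.1, (D, V))    # hidden state -> vocabulary logits

def caption(image_feat, max_len=5):
    """Greedy autoregressive decoding conditioned on the image feature."""
    h = np.tanh(W_img @ image_feat)          # conditioning: image -> initial state
    tok = VOCAB.index("<bos>")
    out = []
    for _ in range(max_len):
        h = np.tanh(W_hh @ h + W_emb[tok])   # simple RNN step
        tok = int(np.argmax(h @ W_out))      # greedy next-token choice
        if VOCAB[tok] == "<eos>":
            break
        out.append(VOCAB[tok])
    return out

print(caption(rng.normal(size=16)))
```

With random weights the output is meaningless; the point is only the data flow the citation statements describe: visual features condition the language model, which then generates one token at a time.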