2022
DOI: 10.1007/s11042-022-13793-0

Multilevel attention and relation network based image captioning model

Cited by 15 publications (4 citation statements)
References 64 publications
“…To accomplish this objective, the author presents a Locality-Sensitive Transformer Network (LSTNet) with two novel designs: Locality-Sensitive Attention and Locality-Sensitive Fusion (LSF). In [29], a Local Relation Network (LRN) is constructed over the objects and image regions; it not only determines the connections between the objects and the image regions but also produces salient context-aware features for every region of the image. Finally, a modified LSTM employs an attention mechanism that focuses on relevant contextual information, spatial locations, and deep visual features.…”
Section: Related Work
confidence: 99%
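The attention-driven LSTM decoder described in the statement above (attending over region-level features while generating each word) can be illustrated with a minimal sketch. Everything below, including the class name RegionAttentionDecoder and the tensor dimensions, is an assumption made for illustration: it is a generic additive-attention LSTM decoding step over pre-extracted region features, not the implementation from the cited paper.

```python
import torch
import torch.nn as nn

class RegionAttentionDecoder(nn.Module):
    """Minimal sketch of an LSTM decoder that attends over image-region features.

    Hypothetical illustration only; module names and dimensions are assumptions,
    not the model from the cited paper.
    """

    def __init__(self, vocab_size, embed_dim=512, region_dim=2048, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Additive (Bahdanau-style) attention over region features.
        self.att_region = nn.Linear(region_dim, hidden_dim)
        self.att_hidden = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        self.lstm = nn.LSTMCell(embed_dim + region_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward_step(self, word_ids, region_feats, state):
        # word_ids: (B,), region_feats: (B, R, region_dim), state: (h, c)
        h, c = state
        # Attention weights over the R regions, conditioned on the current hidden state.
        scores = self.att_score(torch.tanh(
            self.att_region(region_feats) + self.att_hidden(h).unsqueeze(1)))  # (B, R, 1)
        alpha = torch.softmax(scores, dim=1)
        context = (alpha * region_feats).sum(dim=1)  # (B, region_dim)
        # Feed the embedded previous word plus the attended visual context to the LSTM.
        h, c = self.lstm(torch.cat([self.embed(word_ids), context], dim=1), (h, c))
        return self.out(h), (h, c)
```

At each decoding step the attention weights are recomputed from the current hidden state, so the decoder can shift its focus to the regions and contextual cues most relevant to the next word, which is the behaviour the citation attributes to the attention-equipped LSTM.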
“…Cornia et al. [27] extend self-attention with additional "slots" to encode prior information. Sharma et al. [29] design an LRN that discovers the relationship between the objects and the image regions. Apart from this, BERT, a pre-trained model composed of deep bidirectional transformers, has achieved outstanding results.…”
Section: Image Captioning
confidence: 99%
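The "additional slots" idea attributed to Cornia et al. above can be sketched briefly: learnable key/value vectors are appended to the keys and values derived from the input, so attention can also retrieve prior knowledge that is not present in the current image. The module below is a hypothetical illustration (class name, slot count, and the use of nn.MultiheadAttention are assumptions), not the authors' code.

```python
import torch
import torch.nn as nn

class MemoryAugmentedAttention(nn.Module):
    """Sketch of self-attention extended with learned memory "slots".

    Hypothetical illustration of the idea described above, not the
    implementation of Cornia et al.
    """

    def __init__(self, dim=512, num_heads=8, num_slots=40):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learnable memory slots appended to the keys and values.
        self.mem_k = nn.Parameter(torch.randn(1, num_slots, dim) * 0.02)
        self.mem_v = nn.Parameter(torch.randn(1, num_slots, dim) * 0.02)

    def forward(self, x):
        # x: (B, N, dim) region features
        b = x.size(0)
        k = torch.cat([x, self.mem_k.expand(b, -1, -1)], dim=1)
        v = torch.cat([x, self.mem_v.expand(b, -1, -1)], dim=1)
        out, _ = self.attn(x, k, v)
        return out
```

The memory slots are shared across images and trained jointly with the rest of the network; queries still come only from the input, so the slots act purely as extra retrievable content encoding prior information.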
“…An important factor in vision-to-language issues like visual question answering and image captioning is the existence of textual information in real-world images, such as signs, descriptions, and promotional materials [5-9]. However, existing VQA methods [10-14] struggle to comprehend scene text, leading to a decline in performance when answering text-related questions. In order to efficiently address scene-text based questions, a VQA method must have the capability to recognize and understand scene-text, as well as reason based on the recognized scene-text.…”
Section: Introduction
confidence: 99%