2023
DOI: 10.1016/j.eswa.2022.118669
Image captioning for effective use of language models in knowledge-based visual question answering

Cited by 33 publications (9 citation statements)
References 6 publications
“…The advantage of this approach is that it can generate specific answers to specific questions. However, the limitations of this approach are that it requires large amounts of labeled data and it may not capture the context and complexity of the video content [38], [39], [40]. Video Retrieval using LLMs refers to the process of searching and retrieving relevant videos from a large video database using advanced language models.…”
Section: Methods (citation type: mentioning)
confidence: 99%
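The video retrieval idea described in the quoted passage can be pictured as embedding-and-rank matching between a text query and stored video descriptions. The sketch below is a minimal illustration only: the video IDs, captions, and the use of TF-IDF as a lightweight stand-in for a language-model text encoder are all assumptions, not details taken from the cited works.

```python
# Toy text-to-video retrieval: rank stored videos by how well their captions
# or transcripts match a text query. TfidfVectorizer is a simple stand-in
# for a language-model text encoder; captions and IDs are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

video_captions = {
    "vid_001": "a person repairs a bicycle tire in a garage",
    "vid_002": "a chef prepares pasta with tomato sauce",
    "vid_003": "children play football in a park",
}

def retrieve(query, captions, top_k=2):
    ids = list(captions)
    vectorizer = TfidfVectorizer().fit(list(captions.values()) + [query])
    doc_matrix = vectorizer.transform(captions.values())
    sims = cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]
    return sorted(zip(ids, sims), key=lambda p: p[1], reverse=True)[:top_k]

print(retrieve("someone fixing a bicycle tire", video_captions))
```

A production system would replace the TF-IDF step with embeddings from a pretrained language or vision-language model and index them for approximate nearest-neighbour search.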
“…Attention mechanism mainly focuses on selective actions/things related to tasks and ignores other irrelevant actions/things. Researchers are working on designing an effective attention-based neural network for vision-related applications such as fine-grained image recognition [107,108], image classification [109,110], image captioning [111,112], and vehicle re-identification [113]. The process of vehicle re-identification based on spatiotemporal attention is shown in Figure 9.…”
Section: Vehicle Re-identification Based on Attention Mechanism (citation type: mentioning)
confidence: 99%
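As a concrete picture of the selective weighting the quoted passage describes, the sketch below implements plain scaled dot-product attention in NumPy; the toy shapes and random inputs are illustrative assumptions rather than code from any of the cited works.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Similarity between each query and every key, scaled by sqrt(d_k)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax turns scores into weights that sum to 1 per query
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Output mixes the values; positions judged irrelevant get small weights
    return weights @ V, weights

# Toy example: 2 queries attending over 4 feature vectors of dimension 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(2, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)  # (2, 8) (2, 4)
```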
“…Nevertheless, the context-free nature of classic word embedding methods [69,70] can only serve a limited amount of cases, impeding generalization to scenarios when contextualization is necessary. Transformers leveraged on the linguistic side allowed further improvements on K-VQA models [71,72] paving the way for consequent end-to-end VL approaches.…”
Section: Visual Question Answering (VQA) (citation type: mentioning)
confidence: 99%
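To make the quoted contrast between context-free word embeddings and transformer representations concrete, the sketch below uses a toy vocabulary and a single self-attention pass; the words, dimensions, and random weights are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = {"river": 0, "bank": 1, "loan": 2}
E = rng.normal(size=(len(vocab), 4))          # static (context-free) embedding table
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))

def static_embed(tokens):
    # Context-free lookup: "bank" maps to the same vector in every sentence
    return E[[vocab[t] for t in tokens]]

def contextual_embed(tokens):
    # One self-attention pass mixes in neighbouring words, so the vector
    # for "bank" now depends on its context
    X = static_embed(tokens)
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(4)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ (X @ Wv)

s1, s2 = ["river", "bank"], ["bank", "loan"]
print(np.allclose(static_embed(s1)[1], static_embed(s2)[0]))          # True: identical vectors
print(np.allclose(contextual_embed(s1)[1], contextual_embed(s2)[0]))  # False: context matters
```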