2023
DOI: 10.1609/aaai.v37i9.26357

Locate Then Generate: Bridging Vision and Language with Bounding Box for Scene-Text VQA

Abstract: In this paper, we propose a novel multi-modal framework for Scene Text Visual Question Answering (STVQA), which requires models to read scene text in images in order to answer questions. Unlike text or visual objects, which can exist independently, scene text naturally links the text and visual modalities by conveying linguistic semantics while simultaneously being a visual object in an image. In contrast to conventional STVQA models, which take the linguistic semantics and visual semantics in scene text as …

Cited by 4 publications
References 38 publications (87 reference statements)