2020
DOI: 10.1609/aaai.v34i07.6794

Multi-Question Learning for Visual Question Answering

Abstract: Visual Question Answering (VQA) poses a great challenge to the computer vision and natural language processing communities. Most existing approaches consider video-question pairs individually during training. However, we observe that there are usually multiple questions (whether sequentially generated or not) for the target video in a VQA task, and that the questions themselves have abundant semantic relations. To explore these relations, we propose a new paradigm for VQA termed Multi-Question Learning (MQL). I…
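
At a high level, MQL regroups conventional per-question training data so that all questions targeting the same video are learned jointly. A minimal sketch of that regrouping step in Python (the toy records and field names here are illustrative assumptions, not the paper's data format):

```python
from collections import defaultdict

# Hypothetical records: each sample pairs one video id with a single
# question/answer, as in conventional per-pair VQA training.
samples = [
    {"video": "v1", "question": "What is the man holding?", "answer": "a cup"},
    {"video": "v1", "question": "Where is the man sitting?", "answer": "a bench"},
    {"video": "v2", "question": "How many dogs appear?", "answer": "two"},
]

# MQL-style regrouping: collect every question that targets the same video,
# so one training example exposes the model to their semantic relations.
grouped = defaultdict(list)
for s in samples:
    grouped[s["video"]].append((s["question"], s["answer"]))

for video, qa_pairs in grouped.items():
    print(video, qa_pairs)  # one multi-question example per video
```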

Citations: cited by 6 publications (6 citation statements)
References: 27 publications (56 reference statements)
“…CUPID [47] is a recent work studying the domain gap between video-language pre-training and fine-tuning. These pre-trained models have been applied in various downstream applications, e.g., action recognition [13], video captioning [6,29], video retrieval [43], and video question answering [11,16]. Video-language data has more complex semantic structures and relationships than image-language data, making video-language pre-training a much more challenging research problem.…”
Section: Related Work (mentioning, confidence: 99%)
“…However, there can be many questions addressed to a particular video, and most of them have abundant semantic relations. To explore these semantic relations, [26] proposes Multi-Question Learning (MQL), which learns multiple questions jointly with their candidate answers for a particular video sequence. These jointly learned representations of video and question can then be used to learn new questions.…”
Section: MQL (Lei et al. 2020) (mentioning, confidence: 99%)
“…These jointly learned representations of video and question can then be used to learn new questions. [26] introduces an efficient framework and training paradigm for MQL, where the relations between the video and the questions are modeled using an attention network. This framework enables the co-training of multiple video-question pairs.…”
Section: MQL (Lei et al. 2020) (mentioning, confidence: 99%)
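
A minimal PyTorch sketch of this co-training idea: K questions for one video first attend over the shared video features, then over each other, so their representations are learned jointly. The layer sizes and the two-stage attention layout are assumptions for illustration, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class MQLAttention(nn.Module):
    """Sketch: co-attend K questions over shared video features.

    Hyperparameters and layer choices are assumptions, not the
    published MQL architecture.
    """
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.video_to_q = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.q_to_q = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_feats, question_feats):
        # video_feats:    (B, T, dim) frame-level video features
        # question_feats: (B, K, dim) one pooled vector per question
        # Each question attends over the video frames...
        q_ctx, _ = self.video_to_q(question_feats, video_feats, video_feats)
        # ...and then over its sibling questions, sharing signal across them.
        q_joint, _ = self.q_to_q(q_ctx, q_ctx, q_ctx)
        return q_joint  # (B, K, dim), one co-trained vector per question

B, T, K, dim = 2, 16, 3, 256
model = MQLAttention(dim)
out = model(torch.randn(B, T, dim), torch.randn(B, K, dim))
print(out.shape)  # torch.Size([2, 3, 256])
```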
“…Li et al. (2019); Fan et al. (2019); Jiang and Han (2020) model the interaction between all pairs of token-level question representations and temporal-level features of the input video through a similarity matrix, memory networks, and graph networks, respectively. Le et al. (2019c, 2020b); Lei et al. (2020); Huang et al. (2020) extend the previous approach by dividing a video into equal segments, sub-sampling video frames, or considering object-level representations of the input video. We propose to replace token-level and global question representations with compositional question representations of specific entities and actions.…”
Section: Video-Language Understanding (mentioning, confidence: 99%)
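
The token-frame similarity matrix mentioned above can be sketched as follows; the cosine normalization and attention pooling are assumptions for illustration, and the cited works differ in how they consume the matrix:

```python
import torch
import torch.nn.functional as F

def cross_modal_similarity(question_tokens, frame_feats):
    """Token-frame similarity matrix for cross-modal interaction.

    question_tokens: (N, dim) token-level question representations
    frame_feats:     (T, dim) temporal-level video features
    Both are assumed to be projected into a shared dim beforehand.
    """
    # Cosine-normalized dot products give an (N, T) similarity matrix.
    q = F.normalize(question_tokens, dim=-1)
    v = F.normalize(frame_feats, dim=-1)
    sim = q @ v.t()                      # (N, T)

    # Row-wise softmax turns each token's similarities into attention
    # weights over frames; the weighted sum is a token-grounded video view.
    attn = sim.softmax(dim=-1)           # (N, T)
    grounded = attn @ frame_feats        # (N, dim)
    return sim, grounded

sim, grounded = cross_modal_similarity(torch.randn(8, 256), torch.randn(20, 256))
print(sim.shape, grounded.shape)  # torch.Size([8, 20]) torch.Size([8, 256])
```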