Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021
DOI: 10.18653/v1/2021.naacl-main.418
MIMOQA: Multimodal Input Multimodal Output Question Answering

Abstract: Multimodal research has picked up significantly in the space of question answering, with the task being extended to visual question answering, chart question answering, as well as multimodal input question answering. However, all these explorations produce a unimodal textual output as the answer. In this paper, we propose a novel task, MIMOQA (Multimodal Input Multimodal Output Question Answering), in which the output is also multimodal. Through human experiments, we empirically show that such multimodal outputs …

Cited by 16 publications (7 citation statements). References 30 publications.
“…MANYMODALQA (Hannan et al., 2020) requires reasoning over prior knowledge, images, and databases. MIMOQA (Singh et al., 2021b) is an example of multimodal responses, where answers are image-text pairs.…”
Section: Antol et al. (2015) and …
Citation type: mentioning; confidence: 99%
“…Even though there are some available multimodal QA datasets in non-clinical domains (Hannan et al., 2020; Chen et al., 2020; Talmor et al., 2021), there are no existing multimodal QA datasets that use structured together with unstructured EHR data to answer questions. There are some existing works in the clinical genre on multimodal understanding from text-image pairs (Moon et al., 2021; Khare et al., 2021; Li et al., 2020) as well as clinical QA (Singh et al., 2021) on text-image data. However, to the best of the authors' knowledge, there is so far no multimodal clinical dataset that incorporates structured and unstructured EHR data for QA.…”
Section: Related Work
Citation type: mentioning; confidence: 99%
“…Later, OK-VQA (Marino et al., 2019) enlarged VQA's scope to annotate questions requiring both image and implicit textual/common-sense knowledge to answer. More recently, MuMuQA (Reddy et al., 2021), ManyModalQA (Hannan et al., 2020) and MIMOQA (Singh et al., 2021) provide questions which require reasoning over images and explicitly provided text snippets. However, these datasets are restricted to dealing with given text and images without requiring any retrieval from the web: they are analogous to machine-reading approaches to QA from text like SQuAD, rather than open-book QA.…”
Section: Related Work
Citation type: mentioning; confidence: 99%