“…Previous efforts to train models to use explanations (Mishra et al., 2022), whether from scratch (Camburu et al., 2018; Lampinen, Roy, et al., 2022), through fine-tuning (Lampinen, Dasgupta, et al., 2022), or through conditioning on in-context prompts at evaluation time (Lu et al., 2022; Wei et al., 2022), have shown improved performance over models without explicit explanations. However, much of the existing literature remains largely empirical, with limited theoretical accounts of the phenomenon (Xie et al., 2021).…”