2023
DOI: 10.48550/arxiv.2302.00923
Preprint

Multimodal Chain-of-Thought Reasoning in Language Models

Abstract: Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. However, existing CoT studies have focused on the language modality. We propose Multimodal-CoT that incorporates language (text) and vision (images) modalities into a two-stage framework that separates rationale generation and answer inference. In this way, answer inference can leverage better generated…
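
The two-stage split the abstract describes can be made concrete with a short sketch. The snippet below is a minimal, text-only illustration assuming a T5 backbone loaded through Hugging Face transformers; the prompt format and the generate helper are assumptions for exposition, and the vision-feature fusion used by the actual Multimodal-CoT model is omitted.

```python
# Minimal sketch of two-stage Multimodal-CoT inference (text-only approximation).
# Assumptions: a T5 backbone via Hugging Face transformers; the prompt format
# below is illustrative, not the authors' released code, and the vision-feature
# fusion of the real model is omitted.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

def generate(prompt: str) -> str:
    # Greedy decoding helper shared by both stages.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=128)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

def multimodal_cot(question: str, context: str, options: str) -> str:
    # Stage 1: generate a rationale from the input.
    rationale = generate(
        f"Question: {question}\nContext: {context}\nOptions: {options}\nSolution:"
    )
    # Stage 2: infer the answer conditioned on the generated rationale,
    # which is what lets answer inference leverage a better rationale.
    return generate(
        f"Question: {question}\nContext: {context}\nOptions: {options}\n"
        f"Solution: {rationale}\nAnswer:"
    )
```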

Cited by 34 publications (50 citation statements)
References 16 publications

“…In contrast, LLaMA-Adapter can be easily switched to a multi-modal variant and achieves +10% higher accuracy. Besides, we notice that MM-CoT [59] is on par with our approach, but it relies on a complex two-stage inference. We believe our LLaMA-Adapter can also be boosted, and we leave the exploration of chain-of-thought for future research.…”
Section: Performance
confidence: 98%
“…In Table 2, we compare LLaMA-Adapter with popular VQA methods [2,9,17,18,22,23,28,53] and language models [5,16,59]. As shown, our single-modal variant ('LLaMA-Adapter T') attains 78.31% accuracy with 1.2M parameters.…”
Section: Performance
confidence: 99%
“…Reasoning is a crucial component of human intelligence that enables us to draw inferences, make decisions, and solve complex problems. However, even when trained on large-scale datasets, GAI models can still fail at commonsense reasoning tasks [256,257]. Recently, more and more researchers have begun to focus on this problem.…”
Section: Privacy
confidence: 99%
“…By incorporating this approach, large language models can achieve higher accuracy and better performance on tasks that require logical reasoning. CoT has also been applied to other areas such as vision-language question answering [257] and code generation [258]. However, how to construct these CoT prompts for specific tasks remains an open problem.…”
Section: Privacy
confidence: 99%
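
The open problem the excerpt ends on, how to construct CoT prompts for a given task, is easiest to see with a small example. The sketch below assembles a few-shot CoT prompt in the standard way; the exemplar content and the build_cot_prompt helper are hypothetical illustrations, not drawn from the cited papers.

```python
# Minimal sketch of few-shot CoT prompt construction. The exemplar and the
# helper name are hypothetical, chosen only to illustrate the pattern of
# prepending worked rationales so the model imitates step-by-step reasoning.
COT_EXEMPLARS = [
    {
        "question": "A shop sells pens at $2 each. How much do 4 pens cost?",
        "rationale": "Each pen costs $2, so 4 pens cost 4 * 2 = 8 dollars.",
        "answer": "$8",
    },
]

def build_cot_prompt(question: str) -> str:
    parts = [
        f"Q: {ex['question']}\nA: {ex['rationale']} The answer is {ex['answer']}."
        for ex in COT_EXEMPLARS
    ]
    # The trailing cue invites the model to produce its own reasoning chain.
    parts.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n\n".join(parts)

print(build_cot_prompt("A box holds 6 eggs. How many eggs are in 3 boxes?"))
```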
“…Specifically, [101] tackles VCR by captioning the image and then feeding the caption, together with the existing linguistic input, to the LLM. Another promising work in this direction introduces Multimodal-CoT, which does not use language as the mediating modality, proposing a two-stage process to separately infer the answer A and the rationale R, while stating that an LM with fewer than 1B parameters is adequate for state-of-the-art performance [102]. It is expected that the rapidly rising popularity of LLMs in complex linguistic QA reasoning [103] may soon give rise to more LLM-augmented VCR approaches, addressing more aspects of reasoning.…”
Section: Visual Commonsense Reasoning (VCR)
confidence: 99%
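
The caption-then-reason pipeline attributed to [101] above reduces to a short composition: turn the image into text, then let the LLM do all the reasoning. In the sketch below, caption_model and llm are hypothetical placeholders for any image captioner and any text-only language model.

```python
# Minimal sketch of the caption-mediated VCR pipeline attributed to [101]:
# the image is bridged into language by a captioner, and all reasoning then
# happens inside a text-only LLM. Both callables are hypothetical placeholders.
from typing import Callable

def vcr_via_captioning(
    image_path: str,
    question: str,
    caption_model: Callable[[str], str],  # image file -> textual description
    llm: Callable[[str], str],            # prompt -> answer with rationale
) -> str:
    caption = caption_model(image_path)
    prompt = (
        f"Image description: {caption}\n"
        f"Question: {question}\n"
        f"Answer the question and give a short rationale:"
    )
    return llm(prompt)
```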