2022
DOI: 10.48550/arxiv.2212.09662
Preprint

MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering

Cited by 4 publications (19 citation statements)
References 0 publications
“…As depicted in Figure 3, our approach diverged from LLaVA's methodology of using a fixed set of questions to prompt GPT-4 [9]. We observed a greater diversity in our question set, owing to their generation process conditioned on text context.…”
Section: Dataset Construction (citation type: mentioning)
Confidence: 89%
“…They are based on the Pix2Struct model, which was pre-trained on website visual understanding (from screenshot to HTML code) [8]. Matcha fine-tuned Pix2Struct on various datasets, such as Github IPython notebooks for [chart > code], a mix of PlotQA, web-crawled data, and Wikipedia tables for [chart > table], and math reasoning datasets for [image > answer] [9]. DePlot, in turn, was further fine-tuned on [chart > table] datasets, including ChartQA, to specialize in converting charts into linearized tables with titles, legends, and interpolated data point values [10].…”
Section: Chart VQA Expert Systems (citation type: mentioning)
Confidence: 99%
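The citation statement above describes DePlot's output as a linearized table with a title, legend headers, and data point values. As a rough illustration of what such a flattened chart-to-table string might look like, here is a minimal sketch; the `linearize_table` helper, the `TITLE |` prefix, and the pipe-separated cell layout are illustrative assumptions, not the model's documented output specification.

```python
# Hypothetical sketch of a linearized chart table in the style the citation
# statement describes: title first, then header cells, then one data row per
# line, with " | " as the cell separator. The exact format DePlot emits may
# differ; this only illustrates the idea of a chart flattened to text.

def linearize_table(title, header, rows):
    """Flatten a chart's underlying data table into a single string."""
    lines = [f"TITLE | {title}"]
    lines.append(" | ".join(header))
    for row in rows:
        lines.append(" | ".join(str(v) for v in row))
    return "\n".join(lines)

table = linearize_table(
    "Quarterly revenue",
    ["Quarter", "Revenue"],
    [("Q1", 10), ("Q2", 14)],
)
print(table)
```

A downstream language model can then answer questions against this plain-text table instead of the chart pixels, which is the division of labor the cited pipeline relies on.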