Chunk-aware Alignment and Lexical Constraint for Visual Entailment with Natural Language Explanations

Yang, Qian; Li, Yunxin; Hu, Baotian; Ma, Lin; Ding, Yuxing; Zhang, Min

doi:10.1145/3503161.3548284

Cited by 7 publications

(2 citation statements)

References 21 publications

“…So the other two methods that we use are not suitable for this task actually. But we believe that there are more effective multimodal fusion methods (Liu et al, 2023a; Li et al, 2023; Yang et al, 2022) waiting to be discovered. Also, we notice that images in the dataset vary widely, some feature only objects, but others contain significant text.…”

Section: Limitations (mentioning)

confidence: 99%

TILFA: A Unified Framework for Text, Image, and Layout Fusion in Argument Mining

Zong, Wang, et al. 2023

Proceedings of the 10th Workshop on Argument Mining

A main goal of Argument Mining (AM) is to analyze an author's stance. Unlike previous AM datasets focusing only on text, the shared task at the 10th Workshop on Argument Mining introduces a dataset including both text and images. Importantly, these images contain both visual elements and optical characters. Our new framework, TILFA (A Unified Framework for Text, Image, and Layout Fusion in Argument Mining), is designed to handle this mixed data. It excels at not only understanding text but also detecting optical characters and recognizing layout details in images. Our model significantly outperforms existing baselines, earning our team, KnowComp, the 1st place in the leaderboard of the Argumentative Stance Classification subtask in this shared task.

“…In contrast, text explanations formulated in (Park et al 2018) are conducted on the VQA-NLE datasets and it utilizes human annotations to inspire the decision-making process of VQA models. (Kayser et al 2021) combines a pre-trained language model and a VL model to generate free-text explanations while (Yang et al 2022) uses stronger VL models (Li et al 2020a) and generation models (Radford et al 2019).…”

Section: Related Work (mentioning)

confidence: 99%

Towards More Faithful Natural Language Explanation Using Multi-Level Contrastive Learning in VQA

Lai, Song, Meng, et al. 2024

AAAI

Natural language explanation in visual question answer (VQA-NLE) aims to explain the decision-making process of models by generating natural language sentences to increase users' trust in the black-box systems. Existing post-hoc methods have achieved significant progress in obtaining a plausible explanation. However, such post-hoc explanations are not always aligned with human logical inference, suffering from the issues on: 1) Deductive unsatisfiability, the generated explanations do not logically lead to the answer; 2) Factual inconsistency, the model falsifies its counterfactual explanation for answers without considering the facts in images; and 3) Semantic perturbation insensitivity, the model can not recognize the semantic changes caused by small perturbations. These problems reduce the faithfulness of explanations generated by models. To address the above issues, we propose a novel self-supervised Multi-level Contrastive Learning based natural language Explanation model (MCLE) for VQA with semantic-level, image-level, and instance-level factual and counterfactual samples. MCLE extracts discriminative features and aligns the feature spaces from explanations with visual question and answer to generate more consistent explanations. We conduct extensive experiments, ablation analysis, and case study to demonstrate the effectiveness of our method on two VQA-NLE benchmarks.

S³C: Semi-Supervised VQA Natural Language Explanation via Self-Critical Learning

Suo, Sun, Liu, et al. 2023

2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
