Proceedings of the 30th ACM International Conference on Multimedia 2022
DOI: 10.1145/3503161.3548284

Chunk-aware Alignment and Lexical Constraint for Visual Entailment with Natural Language Explanations

Cited by 7 publications (2 citation statements). References 21 publications.
“…So the other two methods that we use are not suitable for this task actually. But we believe that there are more effective multimodal fusion methods (Liu et al., 2023a; Li et al., 2023; Yang et al., 2022) waiting to be discovered. Also, we notice that images in the dataset vary widely, some feature only objects, but others contain significant text.…”
Section: Limitations
mentioning, confidence: 99%
“…In contrast, text explanations formulated in (Park et al. 2018) are conducted on the VQA-NLE datasets and it utilizes human annotations to inspire the decision-making process of VQA models. (Kayser et al. 2021) combines a pre-trained language model and a VL model to generate free-text explanations while (Yang et al. 2022) uses stronger VL models (Li et al. 2020a) and generation models (Radford et al. 2019).…”
Section: Related Work
mentioning, confidence: 99%