Automatic Generation of Contrast Sets from Scene Graphs: Probing the Compositional Consistency of GQA

Bitton, Yonatan; Stanovsky, Gabriel; Schwartz, Roy; Elhadad, Michael

doi:10.18653/v1/2021.naacl-main.9

Cited by 21 publications

(17 citation statements)

References 21 publications

“…6 Results specific concepts, corroborating the findings of (Bitton et al, 2021). Interestingly, the best performing model (LXMERT) is not always the most consistent.…”

Section: Metrics (supporting)

confidence: 65%

Section: Consistency As Model Comprehension (mentioning)

confidence: 99%

“…Some recent work has sought to evaluate models using consistency and other metrics (Hudson and Manning, 2019;Shah et al, 2019;Ribeiro et al, 2020a;Selvaraju et al, 2020;Bitton et al, 2021). These evaluations often evaluate consistency through question entailment and implication, or simply contrasting examples in the case of (Bitton et al, 2021). While we consider such methods important for evaluating model comprehension, they often combine question types and capabilities, changing the kind of expected answer, or evaluating consistency on a tree or set of entailed questions.…”

Section: Consistency As Model Comprehension (mentioning)

confidence: 99%

CARETS: A Consistency And Robustness Evaluative Test Suite for VQA

Jiménez, Russakovsky, Narasimhan

2022

Preprint

We introduce CARETS, a systematic test suite to measure consistency and robustness of modern VQA models through a series of six fine-grained capability tests. In contrast to existing VQA test sets, CARETS features balanced question generation to create pairs of instances to test models, with each pair focusing on a specific capability such as rephrasing, logical symmetry or image obfuscation. We evaluate six modern VQA systems on CARETS and identify several actionable weaknesses in model comprehension, especially with concepts such as negation, disjunction, or hypernym invariance. Interestingly, even the most sophisticated models are sensitive to aspects such as swapping the order of terms in a conjunction or changing the number of answer choices mentioned in the question. We release CARETS to be used as an extensible tool for evaluating multimodal model robustness.

“…We generate perturbations at the level of the underlying reasoning process, in the context of QA. Last, Bitton et al (2021) used scene graphs to generate examples for visual QA. However, they assumed the existence of gold scene graph at the input.…”

Section: Related Work (mentioning)

confidence: 99%

“…Recently, methods for automatic generation of contrast sets were proposed. However, current methods are restricted to shallow surface perturbations (Mille et al, 2021), specific reasoning skills, or rely on expensive annotations (Bitton et al, 2021). Thus, automatic generation of examples that test high-level reasoning abilities of models and their robustness to fine semantic distinctions remains an open challenge.…”

Section: Introduction (mentioning)

confidence: 99%

Break, Perturb, Build: Automatic Perturbation of Reasoning Paths Through Question Decomposition

Geva, Wolfson, Berant

2022

Transactions of the Association for Computational Linguistics

Recent efforts to create challenge benchmarks that test the abilities of natural language understanding models have largely depended on human annotations. In this work, we introduce the “Break, Perturb, Build” (BPB) framework for automatic reasoning-oriented perturbation of question-answer pairs. BPB represents a question by decomposing it into the reasoning steps that are required to answer it, symbolically perturbs the decomposition, and then generates new question-answer pairs. We demonstrate the effectiveness of BPB by creating evaluation sets for three reading comprehension (RC) benchmarks, generating thousands of high-quality examples without human intervention. We evaluate a range of RC models on our evaluation sets, which reveals large performance gaps on generated examples compared to the original data. Moreover, symbolic perturbations enable fine-grained analysis of the strengths and limitations of models. Last, augmenting the training data with examples generated by BPB helps close the performance gaps, without any drop on the original data distribution.

Rethinking Data Augmentation for Robust Visual Question Answering

Chen, Zheng, Xiao

2022

Lecture Notes in Computer Science

Being widely used in learning unbiased visual question answering (VQA) models, Data Augmentation (DA) helps mitigate language biases by generating extra training samples beyond the original samples. While today's DA methods can generate robust samples, the augmented training set, significantly larger than the original dataset, often exhibits redundancy in terms of difficulty or content repetition, leading to inefficient model training and even compromising the model performance. To this end, we design an Effective Curriculum Learning strategy ECL to enhance DA-based VQA methods. Intuitively, ECL trains VQA models on relatively "easy" samples first, and then gradually changes to "harder" samples, and less-valuable samples are dynamically removed. Compared to training on the entire augmented dataset, our ECL strategy can further enhance VQA models' performance with fewer training samples. Extensive ablations have demonstrated the effectiveness of ECL on various methods.
