Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality

Diwan, Anuj; Berry, Layne; Choi, Eunsol; Harwath, David; Mahowald, Kyle

doi:10.48550/arxiv.2211.00768

Cited by 2 publications

(6 citation statements)

References 0 publications

“…ARO (Yuksekgonul et al 2023) similarly tests visiolinguistic reasoning and consists of three types of tasks: (i) Visual Genome Attribution to test the understanding of object properties; (ii) Visual Genome Relation to test for relational understanding between objects; and (iii) COCO-Order and Flickr30k-Order to test for order sensitivity of the words in a text, when performing image-text matching. We highlight that Winoground, though slightly smaller in size than ARO, is more challenging as it requires reasoning beyond visio-linguistic compositional knowledge (Diwan et al 2022).…”

Section: Benchmark Datasets (mentioning)

confidence: 99%

“…Image-text models that have been contrastively trained on internet-scale data, such as CLIP (Radford et al 2021a), have been shown to have strong zero-shot classification capabilities. However, recent works (Thrush et al 2022; Diwan et al 2022) have highlighted their limitations in visio-linguistic reasoning, as shown in the challenging Winoground benchmark. Yuksekgonul et al (2023) also observe this issue and introduce a new benchmark, ARO, for image-text models, which requires a significant amount of visio-linguistic reasoning to solve.…”

Section: Related Work (mentioning)

confidence: 99%

“…Winoground (Thrush et al 2022;Diwan et al 2022) is a challenging vision-language dataset for evaluating the visiolinguistic characteristics of contrastively trained image-text models. The dataset consists of 400 tasks, where each task consists of two image-text pairs.…”

Section: Benchmark Datasets (mentioning)

confidence: 99%
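
The task structure quoted above (two image-text pairs per task) is scored in the original Winoground paper (Thrush et al 2022) with a text score, an image score, and a group score. The following is a minimal sketch of those three checks for a single task. It assumes a hypothetical table `s` that maps `(caption index, image index)` to a model similarity score; the function names and the example numbers are illustrative only and do not come from any particular library or model.

```python
from typing import Dict, Tuple

# Hypothetical container: (caption index, image index) -> similarity score.
Scores = Dict[Tuple[int, int], float]


def text_correct(s: Scores) -> bool:
    # For each image, the matching caption must score higher than the other caption.
    return s[(0, 0)] > s[(1, 0)] and s[(1, 1)] > s[(0, 1)]


def image_correct(s: Scores) -> bool:
    # For each caption, the matching image must score higher than the other image.
    return s[(0, 0)] > s[(0, 1)] and s[(1, 1)] > s[(1, 0)]


def group_correct(s: Scores) -> bool:
    # A task counts for the group score only if both checks pass.
    return text_correct(s) and image_correct(s)


# Made-up scores for one task (illustrative values, not real model output).
example: Scores = {(0, 0): 0.90, (1, 0): 0.80, (0, 1): 0.95, (1, 1): 0.97}
print(text_correct(example), image_correct(example), group_correct(example))
```

Averaging each check over all tasks would give the three benchmark-level scores; the sketch covers only the per-task decision.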

“…In the past few years, image-text contrastively pre-trained multimodal models such as CLIP (Radford et al 2021a) have shown tremendous ability to perform zero-shot classification (Mu et al 2021; Minderer et al 2022), image-text retrieval (Diwan et al 2022; Thrush et al 2022) and image-captioning (Yu et al 2022; Li et al 2022; Mokady, Hertz, and Bermano 2021). These contrastive models are also used as a part of various state-of-the-art pipelines for downstream tasks such as segmentation (Wang et al 2021; Lüddecke and Ecker 2021), object-detection (Minderer et al 2022; Zhong et al 2021) and model interpretability (Moayeri et al 2023).…”

Section: Introduction (mentioning)

confidence: 99%

“…However, recent works have shown that these models fail on visio-linguistic reasoning tasks, for example identifying the relative position between objects in an image. In fact, the performance of CLIP on Winoground (Thrush et al 2022; Diwan et al 2022), a challenging benchmark for…”

Section: Introduction (mentioning)

confidence: 99%

Research in information systems at the University of Maryland

Basu, Hevner

Twenty-Third Annual Hawaii International Conference on System Sciences

Few-shot classification (FSC) entails learning novel classes given only a few examples per class after a pretraining (or meta-training) phase on a set of base classes. Recent works have shown that simply fine-tuning a pretrained Vision Transformer (ViT) on new test classes is a strong approach for FSC. Fine-tuning ViTs, however, is expensive in time, compute and storage. This has motivated the design of parameter efficient fine-tuning (PEFT) methods which fine-tune only a fraction of the Transformer's parameters. While these methods have shown promise, inconsistencies in experimental conditions make it difficult to disentangle their advantage from other experimental factors including the feature extractor architecture, pre-trained initialization and fine-tuning algorithm, amongst others. In our paper, we conduct a large-scale, experimentally consistent, empirical analysis to study PEFTs for few-shot image classification. Through a battery of over 1.8k controlled experiments on large-scale few-shot benchmarks including META-DATASET (MD) and ORBIT, we uncover novel insights on PEFTs that cast light on their efficacy in finetuning ViTs for few-shot classification. Through our controlled empirical study, we have two main findings: (i) Finetuning just the LayerNorm parameters (which we call LN-TUNE) during few-shot adaptation is an extremely strong baseline across ViTs pre-trained with both self-supervised and supervised objectives, (ii) For self-supervised ViTs, we find that simply learning a set of scaling parameters for each attention matrix (which we call ATTNSCALE) along with a domain-residual adapter (DRA) module leads to state-of-the-art performance (while being ∼ 9× more parameter-efficient) on MD. Our extensive empirical findings set strong baselines and call for rethinking the current design of PEFT methods for FSC.

When are Lemons Purple? The Concept Association Bias of Vision-Language Models

Tang, Yamada, Zhang, et al. 2023

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Large-scale vision-language models such as CLIP have shown impressive performance on zero-shot image classification and image-to-text retrieval. However, such performance does not carry over to tasks that require a finer-grained correspondence between vision and language, such as Visual Question Answering (VQA). As a potential cause of the difficulty of applying these models to VQA and similar tasks, we report an interesting phenomenon of vision-language models, which we call the Concept Association Bias (CAB). We find that models with CAB tend to treat input as a bag of concepts and attempt to fill in the other missing concept cross-modally, leading to an unexpected zero-shot prediction. We demonstrate CAB by showing that CLIP's zero-shot classification performance greatly suffers when there is a strong concept association between an object (e.g. eggplant) and an attribute (e.g. color purple). We also show that the strength of CAB predicts the performance on VQA. We observe that CAB is prevalent in vision-language models trained with contrastive losses, even when autoregressive losses are jointly employed. However, a model that solely relies on autoregressive loss seems to exhibit minimal or no signs of CAB.
