Large-Scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline

Murahari, Vishvak; Batra, Dhruv; Parikh, Devi; Das, Abhishek

doi:10.1007/978-3-030-58523-5_20

Cited by 96 publications

(109 citation statements)

References 40 publications

Supporting

Mentioning: 108

Contrasting

Unclassified


“…The VQA models we consider are BUTD (Anderson et al, 2018), BAN (Kim et al, 2018), Pythia (Jiang et al, 2018) and VisualBERT (Li et al, 2019). For VisDial we use FGA (Schwartz et al, 2019) and VisDial-BERT (Murahari et al, 2020). We trained all the models using their official implementations.…”

Section: Methods (mentioning)

confidence: 99%

Are VQA Systems RAD? Measuring Robustness to Augmented Data with Focused Interventions

Rosenberg, Gat, Feder, et al. 2021

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Confer


Deep learning algorithms have shown promising results in visual question answering (VQA) tasks, but a more careful look reveals that they often do not understand the rich signal they are being fed with. To understand and better measure the generalization capabilities of VQA systems, we look at their robustness to counterfactually augmented data. Our proposed augmentations are designed to make a focused intervention on a specific property of the question such that the answer changes. Using these augmentations, we propose a new robustness measure, Robustness to Augmented Data (RAD), which measures the consistency of model predictions between original and augmented examples. Through extensive experimentation, we show that RAD, unlike classical accuracy measures, can quantify when state-of-the-art systems are not robust to counterfactuals. We find substantial failure cases which reveal that current VQA systems are still brittle. Finally, we connect between robustness and generalization, demonstrating the predictive power of RAD for performance on unseen augmentations.


“…However, they mainly focus on textual tasks. They cannot effectively deal with the multi-modal tasks, such as image-text retrieval, image captioning, multimodal machine translation (Lin et al, 2020a; Su et al, 2021) and visual dialog (Murahari et al, 2020).…”

Section: Text Enhance Vision (mentioning)

confidence: 99%

UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning

Li, Gao, Niu, et al. 2021

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Confer



Existed pre-training methods either focus on single-modal tasks or multi-modal tasks, and cannot effectively adapt to each other. They can only utilize single-modal data (i.e., text or image) or limited multi-modal data (i.e., image-text pairs). In this work, we propose a UNIfied-MOdal pre-training architecture, namely UNIMO, which can effectively adapt to both single-modal and multi-modal understanding and generation tasks. Large scale of free text corpus and image collections are utilized to improve the capability of visual and textual understanding, and cross-modal contrastive learning (CMCL) is leveraged to align the textual and visual information into a unified semantic space, over a corpus of image-text pairs augmented with related images and texts. With the help of rich non-paired single-modal data, our model is able to learn more generalizable representations, by allowing textual knowledge and visual knowledge to enhance each other in the unified semantic space. The experimental results show that UNIMO greatly improves the performance of several single-modal and multi-modal downstream tasks. Our code and pre-trained models are public at https:


“…Grounded embeddings are used for many consequential tasks in natural language processing, like visual dialog (Murahari et al, 2019) and visual question answering (Hu et al, 2019). Many real-world tasks such as scanning documents and interpreting images in context employ joint embeddings as the performance gains are significant over using separate embeddings for each modality.…”

Section: Introduction (mentioning)

confidence: 99%

Measuring Social Biases in Grounded Vision and Language Embeddings

Ross, Katz, Barbu 2021

Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Langua


We generalize the notion of measuring social biases in word embeddings to visually grounded word embeddings. Biases are present in grounded embeddings, and indeed seem to be equally or more significant than for ungrounded embeddings. This is despite the fact that vision and language can suffer from different biases, which one might hope could attenuate the biases in both. Multiple ways exist to generalize metrics measuring bias in word embeddings to this new setting. We introduce the space of generalizations (Grounded-WEAT and Grounded-SEAT) and demonstrate that three generalizations answer different yet important questions about how biases, language, and vision interact. These metrics are used on a new dataset, the first for grounded bias, created by augmenting standard linguistic bias benchmarks with 10,228 images from COCO, Conceptual Captions, and Google Images. Dataset construction is challenging because vision datasets are themselves very biased. The presence of these biases in systems will begin to have real-world consequences as they are deployed, making carefully measuring bias and then mitigating it critical to building a fair society.
