VinVL: Revisiting Visual Representations in Vision-Language Models

Zhang, Pengchuan; Li, Xiujun; Hu, Xiaowei; Yang, Jianwei; Zhang, Lei; Wang, Lijuan; Choi, Yejin; Gao, Jianfeng

doi:10.1109/cvpr46437.2021.00553

Cited by 664 publications

(449 citation statements)

References 24 publications

Supporting

Mentioning

446

Contrasting

Order By: Relevance

“…They tried to build adaptive captioning models that could work well with multilanguages instead of only specific ones. About captioning models, some studies have recently enhanced performance based on BERT-based models [15,28,29] are also promising.…”

Section: ) Previous Approachesmentioning

confidence: 99%

“…Besides using the Bottom-up Top-down architecture for extracting visual objects from an image, we explore two more pre-trained models named RelDN and VinVL [15], respectively.…”

Section: Objects Representation Explorationmentioning

confidence: 99%

“…2) We propose the Grid features augmentation module that augments global context of the image by combining grid features with visual objects. 3) We conduct the experiment with other visual objects extractor: RelDN [14] and VinVL [15]. 4) Our proposed approach achieves better results than the M4C-Captioner baseline and competitive results with the current state-of-the-art methods.…”

Section: Introductionmentioning

confidence: 99%

See 2 more Smart Citations

EAES: Effective Augmented Embedding Spaces for Text-Based Image Captioning

et al. 2022

View full text Add to dashboard Cite

Text-based Image Captioning has been a novel problem since 2020. This topic remains challenging because it requires the model to comprehend not only the visual context but also the scene texts that appear in an image. Therefore, the ways image and scene texts are embedded into the main model for training is crucial. Based on the M4C-Captioner model, this paper proposes the simple but effective EAES embedding module for effectively embedding images and scene texts into the multimodal Transformer layers. In detail, our EAES module contains two significant sub-modules: Objects-augmented and Grid feature augmentation. With the Objects-augmented module, we provide the relative geometry feature, representing the relation between objects and between OCR tokens. Furthermore, we extract the grid feature for an image with the Grid feature augmentation module and combine it with visual objects, which help the model focus on both salient objects and the general context of an image, leading to better performance. We use the TextCaps dataset as the benchmark to prove the effectiveness of our approach on five standard metrics: BLEU4, METEOR, ROUGE-L, SPICE and CIDEr. Without bells and whistles, our method achieves 20.21% on the BLEU4 metric and 85.78% on the CIDEr metric, 1.31% and 4.78% higher, respectively, than the baseline M4C-Captioner method. Furthermore, the results are incredibly competitive with other methods on METEOR, ROUGE-L and SPICE metrics. INDEX TERMS image captioning, text-based image captioning, bottom-up top-down, grid feature, multimodal transformer, m4c

show abstract

Section: ) Previous Approachesmentioning

confidence: 99%

“…Besides using the Bottom-up Top-down architecture for extracting visual objects from an image, we explore two more pre-trained models named RelDN and VinVL [15], respectively.…”

Section: Objects Representation Explorationmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

See 1 more Smart Citation

EAES: Effective Augmented Embedding Spaces for Text-Based Image Captioning

et al. 2022

View full text Add to dashboard Cite

show abstract

“…• image texts, object labels, scene texts • object visual features, scene visual features Object labels and features are extracted from VinVL(Revisiting Visual Representations in Vision-Language Models) [3], which can generate representations of a richer collection of visual objects and concepts. Scene text and scene visual features are mainly coming from public Microsoft OCR API.…”

Section: Pre-training Strategymentioning

confidence: 99%

Winner Team Mia at TextVQA Challenge 2021: Vision-and-Language Representation Learning with Pre-trained Sequence-to-Sequence Model

Qiao,

Chen,

Wang

et al. 2021

Preprint

View full text Add to dashboard Cite

TextVQA requires models to read and reason about text in images to answer questions about them. Specifically, models need to incorporate a new modality of text present in the images and reason over it to answer TextVQA questions. In this challenge, we use generative model T5 for TextVQA task. Based on pre-trained checkpoint T5-3B from Hug-gingFace repository, two other pre-training tasks including masked language modeling(MLM) and relative position prediction(RPP) are designed to better align object feature and scene text. In the stage of pre-training, encoder is dedicate to handle the fusion among multiple modalities: question text, object text labels, scene text labels, object visual features, scene visual features. After that decoder generates the text sequence step-by-step, cross entropy loss is required by default. We use a large-scale scene text dataset in pre-training and then fine-tune the T5-3B with the TextVQA dataset only.

show abstract

“…In addition to the main results in Table 3, we also evaluated existing OSCAR [35] / VinVL [57] models on AdVQA, since both models are known to perform well on VQA. Note that the training set of the off-the-shelf OSCAR and VinVL models includes COCO val2014 data, which overlaps with our validation set (COCO 2017).…”

Section: A Training Detailsmentioning

confidence: 99%

Human-Adversarial Visual Question Answering

Sheng,

Singh,

Goswami

et al. 2021

Preprint

View full text Add to dashboard Cite

Performance on the most commonly used Visual Question Answering dataset (VQA v2) is starting to approach human accuracy. However, in interacting with state-of-the-art VQA models, it is clear that the problem is far from being solved. In order to stress test VQA models, we benchmark them against human-adversarial examples. Human subjects interact with a state-of-the-art VQA model, and for each image in the dataset, attempt to find a question where the model's predicted answer is incorrect. We find that a wide range of state-of-the-art models perform poorly when evaluated on these examples. We conduct an extensive analysis of the collected adversarial examples and provide guidance on future research directions. We hope that this Adversarial VQA (AdVQA) benchmark can help drive progress in the field and advance the state of the art. 3

show abstract

VinVL: Revisiting Visual Representations in Vision-Language Models

Cited by 664 publications

References 24 publications

EAES: Effective Augmented Embedding Spaces for Text-Based Image Captioning

EAES: Effective Augmented Embedding Spaces for Text-Based Image Captioning

Winner Team Mia at TextVQA Challenge 2021: Vision-and-Language Representation Learning with Pre-trained Sequence-to-Sequence Model

Human-Adversarial Visual Question Answering

Contact Info

Product

Resources

About