2021
DOI: 10.48550/arxiv.2106.13488
Preprint

Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training

Abstract: Vision-Language Pre-training (VLP) aims to learn multi-modal representations from image-text pairs and serves downstream vision-language tasks in a fine-tuning fashion. The dominant VLP models adopt a CNN-Transformer architecture, which embeds images with a CNN and then aligns images and text with a Transformer. Visual relationships between visual contents play an important role in image understanding and are the basis for inter-modal alignment learning. However, CNNs have limitations in visual relation learning…
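
The abstract contrasts CNN grid embedding with self-attention-based visual parsing ahead of cross-modal alignment. The following minimal PyTorch sketch is not the paper's implementation; all module names, depths, and dimensions are illustrative assumptions. It shows the two visual embedding routes and a cross-modal attention layer that lets text tokens attend to the resulting visual tokens.

```python
# Minimal sketch (illustrative, not the paper's code): CNN grid embedding
# vs. a ViT-style self-attention visual parser, both feeding cross-modal fusion.
import torch
import torch.nn as nn

class CNNGridEmbedder(nn.Module):
    """Embeds an image as a grid of conv features (no explicit relation modeling)."""
    def __init__(self, dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=16, stride=16),  # coarse 16x16 grid cells
            nn.ReLU(),
        )

    def forward(self, img):                      # img: (B, 3, H, W)
        grid = self.conv(img)                    # (B, dim, H/16, W/16)
        return grid.flatten(2).transpose(1, 2)   # (B, N, dim) visual tokens

class SelfAttentionVisualParser(nn.Module):
    """ViT-style embedder: patch projection plus Transformer self-attention,
    so relations between visual tokens are modeled explicitly."""
    def __init__(self, dim=256, patch=16, depth=2, heads=4):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, img):
        tokens = self.patch_embed(img).flatten(2).transpose(1, 2)
        return self.encoder(tokens)              # relation-aware visual tokens

# Cross-modal fusion: text tokens attend to visual tokens.
fusion = nn.TransformerDecoderLayer(d_model=256, nhead=4, batch_first=True)

img = torch.randn(2, 3, 224, 224)
txt = torch.randn(2, 12, 256)                    # pre-embedded text tokens (assumed)
grid = CNNGridEmbedder()(img)                    # (2, 196, 256) grid tokens
vis = SelfAttentionVisualParser()(img)           # (2, 196, 256) relation-aware tokens
out = fusion(tgt=txt, memory=vis)                # (2, 12, 256) fused text representation
```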

Cited by 3 publications (2 citation statements)
References 42 publications
“…LXMERT [27] refines the cross-modal part to achieve advanced performance in downstream tasks. More recent models employing a double-stream fusion encoder are ALBEF [88], Visual Parsing [89] and WenLan [90]. In general, double-stream encoders demand training of two transformer models (one for each stream), which is computationally inefficient.…”
Section: Double-stream Fusion Encoder
Mentioning confidence: 99%
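
The statement above notes that double-stream fusion encoders require training two transformer models, one per modality. A minimal sketch of that idea (illustrative only; not ALBEF, Visual Parsing, or WenLan code, and all sizes are assumptions) follows: each modality gets its own Transformer stream, then a cross-attention step fuses them.

```python
# Double-stream fusion encoder sketch: two modality-specific transformers
# plus a cross-modal fusion step (illustrative assumptions throughout).
import torch
import torch.nn as nn

def make_stream(dim=256, depth=2, heads=4):
    layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
    return nn.TransformerEncoder(layer, depth)

vision_stream = make_stream()    # transformer #1: encodes visual tokens
text_stream = make_stream()      # transformer #2: encodes text tokens
fusion = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

vis = vision_stream(torch.randn(2, 196, 256))          # per-modality encoding
txt = text_stream(torch.randn(2, 12, 256))
fused_txt, _ = fusion(query=txt, key=vis, value=vis)   # cross-modal fusion
```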
“…The main difference between patch and grid features is that grid features are extracted from the feature map of a convolutional model while patch features directly utilize a linear projection. Patch features were first introduced by Vision Transformer (ViT) (Dosovitskiy et al., 2021a) and then adopted by VLP models (Xue et al., 2021). The advantage of using patch features is efficiency.…”
Section: B Modality Embedding
Mentioning confidence: 99%
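
The distinction drawn above can be made concrete with a short sketch (illustrative only; backbone layers, patch size, and dimensions are assumptions): grid features flatten the spatial grid of a convolutional feature map, while patch features apply one linear projection to each flattened image patch, skipping the convolutional backbone entirely, which is where the efficiency advantage comes from.

```python
# Patch features vs. grid features (illustrative sketch, not a specific model).
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)
dim, patch = 256, 16

# Grid features: flatten the spatial grid of a convolutional feature map.
backbone = nn.Sequential(
    nn.Conv2d(3, dim, 3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(dim, dim, 3, stride=8, padding=1),
)
grid_tokens = backbone(img).flatten(2).transpose(1, 2)           # (1, 196, 256)

# Patch features: split into 16x16 patches and apply a single linear
# projection to each flattened patch (ViT-style).
patches = img.unfold(2, patch, patch).unfold(3, patch, patch)    # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, 3 * patch * patch)
patch_tokens = nn.Linear(3 * patch * patch, dim)(patches)        # (1, 196, 256)
```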