2020
DOI: 10.1007/978-3-030-58523-5_20

Large-Scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline


Cited by 96 publications (109 citation statements)
References 40 publications
“…The VQA models we consider are BUTD (Anderson et al., 2018), BAN (Kim et al., 2018), Pythia (Jiang et al., 2018) and VisualBERT (Li et al., 2019). For VisDial we use FGA (Schwartz et al., 2019) and VisDial-BERT (Murahari et al., 2020). We trained all the models using their official implementations.…”
Section: Methods (mentioning)
Confidence: 99%
“…However, they mainly focus on textual tasks. They cannot effectively deal with the multi-modal tasks, such as image-text retrieval, image captioning, multimodal machine translation (Lin et al., 2020a; Su et al., 2021) and visual dialog (Murahari et al., 2020).…”
Section: Text Enhance Vision (mentioning)
Confidence: 99%
“…Grounded embeddings are used for many consequential tasks in natural language processing, like visual dialog (Murahari et al., 2019) and visual question answering (Hu et al., 2019). Many real-world tasks such as scanning documents and interpreting images in context employ joint embeddings as the performance gains are significant over using separate embeddings for each modality.…”
Section: Introduction (mentioning)
Confidence: 99%