Findings of the Association for Computational Linguistics: EMNLP 2021
DOI: 10.18653/v1/2021.findings-emnlp.293
MURAL: Multimodal, Multitask Representations Across Languages

Abstract: Both image-caption pairs and translation pairs provide the means to learn deep representations of and connections between languages. We use both types of pairs in MURAL (MUltimodal, MUltitask Representations Across Languages), a dual encoder that solves two tasks: 1) image-text matching and 2) translation pair matching. By incorporating billions of translation pairs, MURAL extends ALIGN (Jia et al., 2021)-a state-of-the-art dual encoder learned from 1.8 billion noisy image-text pairs. When using the same encod…
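The abstract describes a dual encoder trained on two matching tasks at once: image-text pairs and translation pairs. A minimal sketch of that multitask setup, using an in-batch softmax contrastive loss (all array names and the temperature value are illustrative assumptions, not details from the paper):

```python
import numpy as np

def contrastive_loss(a, b, temperature=0.07):
    """In-batch softmax contrastive loss between two sets of embeddings.
    Row i of `a` is the positive match for row i of `b`."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature          # (n, n) similarity matrix
    labels = np.arange(len(a))              # true pairs sit on the diagonal

    def xent(l):
        # numerically stable cross-entropy against the diagonal labels
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # symmetric: retrieve b given a, and a given b
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
img, cap = rng.normal(size=(8, 64)), rng.normal(size=(8, 64))   # image-text batch
src, tgt = rng.normal(size=(8, 64)), rng.normal(size=(8, 64))   # translation batch

# multitask objective: image-text matching + translation-pair matching
loss = contrastive_loss(img, cap) + contrastive_loss(src, tgt)
```

Summing the two per-task losses is one simple way to combine them; the actual weighting and batching scheme used by MURAL is not specified in the excerpt above.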


Cited by 37 publications (28 citation statements)
References 29 publications
“…Multi-modal multi-task learning. MURAL [30] extends ALIGN to the multi-lingual setting and introduces a crosslingual objective to improve multi-lingual image and text retrieval. Concurrently to this work, DeCLIP [40] adds several additional training objectives and more data collected in-house to CLIP in order to improve data efficiency.…”
Section: Related Work
confidence: 99%
“…, r n ], we construct an m×n logit matrix A, where A i,j represents the compatibility of landmark phrase t i and frame r j . Logits are computed by combining signals from MURAL-large [31] -a highperforming multilingual, multimodal dual encoder trained on a mixture of 1.8b noisy image-text pairs and 6b translation pairs -and the RxR text timestamps, i.e. :…”
Section: Bootstrapping a Landmark Dataset
confidence: 99%
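The citation above builds an m×n logit matrix A where A[i, j] scores the compatibility of landmark phrase t_i with frame r_j, using MURAL-large embeddings. A hedged sketch of that scoring step as cosine similarity between the two embedding sets (the function name, the use of plain cosine similarity, and the random placeholder embeddings are assumptions; the cited work also mixes in RxR text-timestamp signals, which are omitted here):

```python
import numpy as np

def phrase_frame_logits(phrase_emb, frame_emb):
    """Cosine-similarity logits between m landmark phrases and n frames.

    phrase_emb: (m, d) array of text embeddings
    frame_emb:  (n, d) array of image embeddings
    Returns the (m, n) compatibility matrix A with A[i, j] = cos(t_i, r_j).
    """
    p = phrase_emb / np.linalg.norm(phrase_emb, axis=1, keepdims=True)
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    return p @ f.T

# placeholder embeddings; 640 matches the MURAL-large dimension quoted below
rng = np.random.default_rng(1)
A = phrase_frame_logits(rng.normal(size=(5, 640)), rng.normal(size=(7, 640)))
```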
“…Including outbound landmarks gives a small boost in automatic evaluations. All landmarks are represented using 640-dimension image embeddings from MURAL-large [31]. Rewrite task.…”
Section: Landmarks
confidence: 99%