CMU-MOSEAS: A Multimodal Language Dataset for Spanish, Portuguese, German and French

Zadeh, AmirAli Bagher; Cao, Yingnan; Hessner, Simon; Liang, Paul Pu; Poria, Soujanya; Morency, Louis-Philippe

doi:10.18653/v1/2020.emnlp-main.141

Cited by 19 publications

(12 citation statements)

References 54 publications

“…For future work, we will continue exploring this topic and expanding the framework to include more families of languages. As more benchmarks [48,40,3] on multilingual video-text pairs become available, we are interested in enhancing the grounding between vision and language by leveraging the temporal information from videos.…”

Section: Discussion (mentioning)

confidence: 99%

“…We formulate VQA as a multi-label classification problem, where the model predicts answer from the candidate pool. 3 VQA score [20] is used to compare model predictions against 10 human-annotated answers in VQA v2.0. On Visual Genome VQA Japanese, which only has one ground-truth answer to each question, we use accuracy and BLEU score as the evaluation metrics.…”

Section: Methods (mentioning)

confidence: 99%

“…We use Adam optimizer [29] with a linear warmup for the first 5% of training, and set the learning rate to 4e-4. We use Horovod and NCCL for multi-node commu- 3 We only consider top-3129 frequent answers for VQA v2.0 and top-3000 frequent answers for VQA VG Japanese. 4 BLEU score is used to compute a soft mapping score between the predicted answer and the ground-truth answer, assuming answers with many overlapping words should share similar semantic meaning.…”

Section: Methods (mentioning)

confidence: 99%

UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training

Zhou, Zhou, Wang et al. 2021

Preprint

Vision-and-language pre-training has achieved impressive success in learning multimodal representations between vision and language. To generalize this success to non-English languages, we introduce UC2, the first machine translation-augmented framework for cross-lingual cross-modal representation learning. To tackle the scarcity problem of multilingual captions for image datasets, we first augment existing English-only datasets with other languages via machine translation (MT). Then we extend the standard Masked Language Modeling and Image-Text Matching training objectives to multilingual setting, where alignment between different languages is captured through shared visual context (i.e., using image as pivot). To facilitate the learning of a joint embedding space of images and all languages of interest, we further propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM), leveraging MT-enhanced translated data. Evaluation on multilingual image-text retrieval and multilingual visual question answering benchmarks demonstrates that our proposed framework achieves new state of the art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.

“…In the past five years, text-based aspect-level sentiment analysis has drawn much attention (Chen and Qian, 2019; Zhang and Qian, 2020; Zheng et al, 2020; Tulkens and van Cranenburgh, 2020; Akhtar et al, 2020). While, multimodal target-oriented sentiment analysis has become more and more vital because of its urgent need to be applied to the industry recently (Akhtar et al, 2019; Zadeh et al, 2020; Sun et al, 2021a; Tang et al, 2019; Zhang et al, 2020b, 2021a). In the following, we mainly overview the limited studies of multi-modal aspect terms extraction and multi-modal aspect sentiment classification on text and image modalities.…”

Section: Related Work (mentioning)

confidence: 99%

Joint Multi-modal Aspect-Sentiment Analysis with Auxiliary Cross-modal Relation Detection

Ju, Zhang, Xiao et al. 2021

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Aspect terms extraction (ATE) and aspect sentiment classification (ASC) are two fundamental and fine-grained sub-tasks in aspect-level sentiment analysis (ALSA). In the textual analysis, jointly extracting both aspect terms and sentiment polarities has been drawn much attention due to the better applications than individual sub-task. However, in the multimodal scenario, the existing studies are limited to handle each sub-task independently, which fails to model the innate connection between the above two objectives and ignores the better applications. Therefore, in this paper, we are the first to jointly perform multi-modal ATE (MATE) and multi-modal ASC (MASC), and we propose a multi-modal joint learning approach with auxiliary cross-modal relation detection for multi-modal aspect-level sentiment analysis (MALSA). Specifically, we first build an auxiliary text-image relation detection module to control the proper exploitation of visual information. Second, we adopt the hierarchical framework to bridge the multi-modal connection between MATE and MASC, as well as separately visual guiding for each sub module. Finally, we can obtain all aspect-level sentiment polarities dependent on the jointly extracted specific aspects. Extensive experiments show the effectiveness of our approach against the joint textual approaches, pipeline and collapsed multi-modal approaches.

“…Within the same category of multimodal fusion, we plan to add datasets within the same application domains as well as to expand to new application domains. Within the current domains, we plan to include (1) the hateful memes challenge [82] as a core challenge in multimedia to ensure safer learning from ubiquitous text and images from the internet, (2) more datasets in the robotics and HCI domains where there are many opportunities for multimodal modeling, and (3) several datasets which are of broad interest but are released via licenses that restrict redistribution such as dyadic emotion recognition on IEMOCAP [21], deception prediction on from real-world Trial Data [123], and multilingual affect recognition on CMU-MOSEAS [186] which was only just recently released. We are currently working with the authors to integrate some of these datasets into MULTIBENCH in the near future.…”

Section: I11 Fusion (mentioning)

confidence: 99%

MultiBench: Multiscale Benchmarks for Multimodal Representation Learning

Liang, Lyu, Fan et al. 2021

Preprint

Self Cite

Learning multimodal representations involves integrating information from multiple heterogeneous sources of data. It is a challenging yet crucial area with numerous real-world applications in multimedia, affective computing, robotics, finance, human-computer interaction, and healthcare. Unfortunately, multimodal research has seen limited resources to study (1) generalization across domains and modalities, (2) complexity during training and inference, and (3) robustness to noisy and missing modalities. In order to accelerate progress towards understudied modalities and tasks while ensuring real-world robustness, we release MULTIBENCH, a systematic and unified large-scale benchmark for multimodal learning spanning 15 datasets, 10 modalities, 20 prediction tasks, and 6 research areas. MULTIBENCH provides an automated end-to-end machine learning pipeline that simplifies and standardizes data loading, experimental setup, and model evaluation. To enable holistic evaluation, MULTIBENCH offers a comprehensive methodology to assess (1) generalization, (2) time and space complexity, and (3) modality robustness. MULTIBENCH introduces impactful challenges for future research, including scalability to large-scale multimodal datasets and robustness to realistic imperfections. To accompany this benchmark, we also provide a standardized implementation of 20 core approaches in multimodal learning spanning innovations in fusion paradigms, optimization objectives, and training approaches. Simply applying methods proposed in different research areas can improve the state-of-the-art performance on 9/15 datasets. Therefore, MULTIBENCH presents a milestone in unifying disjoint efforts in multimodal machine learning research and paves the way towards a better understanding of the capabilities and limitations of multimodal models, all the while ensuring ease of use, accessibility, and reproducibility. MULTIBENCH, our standardized implementations, and leaderboards are publicly available, will be regularly updated, and welcomes inputs from the community.
