Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2019)
DOI: 10.18653/v1/n19-1267

Strong and Simple Baselines for Multimodal Utterance Embeddings

Abstract: Human language is a rich multimodal signal consisting of spoken words, facial expressions, body gestures, and vocal intonations. Learning representations for these spoken utterances is a complex research problem due to the presence of multiple heterogeneous sources of information. Recent advances in multimodal learning have followed the general trend of building more complex models that utilize various attention, memory and recurrent components. In this paper, we propose two simple but strong baselines to learn…
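The abstract is truncated above, so the paper's two actual baselines are not spelled out here. Purely for illustration, the sketch below shows what a minimal "pool then concatenate" utterance embedding looks like for the three kinds of signal the abstract names (words, facial/gestural features, vocal intonation). The function name, array shapes, and optional projection are assumptions made for this example, not the authors' method.

```python
# Illustrative sketch only: a generic pool-and-concatenate baseline for
# embedding one multimodal utterance. Shapes and the projection step are
# assumptions; this is not the paper's proposed model.
import numpy as np

def embed_utterance(text_feats, audio_feats, visual_feats, proj=None):
    """Average-pool each modality over time, concatenate, optionally project.

    text_feats:   (T_t, d_t) word-level features (e.g., word embeddings)
    audio_feats:  (T_a, d_a) frame-level acoustic features (e.g., prosody)
    visual_feats: (T_v, d_v) frame-level visual features (e.g., facial action units)
    proj:         optional (d_t + d_a + d_v, d_out) projection matrix
    """
    pooled = [m.mean(axis=0) for m in (text_feats, audio_feats, visual_feats)]
    fused = np.concatenate(pooled)  # simple early fusion by concatenation
    return fused if proj is None else fused @ proj

# Toy usage with random features standing in for real feature extractors.
rng = np.random.default_rng(0)
text = rng.normal(size=(20, 300))
audio = rng.normal(size=(150, 74))
visual = rng.normal(size=(150, 35))
proj = rng.normal(size=(300 + 74 + 35, 128)) / np.sqrt(409)
print(embed_utterance(text, audio, visual, proj).shape)  # (128,)
```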

Cited by 28 publications (18 citation statements)
References 44 publications
“…Our text corpora originate from the following five sources: 1) WikiText-2 (Merity et al., 2017a), a dataset of formally written Wikipedia articles (we only use the first 10% of WikiText-2, which we found to be sufficient to capture formally written text), 2) Stanford Sentiment Treebank (Socher et al., 2013), a collection of 10,000 polarized written movie reviews, 3) Reddit data collected from discussion forums related to politics, electronics, and relationships, 4) MELD (Poria et al., 2019), a large-scale multimodal multi-party emotional dialog dataset collected from the TV series Friends, and 5) POM (Park et al., 2014), a dataset of spoken review videos collected across 1,000 individuals spanning multiple topics. These datasets have been the subject of recent research in language understanding (Merity et al., 2017b; Liu et al., 2019) and multimodal human language (Liang et al., 2018, 2019). Table 2 summarizes these datasets.…”
Section: Sent-Debias (mentioning)
confidence: 99%
“…There exists a variety of exciting recent work on improved multimodal fusion techniques (Liang et al., 2019a; Pham et al., 2019; Baltrušaitis et al., 2019). In addition to the simplified feature and modality concatenations, we plan to explore some of these promising tensor-based multimodal fusion networks (Liu et al., 2018; Liang et al., 2019b; Tsai et al., 2019) for more robust intent classification on the AMIE dataset as future work.…”
Section: Discussion (mentioning)
confidence: 99%
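The citation above contrasts simple feature/modality concatenation with tensor-based fusion networks. The minimal sketch below shows the difference between the two fusion styles; the vector dimensions and the bias-augmentation step (appending a constant 1 before taking the outer product) are illustrative assumptions, not the cited papers' exact formulations.

```python
# Hedged sketch of two fusion styles: plain concatenation vs. a tensor
# (outer-product) fusion in the spirit of tensor-based fusion networks.
import numpy as np

def concat_fusion(z_text, z_audio, z_visual):
    # Early fusion: stack the unimodal embeddings into one vector.
    return np.concatenate([z_text, z_audio, z_visual])

def tensor_fusion(z_text, z_audio, z_visual):
    # Append a constant 1 to each modality so the 3-way outer product also
    # retains unimodal and bimodal interaction terms, then flatten.
    t = np.append(z_text, 1.0)
    a = np.append(z_audio, 1.0)
    v = np.append(z_visual, 1.0)
    return np.einsum("i,j,k->ijk", t, a, v).ravel()

z_t, z_a, z_v = np.ones(4), np.ones(3), np.ones(2)
print(concat_fusion(z_t, z_a, z_v).shape)  # (9,)
print(tensor_fusion(z_t, z_a, z_v).shape)  # (5 * 4 * 3,) = (60,)
```

The fused dimensionality grows multiplicatively for tensor fusion, which is why the low-rank variants cited above (e.g., Liu et al., 2018) factorize it.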