2020
DOI: 10.48550/arxiv.2005.01400
Preprint
Does Visual Self-Supervision Improve Learning of Speech Representations for Emotion Recognition?

Abhinav Shukla,
Stavros Petridis,
Maja Pantic

Abstract: Self-supervised learning has attracted plenty of recent research interest. However, most works are typically unimodal and there has been limited work that studies the interaction between audio and visual modalities for self-supervised learning. This work (1) investigates visual self-supervision via face reconstruction to guide the learning of audio representations; (2) proposes two audio-only self-supervision approaches for speech representation learning; (3) shows that a multi-task combination of the proposed…


Cited by 1 publication (1 citation statement)
References 52 publications
“…We jointly optimize a family of self-supervised tasks in an encoder-decoder setup, making this work an example of multi-task self-supervised learning. Multi-task self-supervised learning has been applied to other domains such as visual data [11,24], accelerometer recordings [35], audio [34] and multi-modal inputs [37,30]. Generally in each of these domains, tasks are defined ahead of time, as is the case for tasks such as frame reconstruction, colorization, finding relative position of image patches, mapping videos to optical flow, and video-audio alignment.…”
Section: Related Work
Confidence: 99%
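The citation statement above describes jointly optimizing a family of self-supervised tasks through a shared encoder with task-specific decoders. As a minimal illustrative sketch (not the paper's actual architecture), the core idea — one shared representation feeding multiple pretext-task heads whose losses are combined with per-task weights — can be written as follows; all dimensions, weights, and the two toy reconstruction objectives here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions, not taken from the paper.
D_IN, D_LAT = 8, 4

# Shared encoder weights and two task-specific decoder heads.
W_enc = rng.standard_normal((D_IN, D_LAT))
W_recon = rng.standard_normal((D_LAT, D_IN))  # e.g. a frame-reconstruction head
W_aux = rng.standard_normal((D_LAT, D_IN))    # a second pretext-task head

def multitask_loss(x, w_recon=1.0, w_aux=0.5):
    """Weighted sum of per-task losses computed from one shared latent.

    Each pretext task decodes the same latent z; the tasks are combined
    by a fixed weighting, as is common in multi-task self-supervision.
    """
    z = x @ W_enc                                 # shared representation
    loss_recon = np.mean((z @ W_recon - x) ** 2)  # task 1: reconstruct input
    loss_aux = np.mean((z @ W_aux - x) ** 2)      # task 2: second toy objective
    return w_recon * loss_recon + w_aux * loss_aux

x = rng.standard_normal((16, D_IN))  # a batch of toy inputs
total = multitask_loss(x)
```

In a real encoder-decoder setup the heads would target different signals (e.g. reconstructed frames vs. an audio-visual alignment label) and the encoder would be trained by backpropagating the combined loss; this sketch only shows the loss-combination structure.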