Interspeech 2021
DOI: 10.21437/interspeech.2021-1360

LiRA: Learning Visual Speech Representations from Audio Through Self-Supervision

Abstract: The large amount of audiovisual content being shared online today has drawn substantial attention to the prospect of audiovisual self-supervised learning. Recent works have focused on each of these modalities separately, while others have attempted to model both simultaneously in a cross-modal fashion. However, comparatively little attention has been given to leveraging one modality as a training objective to learn from the other. In this work, we propose Learning visual speech Representations from Audio via self-supervision (LiRA) …
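To make the training objective concrete, here is a minimal PyTorch sketch of this kind of cross-modal self-supervision: a visual encoder ingests mouth-region video and is trained to regress per-frame acoustic features computed from the accompanying audio. The modules, dimensions, and L1 loss below are illustrative assumptions, not the paper's exact architecture or feature extractor.

```python
import torch
import torch.nn as nn

# Minimal sketch of LiRA-style cross-modal self-supervision:
# a video encoder is trained to regress per-frame acoustic features,
# so the audio stream acts as the training target for the visual stream.
# Modules and shapes are illustrative, not the paper's exact architecture.

class VisualEncoder(nn.Module):
    def __init__(self, feat_dim=512, audio_dim=256):
        super().__init__()
        # stand-in for a 3D-conv + ResNet visual front-end
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep the time axis, pool space
        )
        self.temporal = nn.GRU(64, feat_dim, batch_first=True)
        self.head = nn.Linear(feat_dim, audio_dim)  # predicts acoustic features

    def forward(self, video):                          # video: (B, 1, T, H, W)
        x = self.frontend(video)                       # (B, 64, T, 1, 1)
        x = x.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, 64)
        x, _ = self.temporal(x)                        # (B, T, feat_dim)
        return self.head(x)                            # (B, T, audio_dim)

model = VisualEncoder()
video = torch.randn(2, 1, 75, 88, 88)    # 2 clips, 75 frames of 88x88 mouth crops
audio_targets = torch.randn(2, 75, 256)  # precomputed acoustic features from the audio track
loss = nn.L1Loss()(model(video), audio_targets)
loss.backward()
```

The point of the setup is that no manual labels are required: the audio track, which comes for free with the video, supplies the regression targets for the visual stream.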

Cited by 28 publications (7 citation statements)
References 25 publications
“…In particular, good audio representations can be learned by predicting handcrafted audio features [73] or by using joint audio and visual supervision [74]. Similarly, visual speech representations can be learned by predicting audio features [75]. Hence, the proposed auxiliary task provides additional supervision to the intermediate layers of the model, which in turn results in better visual representations and improved performance.…”
Section: Prediction-based Auxiliary Tasks
mentioning
confidence: 99%
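As a concrete illustration of such a prediction-based auxiliary task, the sketch below attaches an extra regression head to an intermediate representation, so that audio-feature prediction supervises the early layers alongside the primary loss. All module names, dimensions, and the 0.5 loss weighting are hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of a prediction-based auxiliary task: an extra head on
# an intermediate representation regresses audio features, adding supervision
# partway through the network alongside the primary classification loss.

backbone_lower = nn.Linear(128, 256)   # stand-in for the early layers
backbone_upper = nn.Linear(256, 256)   # stand-in for the later layers
main_head = nn.Linear(256, 10)         # primary task head (e.g. 10 classes)
aux_head = nn.Linear(256, 80)          # regresses audio features (e.g. 80 mel bins)

x = torch.randn(4, 128)                # a batch of visual features
labels = torch.randint(0, 10, (4,))    # primary-task labels
audio_feats = torch.randn(4, 80)       # targets precomputed from the audio track

mid = backbone_lower(x)                # intermediate representation
logits = main_head(backbone_upper(mid))
loss = nn.functional.cross_entropy(logits, labels) \
       + 0.5 * nn.functional.l1_loss(aux_head(mid), audio_feats)  # auxiliary term
loss.backward()
```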
“…[21][22][23]) and also produces features that separate suprasegmental properties such as speaker identities [17]. However, to the best of our knowledge, [24] is the only study so far using CPC for AL.…”
Section: Methods
mentioning
confidence: 99%
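For context, CPC (Contrastive Predictive Coding) trains a model to pick out the true future latent among negatives via the InfoNCE loss. Below is a minimal sketch of one InfoNCE step, with assumed shapes and in-batch negatives; it is a generic illustration of the technique, not the cited study's setup.

```python
import torch
import torch.nn.functional as F

# Minimal InfoNCE step as used in CPC-style training (shapes are assumed):
# a context vector c_t predicts the latent z_{t+k} at a future step, and the
# loss scores the true future against negatives drawn from the rest of the batch.

def info_nce(context, future, predictor):
    # context: (B, D) context summaries c_t; future: (B, D) true latents z_{t+k}
    pred = predictor(context)              # (B, D) predicted future latents
    logits = pred @ future.t()             # (B, B): each prediction vs. every candidate
    labels = torch.arange(logits.size(0))  # the matching future is the positive
    return F.cross_entropy(logits, labels)

B, D = 8, 256
predictor = torch.nn.Linear(D, D, bias=False)  # plays the role of CPC's W_k matrix
loss = info_nce(torch.randn(B, D), torch.randn(B, D), predictor)
loss.backward()
```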
“…Initialisation: To investigate the impact of initialisation, we consider three cases: 1) we train the model from scratch using only the LRW training set, 2) we pre-train the encoder from Fig. 1 on the LRS3 dataset [17] using the LiRA [12] self-supervised approach and fine-tune it on the LRW training set.…”
Section: Self-distillation Models
mentioning
confidence: 99%
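A generic sketch of the two initialisations being compared: training from scratch versus loading a LiRA pre-trained encoder and fine-tuning on LRW's 500-word classification task. The checkpoint filename, module choices, and learning rates are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

# Sketch of the two initialisations compared above (all names hypothetical):
# 1) train from scratch on LRW; 2) load a LiRA self-supervised checkpoint
# pre-trained on LRS3, then fine-tune on LRW's 500 word classes.

encoder = nn.GRU(64, 512, batch_first=True)  # stand-in for the visual encoder
classifier = nn.Linear(512, 500)             # LRW has 500 word classes

USE_LIRA_INIT = False  # set True when a LiRA checkpoint is available
if USE_LIRA_INIT:
    # hypothetical checkpoint produced by LiRA pre-training on LRS3
    encoder.load_state_dict(torch.load("lira_lrs3_encoder.pt"))

params = list(encoder.parameters()) + list(classifier.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4 if USE_LIRA_INIT else 3e-4)

def train_step(video_feats, word_labels):
    out, _ = encoder(video_feats)            # (B, T, 512)
    logits = classifier(out[:, -1])          # classify from the final time step
    loss = nn.functional.cross_entropy(logits, word_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# example call with random stand-in features (batch of 4, 29 frames)
print(train_step(torch.randn(4, 29, 64), torch.randint(0, 500, (4,))))
```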
“…"Scratch" denotes a model trained from scratch without using external data. "LiRA(LRS3)" indicates a self-supervised pre-trained model using LiRA[12] on the LRS3 dataset, and "LRS2&3+AVS" indicates a fully supervised pre-trained model on LRS2, LRS3 and AVSpeech.…”
mentioning
confidence: 99%