ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9413488
Audio-Visual Speech Inpainting with Deep Learning

Abstract: In this paper, we present a deep-learning-based framework for audio-visual speech inpainting, i.e., the task of restoring the missing parts of an acoustic speech signal from reliable audio context and uncorrupted visual information. Recent work focuses solely on audio-only methods and generally aims at inpainting music signals, which show highly different structure than speech. Instead, we inpaint speech signals with gaps ranging from 100 ms to 1600 ms to investigate the contribution that vision can provide fo…
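As a concrete illustration of the task setup described in the abstract, the sketch below shows one way a contiguous gap of 100 ms to 1600 ms could be zeroed out in a magnitude spectrogram before inpainting. The sample rate, hop length, and function name are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def mask_spectrogram(spec, gap_ms, sr=16000, hop=160, start_frame=None, rng=None):
    """Zero out a contiguous gap of `gap_ms` milliseconds in a magnitude
    spectrogram of shape (freq_bins, frames).

    Assumptions (not from the paper): 16 kHz audio, 10 ms hop (hop=160).
    Returns the masked spectrogram and a binary frame mask (0 = gap).
    """
    rng = rng or np.random.default_rng()
    frames_per_ms = sr / hop / 1000.0                    # frames per millisecond
    gap_frames = int(round(gap_ms * frames_per_ms))
    n_frames = spec.shape[1]
    if start_frame is None:
        start_frame = rng.integers(0, max(1, n_frames - gap_frames))
    mask = np.ones(n_frames, dtype=np.float32)
    mask[start_frame:start_frame + gap_frames] = 0.0
    return spec * mask[None, :], mask

# Example: a 1600 ms gap removes 160 frames at a 10 ms hop.
spec = np.abs(np.random.randn(257, 500)).astype(np.float32)  # dummy STFT magnitude
masked, mask = mask_spectrogram(spec, gap_ms=1600)
print(int((mask == 0).sum()))  # -> 160
```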

Cited by 18 publications (16 citation statements). References 32 publications.
“…We evaluate such a baseline in Section 3. Also relevant, but not directly related to our work, are methods that enhance speech based on modalities other than text: using facial crops of the target speaker, denoising [6], inpainting speech [15] or separating a target speaker from a mixture [4,20], and enhancing speech based on accelerometer data [17], just to name a few.…”
Section: Introduction
confidence: 99%
“…Regarding the first challenge, a new training objective was proposed in [19], where the audio embeddings are used to reconstruct the subsequent visual images over the time dimension. Audio-visual inpainting was designed in [20], where the masked speech spectrogram is predicted using the visual information and audio contexts. A versatile network presented in [13] can be used to learn both global and local representations.…”
Section: Introduction
confidence: 99%
“…Inspired by [22,20,13], in this paper we propose a new self-supervised audio-visual representation learning approach for AVSR. It can be seen as a multi-modal extension of Wav2vec2.0 [22].…”
Section: Introduction
confidence: 99%
“…For many years the problem lay dormant, until it began to attract attention again in 2019 with more modern, deep-learning-based audio restoration methods. Some of these works focus on improving the quality of speech signals (5,6,7) and others on music signals (8,9,10). However, most published works cover only limited reconstruction scenarios.…”
Section: List of Abbreviations
“…Morrone et al. (7) showed that audio inpainting tasks can also be approached multimodally. They use audio and video features concatenated frame by frame as inputs to a stacked bidirectional LSTM (BLSTM).…”
Section: Related Work
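The statement above, like the earlier excerpt citing [20], describes the general recipe: audio and video features are concatenated frame by frame and fed to a stacked BLSTM that predicts the spectrogram frames inside the gap. The PyTorch sketch below is a minimal, hypothetical rendering of that idea under stated assumptions; the feature dimensions, number of layers, and the assumption that video features are already upsampled to the audio frame rate are mine, not taken from the paper.

```python
import torch
import torch.nn as nn

class AVInpaintingBLSTM(nn.Module):
    """Minimal audio-visual inpainting sketch: frame-wise concatenation of
    audio and video features, a stacked BLSTM, and a linear layer that
    predicts spectrogram frames. All dimensions are illustrative assumptions."""

    def __init__(self, audio_dim=257, video_dim=136, hidden=250, layers=3):
        super().__init__()
        self.blstm = nn.LSTM(
            input_size=audio_dim + video_dim,
            hidden_size=hidden,
            num_layers=layers,
            batch_first=True,
            bidirectional=True,
        )
        self.out = nn.Linear(2 * hidden, audio_dim)

    def forward(self, masked_spec, video_feats):
        # masked_spec: (batch, frames, audio_dim) with gap frames zeroed
        # video_feats: (batch, frames, video_dim), upsampled to the audio frame rate
        x = torch.cat([masked_spec, video_feats], dim=-1)   # frame-wise concatenation
        h, _ = self.blstm(x)
        return self.out(h)                                   # predicted spectrogram

# Toy usage: compute an L1 loss only on the masked (gap) frames.
model = AVInpaintingBLSTM()
spec = torch.randn(2, 300, 257)
video = torch.randn(2, 300, 136)
mask = torch.ones(2, 300, 1)
mask[:, 100:180] = 0.0                      # a ~0.8 s gap at a 10 ms hop
pred = model(spec * mask, video)
loss = ((pred - spec).abs() * (1 - mask)).mean()
```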