2021
DOI: 10.1109/access.2021.3050843
Creating Song From Lip and Tongue Videos With a Convolutional Vocoder

Cited by 6 publications (3 citation statements)
References 35 publications
“…In the area of AAM, several different types of articulatory acquisition equipment have been used, including ultrasound tongue imaging (UTI) [4]–[22], electromagnetic articulography (EMA) [23]–[27], permanent magnetic articulography (PMA) [28, 29], surface electromyography (sEMG) [30]–[32], electro-optical stomatography (EOS) [33], lip video [5, 6, 34]–[36], continuous-wave radar [37], or a multimodal combination of these [38]. There are basically two distinct approaches to SSI, namely "direct synthesis" and "recognition-and-synthesis" [2].…”
Section: Introduction (mentioning)
confidence: 99%
“…In a multi-speaker framework, in Chapter 4 we experimented with the use of x-vector features extracted from the speakers, leading to a marginal improvement in the spectral estimation step [37]. Zhang et al. evaluated unconstrained multi-speaker voice recovery from UTI and lip video using a transfer-learning strategy and an encoder-decoder architecture [114]. There have been further studies on multi-speaker lip-to-speech synthesis [67, 73, 87, 107].…”
Section: Chapter (mentioning)
confidence: 99%
“…In the experimental section we will experiment with both 2D and 3D Convolutional Neural Networks (CNNs) for the mapping task. The problem could also be addressed even in the absence of aligned training data, using encoder-decoder networks [83, 114] or video transformers [8, 90].…”
Section: The UTI-to-Speech Framework (mentioning)
confidence: 99%