2021
DOI: 10.1109/access.2021.3050843
Creating Song From Lip and Tongue Videos With a Convolutional Vocoder

Cited by 6 publications (3 citation statements)
References 35 publications
“…In the area of AAM, several different types of articulatory acquisition equipment have been used, including ultrasound tongue imaging (UTI) [4]–[22], electromagnetic articulography (EMA) [23]–[27], permanent magnetic articulography (PMA) [28, 29], surface electromyography (sEMG) [30]–[32], electro-optical stomatography (EOS) [33], lip video [5, 6, 34]–[36], continuous-wave radar [37], or a multimodal combination of these [38]. There are basically two distinct approaches to SSI, namely "direct synthesis" and "recognition-and-synthesis" [2].…”
Section: Introduction (mentioning)
confidence: 99%
“…In a multi-speaker framework, in Chapter 4 we experimented with the use of x-vector features extracted from the speakers, leading to a marginal improvement in the spectral estimation step [37]. Zhang et al. evaluated unconstrained multi-speaker voice recovery from UTI and lip video using a transfer-learning strategy and an encoder-decoder architecture [114]. There have been further studies on multi-speaker lip-to-speech synthesis [67, 73, 87, 107].…”
Section: Chapter (mentioning)
confidence: 99%
“…In the experimental section we will experiment with both 2D and 3D Convolutional Neural Networks (CNNs) for the mapping task. The problem could also be addressed even in the absence of aligned training data, using encoder-decoder networks [83, 114] or video transformers [8, 90].…”
Section: The UTI-to-Speech Framework (mentioning)
confidence: 99%