ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053299
CoGANs for Unsupervised Visual Speech Adaptation to New Speakers

Cited by 6 publications (4 citation statements) | References 14 publications
“…In fact, it has been proven that each person produces speech in a unique way [36], a finding that supports the idea that visual speech features are highly sensitive to the identity of the speaker [29]. However, although a wide range of works have studied the speaker adaptation of end-to-end systems in the field of ASR [37][38][39][40], only a few works in this regard have addressed VSR [41,42]. Although this speaker-dependent approach makes for a less demanding task, it should not be forgotten that speaker-adapted VSR systems could be helpful in a non-invasive and inconspicuous way for people who suffer from communication difficulties [15][16][17].…”

Section: Introduction

confidence: 93%
“…Kandala et al [41] defined an architecture based on the CTC paradigm [23] where, after computing visual speech features, a speaker-specific identity vector was integrated as an additional input to the decoder. Fernandez-Lopez et al [42] approached the problem indirectly, studying how to adapt the visual front end of an audiovisual recognition system. Specifically, the authors proposed an unsupervised method that allowed an audiovisual system to be adapted when only visual data were available.…”
Section: Related Work
confidence: 99%
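The conditioning scheme described above — tiling a per-utterance speaker identity vector across time and feeding it alongside the visual features into a CTC decoder — can be sketched as follows. This is a minimal illustration, not the cited paper's implementation: the function name, the 256/32 feature dimensions, and the use of simple concatenation are all assumptions for the example.

```python
import numpy as np

def condition_on_speaker(visual_feats, speaker_vec):
    """Append a fixed per-utterance speaker vector to every frame's features.

    visual_feats: (T, D) array of per-frame visual speech features.
    speaker_vec:  (S,) speaker identity vector (e.g. an i-vector).
    Returns a (T, D + S) array the decoder consumes frame by frame.
    """
    T = visual_feats.shape[0]
    tiled = np.tile(speaker_vec, (T, 1))                  # (T, S): repeat across time
    return np.concatenate([visual_feats, tiled], axis=1)  # (T, D + S)

# Illustrative sizes: 75 video frames, 256-dim features, 32-dim speaker vector.
feats = np.random.randn(75, 256)
spk = np.random.randn(32)
out = condition_on_speaker(feats, spk)
print(out.shape)  # (75, 288)
```

In this scheme the speaker vector acts as a constant side input at every time step, so adapting to a new speaker only requires estimating a new identity vector rather than retraining the recognizer.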
“…Kandala et al [16] defined an architecture based on the Connectionist Temporal Classification (CTC) paradigm [26] where, once visual speech features were computed, a speakerspecific identity vector was integrated as an additional input to the decoder. Moreover, Fernandez-Lopez et al [17] approached the problem indirectly, since it was studied how to adapt the visual front-end of an audio-visual recognition system. Thus, the authors proposed an unsupervised method that allowed an audiovisual system to be adapted when only the visual channel was available.…”
Section: Related Workmentioning
confidence: 99%
“…As detailed in Section 2, there is a wide range of works which have studied the speaker adaptation of end-to-end systems in the field of Acoustic Speech Recognition (ASR) [12,13,14,15]. On the contrary, few works in this regard have addressed VSR [16,17]. Although this speaker-dependent approach means facing a less demanding task, it should not be forgotten that speaker-adapted VSR systems could be helpful, in a non-invasive and inconspicuous way, for people who suffer from communication difficulties [18,19].…”

Section: Introduction
confidence: 99%