Language identification from visual-only speech signals

Ronquest, Rebecca; Levi, Susannah V.; Pisoni, David B.

doi:10.3758/app.72.6.1601

Cited by 28 publications

(31 citation statements)

References 28 publications

“…Another feature of language is rhythm. [41] explains that babies have the ability to distinguish languages based on acoustic rhythm, and [42] suggests that adults also have this ability and furthermore, rhythm is expressed visually. Further work into VLID could therefore focus on incorporating both of these additional language cues and evaluating their contribution to language discrimination.…”

Section: Overall Conclusion and Future Work (mentioning)

confidence: 99%

Language Identification Using Visual Features

Newman

Cox

2012

IEEE Trans. Audio Speech Lang. Process.

Abstract: Automatic visual language identification (VLID) is the technology of using information derived from the visual appearance and movement of the speech articulators to identify the language being spoken, without the use of any audio information. This technique for language identification (LID) is useful in situations in which conventional audio processing is ineffective (very noisy environments), or impossible (no audio signal is available). Research in this field is also beneficial in the related field of automatic lip-reading. This paper introduces several methods for visual language identification (VLID). They are based upon audio LID techniques, which exploit language phonology and phonotactics to discriminate languages. We show that VLID is possible in a speaker-dependent mode by discriminating different languages spoken by an individual, and we then extend the technique to speaker-independent operation, taking pains to ensure that discrimination is not due to artefacts, either visual (e.g. skin-tone) or audio (e.g. rate of speaking). Although the low accuracy of visual speech recognition currently limits the performance of VLID, we can obtain an error-rate of < 10% in discriminating between Arabic and English on 19 speakers and using about 30 s of visual speech.

“…Specifically, Ronquest et al [21] carry out experiments in which participants are asked to observe, without listening, video clips that show a male or a female speaker talking in English or Spanish. Both speakers appearing in the videos are bilingual in English and Spanish.…”

Section: B Visual Speech and Accent (mentioning)

confidence: 99%

“…However, the beneficial role of visual information to speech comprehension has been well documented [17] and experimentally validated [18]- [20]. Furthermore, recent findings indicate that human observers can actually perform language identification through the visual modality only [21]. Automated approaches for visual-only language identification have also been proposed (e.g., [22]).…”

Section: Introduction (mentioning)

confidence: 99%

Discrimination Between Native and Non-Native Speech Using Visual Features Only

Georgakis

Petridis

Pantić

2016

IEEE Trans. Cybern.

Abstract: Accent is a soft biometric trait that can be inferred from pronunciation and articulation patterns characterising the speaking style of an individual. Past research has addressed the task of classifying accent, as belonging to a native language speaker or a foreign language speaker, by means of the audio modality only. However, features extracted from the visual stream of speech have been successfully used to extend or substitute audio-only approaches that target speech or language recognition. Motivated by these findings, we investigate to what extent temporal visual speech dynamics attributed to accent can be modelled and identified when the audio stream is missing or noisy, and the speech content is unknown. We present here a fully automated approach to discriminating native from non-native English speech, based exclusively on visual cues. A systematic evaluation of various appearance and shape features for the target problem is conducted, with the former consistently yielding superior performance. Subject-independent cross-validation experiments are conducted on mobile phone recordings of continuous speech and isolated word utterances spoken by 56 subjects from the challenging MOBIO Database. High performance is achieved on a text-dependent protocol, with the best score of 76.5% yielded by fusion of five HMMs trained on appearance features. Our framework is also efficient even when tested on examples of speech unseen in the training phase, although performing less accurately compared to the text-dependent case.

“…For example, it has been found that both bilingual and monolingual Spanish and Catalan speakers, but not speakers of English and Italian, can discriminate Catalan and Spanish using only visual cues (Soto-Faraco et al, 2007). Also both monolingual and bilingual English- and Spanish-speaking adults have been shown to discriminate between Spanish and English only on the basis of visual cues (Ronquest et al, 2010). Likewise, Navarra et al (2014) showed that English and Spanish/Catalan adult speakers do exploit visual information concerning the temporal distribution of consonant and vowel intervals to discriminate languages that differ in this speech property such as English and Japanese.…”

Section: Discussion (mentioning)

confidence: 99%

“…For example, bilingual Spanish-Catalan as well as monolingual Spanish and Catalan speakers, but not monolingual speakers of English and Italian, can discriminate Catalan and Spanish using only visual cues (Soto-Faraco et al, 2007). Also monolingual and bilingual English- and Spanish-speaking adults have been shown to discriminate between Spanish and English—two languages differing at the basic rhythmic level—only on the basis of the visual cues provided by speaking faces (Ronquest et al, 2010). These results suggest that adult listeners can discriminate between rhythmically similar (Spanish and Catalan) as well as rhythmically different (English and Spanish) languages by analyzing the facial mimic when they know at least one of the two languages.…”

Section: Introduction (mentioning)

confidence: 99%

Rhythm on Your Lips

et al. 2016

The Iambic-Trochaic Law (ITL) accounts for speech rhythm, grouping of sounds as either Iambs (if alternating in duration) or Trochees (if alternating in pitch and/or intensity). The two different rhythms signal word order, one of the basic syntactic properties of language. We investigated the extent to which Iambic and Trochaic phrases could be auditorily and visually recognized, when visual stimuli engage lip reading. Our results show both rhythmic patterns were recognized from both auditory and visual stimuli, suggesting that speech rhythm has a multimodal representation. We further explored whether participants could match Iambic and Trochaic phrases across the two modalities. We found that participants auditorily familiarized with Trochees, but not with Iambs, were more accurate in recognizing visual targets, while participants visually familiarized with Iambs, but not with Trochees, were more accurate in recognizing auditory targets. The latter results suggest an asymmetric processing of speech rhythm: in the auditory domain, the changes in either pitch or intensity are better perceived and represented than changes in duration, while in the visual domain the changes in duration are better processed and represented than changes in pitch, raising important questions about domain general and specialized mechanisms for speech rhythm processing.
