2006 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings
DOI: 10.1109/icassp.2006.1660092
An Analysis of Visual Speech Information Applied to Voice Activity Detection

Cited by 48 publications (34 citation statements) · References 7 publications
“…Sodoyer et al. [6] extended this idea to combine audio-visual speech processing and blind source separation to form an early contribution to audio-visual source separation research. More recently, Wang et al. [7] and Sodoyer et al. [8] have used visual information to help solve the convolutive case of BSS. However, audio-visual BSS is still in the early stages of research compared to audio-only BSS; this paper gives an overview of the research area so far, and provides recently obtained results and suggestions for future research.…”
Section: Does the Listener Recognize What One Person Is Saying Among…
confidence: 99%
“…Sodoyer et al. [6], [8] extract the internal width and height of the lips using a chroma-key process and contour tracking on lips with blue makeup. Wang et al. [7] and Aubrey et al. [12] use facial features found on the basis of an AAM.…”
Section: Feature Extraction
confidence: 99%
“…Even though solutions for these problems have been proposed (e.g., [11,19,32]), various researchers have argued that taking the visual signal into account (if available) can help in addressing these issues, e.g. because the presence or absence of lip movements can help in distinguishing noise from speech [35], and because visual cues can help for speech segmentation. Moreover, importantly, visual cues such as mouth and head movements typically precede the actual onset of speech [40], allowing for an earlier detection of speech events, which in turn may be beneficial for the robustness of speech recognition systems.…”
Section: Introduction
confidence: 99%
“…In a preliminary work, lip movements have been shown to be good candidates to characterize the opposition between silence and non-silence activity (Sodoyer et al., 2006), the lip-shape variations being generally smaller in silence sections. Therefore, following this previous work, we chose to describe the lip-shape movements with one dynamic parameter, summing the absolute values of the two lip parameter derivatives (Sodoyer et al., 2006):…”
Section: A Dynamic Lip Parameter For Silence Vs Non-silence Chara…
confidence: 99%
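The dynamic parameter described in the citation above — the sum of the absolute values of the derivatives of the two lip parameters (internal lip width and height) — can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function name, the choice of `np.gradient` for the derivative, and the frame rate are all assumptions.

```python
import numpy as np

def dynamic_lip_parameter(width, height, fps=50.0):
    """Sketch of a dynamic lip parameter: sum of the absolute
    derivatives of internal lip width and height over time.

    width, height : 1-D sequences of lip measurements per video frame
    fps           : assumed frame rate used to scale the derivative
    """
    dt = 1.0 / fps
    dw = np.gradient(np.asarray(width, dtype=float), dt)   # d(width)/dt
    dh = np.gradient(np.asarray(height, dtype=float), dt)  # d(height)/dt
    return np.abs(dw) + np.abs(dh)

# A static lip shape (as in a silence section) yields values near zero,
# while articulatory movement yields large values.
silence = dynamic_lip_parameter(np.ones(10), np.ones(10))
```

Consistent with the quoted passage, the parameter stays small during silence (little lip-shape variation) and grows during speech activity, which is what makes it usable for silence vs. non-silence characterization.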