Visual voice activity detection with optical flow

Aubrey, Andrew J.; Hicks, Yulia; Chambers, J.A.

doi:10.1049/iet-ipr.2009.0042

Cited by 30 publications

(24 citation statements)

References 6 publications

“…For example, in a cocktail party scenario, looking at the speaker's face, or more precisely the movements of the lip region, helps one to comprehend the speech of interest. The bimodal coherence of audio and visual stimuli was shown to be useful for voice activity detection [7], [8], [9]. However, the above visual VAD algorithms use either only static or only dynamic features.…”

Section: Introduction (mentioning)

confidence: 99%

Blind source separation and visual voice activity detection for target speech extraction

Liu

Wang

2011

2011 3rd International Conference on Awareness Science and Technology (iCAST)

Abstract: Despite being studied extensively, the performance of blind source separation (BSS) is still limited, especially for sensor data collected in adverse environments. Recent studies show that this issue can be mitigated by incorporating multimodal information into the BSS process. In this paper, we propose a method for the enhancement of the target speech separated by a BSS algorithm from sound mixtures, using visual voice activity detection (VAD) and spectral subtraction. First, a classifier for visual VAD is formed in the off-line training stage, using labelled features extracted from the visual stimuli. Then we use this visual VAD classifier to detect the voice activity of the target speech. Finally, we apply a multi-band spectral subtraction algorithm to enhance the BSS-separated speech signal based on the detected voice activity. We have tested our algorithm on mixtures generated artificially by mixing filters with different reverberation times, and the results show that our algorithm improves the quality of the separated target signal.
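
The last step of this abstract (spectral subtraction driven by the detected voice activity) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the separated signal is already available as per-frame power spectra, that the interference estimate is simply the average of the frames the visual VAD marks as silent, and that the band edges, over-subtraction factors and spectral floor are supplied by the caller. The name `multiband_subtract` and all of its parameters are hypothetical.

```python
def multiband_subtract(frames, vad, bands, alphas, floor=0.02):
    """Illustrative VAD-gated multi-band spectral subtraction.

    frames: list of per-frame power spectra (lists of floats)
    vad:    list of bools, True where the target is judged to be speaking
    bands:  list of (start, stop) bin ranges, one per frequency band
    alphas: over-subtraction factor for each band (assumed given)
    floor:  fraction of the input kept as a spectral floor
    """
    silent = [f for f, active in zip(frames, vad) if not active]
    if not silent:
        raise ValueError("need at least one silent frame to estimate interference")
    n_bins = len(frames[0])
    # Assumption: interference spectrum = mean of the silent frames.
    noise = [sum(f[k] for f in silent) / len(silent) for k in range(n_bins)]
    enhanced = []
    for frame in frames:
        clean = list(frame)
        for (start, stop), alpha in zip(bands, alphas):
            for k in range(start, stop):
                # Subtract the scaled interference, but never go below the floor.
                clean[k] = max(frame[k] - alpha * noise[k], floor * frame[k])
        enhanced.append(clean)
    return enhanced
```

For example, `multiband_subtract([[4.0, 4.0], [1.0, 1.0]], [True, False], [(0, 1), (1, 2)], [1.0, 2.0])` treats the second frame as interference only and subtracts it more aggressively in the second band.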

“…Visual VAD has potential applications in noise reduction, speech separation or extraction and speech recognition. Some visual VAD algorithms have been proposed [1], [2], [3]. The method in [1] first projects the mouth region into principal component space, then models silent and non-silent periods with a single Gaussian distribution and a Gaussian mixture distribution respectively for the decision rule.…”

Section: Introduction (mentioning)

confidence: 99%
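
The decision rule described in this statement (a single Gaussian for silent periods, a Gaussian mixture for non-silent periods) can be sketched as a likelihood comparison. This is only an illustration: it uses one scalar feature in place of the principal-component projection, assumes the model parameters are already trained, and assumes the rule is a plain comparison of the two likelihoods. The names `gauss_pdf` and `is_speech` are hypothetical.

```python
import math

def gauss_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def is_speech(x, silence, speech_mixture):
    """Return True if the mixture model explains x better than the silence model.

    silence:        (mean, var) of the single Gaussian for silent frames
    speech_mixture: list of (weight, mean, var) components for non-silent frames
    """
    p_silence = gauss_pdf(x, *silence)
    p_speech = sum(w * gauss_pdf(x, m, v) for w, m, v in speech_mixture)
    return p_speech > p_silence
```

With a silence model centred on zero and a two-component mixture centred on larger feature values, frames with little mouth-region variation fall on the silence side of the comparison.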

“…The algorithm in [2] uses a filtered dynamic visual feature calculated from geometric visual features with multi-thresholds for silence detection. The approach in [3] estimates lip motion based on complex discrete wavelet transform, then applies the hidden Markov model for the statistical characterization of the lip motion, which is finally thresholded for the VAD. However, the algorithms above use either only static or only dynamic features, and the features used are fixed for different objects (speakers).…”

Section: Introduction (mentioning)

confidence: 99%

A visual voice activity detection method with adaboosting

Liu

Wang

Jackson

2011

Sensor Signal Processing for Defence (SSPD 2011)

Abstract: Spontaneous speech in videos capturing the speaker's mouth provides bimodal information. Exploiting the relationship between the audio and visual streams, we propose a new visual voice activity detection (VAD) algorithm, to overcome the vulnerability of conventional audio VAD techniques in the presence of background interference. First, a novel lip extraction algorithm combining rotational templates and prior shape constraints with active contours is introduced. The visual features are then obtained from the extracted lip region. Second, with the audio voice activity vector used in training, adaboosting is applied to the visual features, to generate a strong final voice activity classifier by boosting a set of weak classifiers. We have tested our lip extraction algorithm on the XM2VTS database (with higher resolution) and some video clips from YouTube (with lower resolution). The visual VAD was shown to offer low error rates.
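
The boosting step in this abstract can be sketched with a small AdaBoost over decision stumps. This is a generic textbook version, not the authors' classifier: the weak learners (single-feature thresholds), the feature vectors and the number of rounds are all assumptions, and the labels stand in for the audio voice activity vector used in training (+1 for speech, -1 for silence). The names `train_adaboost` and `predict` are hypothetical.

```python
import math

def train_adaboost(X, y, rounds=5):
    """Train AdaBoost with decision stumps.

    X: list of visual feature vectors; y: labels in {-1, +1} (+1 = speech).
    Returns a list of weak classifiers (alpha, feature, threshold, polarity).
    """
    n = len(X)
    weights = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        best = None
        # Exhaustively pick the stump with the lowest weighted error.
        for f in range(len(X[0])):
            for thr in sorted({x[f] for x in X}):
                for pol in (1, -1):
                    preds = [pol if x[f] >= thr else -pol for x in X]
                    err = sum(w for w, p, t in zip(weights, preds, y) if p != t)
                    if best is None or err < best[0]:
                        best = (err, f, thr, pol, preds)
        err, f, thr, pol, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, f, thr, pol))
        # Increase the weight of misclassified samples, then renormalise.
        weights = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model

def predict(model, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * (pol if x[f] >= thr else -pol) for a, f, thr, pol in model)
    return 1 if score >= 0 else -1
```

The strong classifier is the weighted vote of the stumps, so a feature that separates speech from silence poorly on its own can still contribute in later rounds.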

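“…Research using video signals mainly exploits lip movement [8], and as multimodal systems that use audio and video together become increasingly widespread, many attempts have also been made to improve voice activity detection performance using video signals in environments with severe acoustic noise [9][10][11][12][13][14]. Various visual voice activity detection algorithms have been proposed, differing in how the features are extracted and in how the extracted features are used to discriminate speech from non-speech.…”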

unclassified

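“…Navarathna et al. applied to voice activity detection a series of transforms that had been used in visual speech recognition [13]; they used dynamic as well as static features and discriminated speech/non-speech segments through Gaussian mixture modelling. Aubrey et al. also attempted to detect voice activity from lip movement, modelling the changes in the optical flow computed from each video frame with an HMM [14]. [14,15].…”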

unclassified

Voice Activity Detection using Motion and Variation of Intensity in The Mouth Region

Kim

Ryu

Cho

2012

Journal of Broadcast Engineering

Voice activity detection (VAD) is generally conducted by extracting features from the acoustic signal and applying a decision rule. The performance of such VAD algorithms, driven by the input acoustic signal, depends heavily on the acoustic noise. When video signals are available as well, the performance of VAD can be enhanced by using the visual information, which is not affected by the acoustic noise. Previous visual VAD algorithms usually use a single visual feature to detect the lip activity, such as active appearance models, optical flow or intensity variation. Based on an analysis of the weakness of each feature, we propose to combine an intensity change measure and the optical flow in the mouth region, which can compensate for each other's weaknesses. In order to minimize the computational complexity, we develop simple measures that avoid statistical estimation or modeling. Specifically, the optical flow is the averaged motion vector of some grid regions and the intensity variation is detected by simple thresholding. To extract the mouth region, we propose a simple algorithm which first detects the two eyes and uses the intensity profile to detect the center of the mouth. Experiments show that the proposed combination of two simple measures gives higher detection rates for a given false positive rate than methods that use a single feature.
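
The two measures named in this abstract can be sketched as follows. This is an illustration under stated assumptions rather than the published method: the mouth region is assumed to be already cropped into small grayscale frames, the per-grid motion vectors are assumed to come from some optical flow routine not shown here, and the OR-combination and both threshold values are hypothetical choices. All function names are hypothetical.

```python
from math import hypot

def intensity_change(prev, curr):
    """Mean absolute per-pixel difference between two equally sized grayscale frames."""
    total = sum(abs(c - p) for prow, crow in zip(prev, curr) for p, c in zip(prow, crow))
    return total / (len(curr) * len(curr[0]))

def mean_motion(vectors):
    """Magnitude of the averaged (dx, dy) motion vector over the grid regions."""
    n = len(vectors)
    mean_dx = sum(dx for dx, _ in vectors) / n
    mean_dy = sum(dy for _, dy in vectors) / n
    return hypot(mean_dx, mean_dy)

def is_speaking(prev, curr, vectors, intensity_thresh=5.0, motion_thresh=0.5):
    """Flag lip activity if either simple measure exceeds its threshold.

    Assumption: the two measures are combined with a logical OR.
    """
    return (intensity_change(prev, curr) > intensity_thresh
            or mean_motion(vectors) > motion_thresh)
```

Both measures are plain averages followed by thresholds, which matches the abstract's stated goal of avoiding statistical estimation or modeling.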
