2010 IEEE International Conference on Acoustics, Speech and Signal Processing
DOI: 10.1109/icassp.2010.5494913
Speech/non-speech detection in meetings from automatically extracted low resolution visual features

Abstract: In this paper we address the problem of estimating who is speaking from automatically extracted low resolution visual cues in group meetings. Traditionally, the task of speech/non-speech detection or speaker diarization tries to find "who speaks and when" from audio features only. In this paper, we investigate more systematically how speaking status can be estimated from low resolution video. We exploit the synchrony of a group's head and hand motion to learn correspondences between speaking status and visual a…

Cited by 20 publications (17 citation statements)
References 20 publications
“…Hung and Ba [10] applied visual activity (the amount of movement) and focus of visual attention as features to determine who the current speaker is on real meeting room corpus data. Action units (AU) were used as input features to Hidden Markov Models (HMM) in Stefanov et al. [11].…”
Section: Introduction
confidence: 99%
“…Other approaches to speaker detection include a general pattern recognition framework by Besson and Kunt [30], applied to detecting the speaker in audio-visual sequences. Visual activity (the amount of movement) and focus of visual attention were used as inputs by Hung and Ba [31] to determine the current speaker in real meetings. Stefanov et al. [32] used action units as inputs to Hidden Markov Models to determine the active speaker in multi-party interactions, and Vajaria et al. [33] demonstrated that information from body movements can improve detection performance.…”
Section: B. Active Speaker Detection
confidence: 99%
“…Indeed, cognitive scientists have shown that speech and gestures are so tightly intertwined that every important investigation of language has taken gestures into account [51]. While multimodal diarization, based on the joint modeling of speech, facial, and bodily cues, is common in the literature, unimodal diarization exploiting visual information alone is rare [52] and unrelated to surveillance. We think this is a direction worth investigating, because it allows capturing turn-taking patterns that indicate ongoing conversations and thus genuine social interactions (PROBLEM 3).…”
Section: Gesture and Posture
confidence: 99%