Proceedings of the 9th International Conference on Multimodal Interfaces 2007
DOI: 10.1145/1322192.1322254
On-line multi-modal speaker diarization

Cited by 22 publications (17 citation statements) | References 8 publications
“…Though the results seem comparable to the state-of-the-art, the solution requires specialized hardware. The work presented in [106] integrates audiovisual features for on-line audiovisual speaker diarization using a dynamic Bayesian network (DBN) but tests were limited to discussions with two to three people on two short test scenarios. Another use of DBN, also called factorial HMMs [107], is proposed in [108] as an audiovisual framework.…”
Section: Overlap Detection
confidence: 99%
“…However, it is vulnerable to errors during periods of overlapping speech, even when multiple audio sources are used to estimate delays between captured audio signals. One solution is to use visual cues to solve the problem audio-visually [8,9,10], but improvements are not always consistent, so it is difficult to conclude when they are useful.…”
Section: Introduction
confidence: 99%
“…Much previous work that exploits temporal correspondences between speech and vision has tended to assume that motion from the mouth is the principal visual manifestation of speech [11,8]. However, there is considerable evidence from both social psychology [12] and computational methods [13,9] to suggest that speaking in conversations can manifest itself in broader body motions, which psychologists suggest aid cognitive communicative processes [12].…”
Section: Introduction
confidence: 99%
“…Noulas and Krose [6] investigated an on-line multimodal speaker diarisation system based on dynamic Bayesian networks and audio-visual mutual information in a constrained setting (videos of two seated people speaking in turns). An interesting two-step real-time multimodal system to analyse group meetings was proposed by Otsuka et al. [8].…”
Section: Introduction
confidence: 99%
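The audio-visual mutual-information cue mentioned in the statement above can be illustrated with a minimal sketch: score each participant by the mutual information between the audio energy signal and that participant's visual motion signal, and attribute speech to the highest-scoring track. This is a generic histogram-based MI estimator for illustration only (all function names are hypothetical), not the actual model of [6], which uses a dynamic Bayesian network.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram estimate of mutual information (in nats) between two 1-D signals."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                      # joint distribution p(x, y)
    px = pxy.sum(axis=1, keepdims=True)            # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)            # marginal p(y)
    nz = pxy > 0                                   # avoid log(0) on empty cells
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def pick_speaker(audio_energy, motion_tracks):
    """Return the index of the motion track sharing most information with the audio."""
    scores = [mutual_information(audio_energy, m) for m in motion_tracks]
    return int(np.argmax(scores))
```

With per-frame audio energy and one motion signal per participant (e.g. frame-differencing inside each person's bounding box), `pick_speaker` labels each window with the participant whose motion best predicts the audio, which is the intuition behind the audio-visual association cue.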
“…Note that our data are more challenging than those used in [6,8] (where participants were assumed to be always seated in front of the camera). The main contribution of this paper is to exploit the role of gaze in coordinating turn-taking, by adopting a novel feature set based on Visual Focus of Attention (VFoA) to improve speaker diarisation.…”
Section: Introduction
confidence: 99%