2019
DOI: 10.48550/arxiv.1906.10042
Preprint

Who said that?: Audio-visual speaker diarisation of real-world meetings

Cited by 4 publications (5 citation statements)
References 29 publications
“…In contrast, utilizing a multi-modal approach by adding video data has been shown to outperform audio-only analysis in the field of speech processing [21]. Thus, recent studies have often leveraged a multi-stage approach and incorporated video data, rather than using a single end-to-end model [22][23][24][25][26][27]. The method proposed by Yoshioka et al. [22] uses face tracking and identification, sound source localization, and speaker identification, yet it requires multi-channel audio input.…”
Section: Multi-modal Speaker Diarization
confidence: 99%
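The multi-stage idea quoted above (combine an audio voice-activity decision with per-face visual activity to decide who is speaking) can be sketched as a toy fusion step. This is a minimal illustration, not the method of any cited paper; the frame format, `vad` flag, and lip-motion scores are all hypothetical.

```python
# Hypothetical sketch of multi-stage audio-visual diarization fusion:
# per frame, audio voice activity gates the decision, and the face with
# the strongest visual (lip-motion) activity is labelled as the speaker.
# All data and names here are illustrative.

def diarize(frames):
    """frames: list of dicts with 'vad' (bool, voice activity) and
    'face_scores' (dict face_id -> lip-motion score).
    Returns one speaker label per frame, or None for silence."""
    labels = []
    for f in frames:
        if not f["vad"]:
            labels.append(None)  # no speech in this frame
        else:
            # pick the face whose visual activity best explains the audio
            labels.append(max(f["face_scores"], key=f["face_scores"].get))
    return labels

frames = [
    {"vad": True,  "face_scores": {"A": 0.9, "B": 0.1}},
    {"vad": True,  "face_scores": {"A": 0.2, "B": 0.7}},
    {"vad": False, "face_scores": {"A": 0.1, "B": 0.1}},
]
print(diarize(frames))  # ['A', 'B', None]
```

A real system would replace the toy scores with outputs of face tracking, sound source localization, and speaker identification modules, as the quoted description of Yoshioka et al. [22] indicates.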
“…After that, the approach uses face verification to classify whether the recognized face belongs to a specific individual. However, this requires prior knowledge such as the number and images of the speakers; see also Chung et al. [24, 27]. To solve this problem, this step was replaced by the clustering of face tracks in the latest version [25].…”
Section: Multi-modal Speaker Diarization
confidence: 99%
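The replacement described above (clustering face tracks instead of verifying faces against enrolled identities, so the number of speakers need not be known in advance) can be sketched with a simple leader-clustering pass over face-track embeddings. The embeddings, distance threshold, and greedy scheme below are illustrative assumptions, not the clustering used in the cited work.

```python
# Hedged sketch: unsupervised clustering of face-track embeddings
# removes the need for enrollment images or a known speaker count.
# Greedy "leader" clustering: a track joins the first cluster whose
# representative is within `threshold`, otherwise it founds a new one.
import math

def cluster_tracks(embeddings, threshold=0.5):
    """embeddings: list of embedding vectors (tuples of floats).
    Returns a cluster label per track; label count = speakers found."""
    centroids, labels = [], []
    for e in embeddings:
        for i, c in enumerate(centroids):
            if math.dist(e, c) < threshold:
                labels.append(i)
                break
        else:
            centroids.append(e)          # first member represents the cluster
            labels.append(len(centroids) - 1)
    return labels

tracks = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.95, 1.1)]
print(cluster_tracks(tracks))  # [0, 0, 1, 1] -> two speakers discovered
```

In practice one would use learned face embeddings and a stronger method such as agglomerative clustering, but the point stands: the speaker inventory emerges from the data rather than from prior enrollment.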
“…Thus, the most reliable and least intrusive way of data collection uses one 360° room camera and a room microphone. Recent studies often leveraged a multi-stage approach when incorporating video data, rather than using a single end-to-end model [12][13][14][15][16][17]. The method proposed by Yoshioka et al. [12] uses face tracking and identification, sound source localization, and speaker identification, yet it requires multi-channel audio input.…”
Section: Multi-modal Speaker Diarization
confidence: 99%
“…After that, the approach uses face verification to classify whether the recognized face belongs to a specific individual. However, this requires prior knowledge such as the number and images of the speakers; see also Chung et al. [14, 17]. To solve this problem, this step was replaced by the clustering of face tracks in the latest version [15].…”
Section: Multi-modal Speaker Diarization
confidence: 99%
“…McDorman states that the context of what is verbally discussed in a meeting is important [22]. Previous work on verbal information analysis in meetings often focuses on creating transcriptions of the meetings [23], [24], [25].…”
Section: A Meeting Behavior Analysis
confidence: 99%