ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020
DOI: 10.1109/icassp40776.2020.9053171

Multimodal Active Speaker Detection and Virtual Cinematography for Video Conferencing

Abstract: Active speaker detection (ASD) and virtual cinematography (VC) can significantly improve the remote user experience of a video conference by automatically panning, tilting, and zooming a video conferencing camera: users subjectively rate an expert video cinematographer's video significantly higher than unedited video. We describe a new automated ASD and VC system that performs within 0.3 MOS of an expert cinematographer based on subjective ratings on a 1-5 scale. This system uses a 4K wide-FOV camera, a depth cam…

Citations: cited by 8 publications (2 citation statements)
References: 12 publications
“…It has a long history in computer vision [13]. ASD is used in various applications such as automatic video annotation [64], video conferencing [1], human-robot interactions, speech recognition, speaker diarization or re-framing video [11], speech transcription, speech enhancement [13], tracking storylines and characters in narrative content [13], [15] and facilitate the mining of training data for modeling speaker identification [16], [17]. In video conferencing, ASD allows the far-end participants to see who is currently speaking, which is especially useful when the conference room is large or the remote video is rendered on a small display due to small screen size, small render size, or limited bandwidth [1], [13].…”
Section: Introduction (mentioning)
confidence: 99%
“…Active Speaker Detection (ASD) refers to the task of identifying when each visible person is speaking in a video, typically through careful joint analysis of face motion and voices. It has a wide range of modern practical applications, such as video re-targeting [14], What marks an active speaker out from others? Admittedly, face motion and its synchrony with the audio are the most obvious clues; however, as shown in the figure above, the underlying signals can be highly ambiguous, especially in hard scenarios with poorly lit, low-resolution faces and noisy acoustics, etc.…”
Section: Introduction (mentioning)
confidence: 99%