Advances in Online Audio-Visual Meeting Transcription
Preprint, 2019
DOI: 10.48550/arxiv.1912.04979

Cited by 9 publications (4 citation statements). References 50 publications.
“…In contrast, utilizing a multi-modal approach by adding video data has been shown to outperform audio-only analysis in the field of speech processing [21]. Thus, recent studies have often leveraged a multi-stage approach and incorporated video data, rather than using a single end-to-end model [22][23][24][25][26][27]. The method proposed by Yoshioka et al. [22] uses face tracking and identification, sound source localization, and speaker identification, yet it requires multi-channel audio input.…”
Section: Multi-modal Speaker Diarization
Mentioning confidence: 99%
“…Thus, recent studies have often leveraged a multi-stage approach and incorporated video data, rather than using a single end-to-end model [22][23][24][25][26][27]. The method proposed by Yoshioka et al. [22] uses face tracking and identification, sound source localization, and speaker identification, yet it requires multi-channel audio input. Another method, initially introduced by Nagrani et al. [23], first performs face detection and tracking, and then uses active speaker detection (ASD) to determine the synchronization between the mouth movement and speech in the video to identify the speaker.…”
Section: Multi-modal Speaker Diarization
Mentioning confidence: 99%
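The multi-stage pipeline described in these excerpts (detect and track faces, run active speaker detection to tie mouth movement to speech, then assign speech regions to the active face) can be illustrated with a short sketch. The following is a minimal, single-channel illustration with hypothetical stub functions (detect_and_track_faces, active_speaker_score); it is not the implementation of Yoshioka et al. [22] or Nagrani et al. [23], and it omits the sound source localization and speaker identification stages of the multi-channel system.

```python
# Minimal sketch of a multi-stage audio-visual diarization pipeline
# (face detection/tracking -> active speaker detection -> speaker
# assignment). All helpers are hypothetical stand-ins.
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np


@dataclass
class FaceTrack:
    track_id: int
    frames: np.ndarray      # (T, H, W, 3) cropped face frames
    timestamps: np.ndarray  # (T,) frame times in seconds


def detect_and_track_faces(video: np.ndarray, fps: float) -> List[FaceTrack]:
    """Stub for the face detection + tracking stage."""
    # A real system links per-frame face detections into tracks; here a
    # single dummy track is returned purely for illustration.
    n_frames = video.shape[0]
    return [FaceTrack(0, video[:, :64, :64, :], np.arange(n_frames) / fps)]


def active_speaker_score(track: FaceTrack, audio: np.ndarray, sr: int) -> np.ndarray:
    """Stub for active speaker detection (ASD)."""
    # A real ASD model scores the synchronization between mouth movement and
    # speech; this stub samples smoothed audio energy at the frame times.
    win = max(1, sr // 10)
    energy = np.convolve(audio ** 2, np.ones(win) / win, mode="same")
    idx = np.clip((track.timestamps * sr).astype(int), 0, len(energy) - 1)
    return energy[idx]


def _mask_to_segments(mask: np.ndarray, times: np.ndarray) -> List[Tuple[float, float]]:
    """Turn a boolean per-frame mask into (start, end) time segments."""
    starts = np.where(np.diff(mask.astype(int), prepend=0) == 1)[0]
    ends = np.where(np.diff(mask.astype(int), append=0) == -1)[0]
    return [(times[s], times[e]) for s, e in zip(starts, ends)]


def diarize(video: np.ndarray, audio: np.ndarray, fps: float, sr: int,
            threshold: float = 0.5) -> List[Tuple[float, float, int]]:
    """Assign speech regions to face tracks via per-frame ASD scores."""
    segments = []
    for track in detect_and_track_faces(video, fps):
        scores = active_speaker_score(track, audio, sr)
        active = scores > threshold * scores.max()
        for start, end in _mask_to_segments(active, track.timestamps):
            segments.append((start, end, track.track_id))
    return sorted(segments)


# Dummy usage: 2 s of 25 fps video and 16 kHz audio.
video = np.random.rand(50, 128, 128, 3)
audio = np.random.randn(2 * 16000)
print(diarize(video, audio, fps=25.0, sr=16000))
```

A real system would substitute trained face-tracking and ASD models for the stubs; the point of the sketch is the stage boundaries and the final assignment of time segments to face tracks.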
“…Many recent source separation systems assume that the number of active sources in a mixture is known in advance during both the training and inference phases [1][2][3][4][5][6][7][8][9][10][11][12][13][14]. Such an assumption can be valid when additional information, such as visual cues or source locations, is available [15,16]; however, for a general blind source separation system it is typically not straightforward to obtain such information, especially in the inference phase. In problems such as the separation of shorter streams or chunks in a long mixture, e.g.…”
Section: Introduction
Mentioning confidence: 99%
“…In this paper, we propose a simple training method based on the fixed-output assumption by designing proper training targets for the invalid outputs. We adopt the fixed-output-number assumption because in real-world conversations, such as meeting scenarios, the maximum number of simultaneously active speakers is almost always fewer than three [15,24], so a maximum number of speakers can typically be pre-assumed. Instead of using low-energy auxiliary targets for the invalid outputs, we use the mixture itself as the auxiliary target, forcing the invalid outputs to perform autoencoding.…”
Section: Introduction
Mentioning confidence: 99%
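The mixture-as-auxiliary-target idea in the last excerpt can be made concrete with a small training-loss sketch. This is a hedged illustration assuming a PyTorch setup and a hypothetical helper name (pit_loss_with_mixture_targets); it is not the cited paper's exact recipe. A separator always emits max_spk outputs; when fewer reference sources are active, the surplus ("invalid") outputs are given the input mixture as their target, so they learn to autoencode it, and the loss stays permutation-invariant over all outputs.

```python
# Sketch of permutation-invariant training where surplus separator outputs
# are trained to reconstruct (autoencode) the mixture instead of using
# low-energy targets. Names and details are illustrative assumptions.
from itertools import permutations
import torch


def si_snr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SNR for (..., T) signals, in dB."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    noise = est - proj
    return 10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)


def pit_loss_with_mixture_targets(estimates: torch.Tensor,
                                  sources: torch.Tensor,
                                  mixture: torch.Tensor) -> torch.Tensor:
    """estimates: (max_spk, T); sources: (n_src, T) with n_src <= max_spk;
    mixture: (T,). Invalid outputs get the mixture itself as their target."""
    max_spk, n_src = estimates.shape[0], sources.shape[0]
    pad = mixture.unsqueeze(0).expand(max_spk - n_src, -1)  # mixture targets
    targets = torch.cat([sources, pad], dim=0)              # (max_spk, T)
    # Permutation-invariant loss over all max_spk output/target pairings.
    losses = [-si_snr(estimates[list(perm)], targets).mean()
              for perm in permutations(range(max_spk))]
    return torch.stack(losses).min()


# Dummy usage: 3 output channels, 2 active sources in a 1 s mixture at 16 kHz.
sources = torch.randn(2, 16000)
mixture = sources.sum(dim=0)
estimates = torch.randn(3, 16000, requires_grad=True)
loss = pit_loss_with_mixture_targets(estimates, sources, mixture)
loss.backward()
print(float(loss))
```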