2012 IEEE 14th International Workshop on Multimedia Signal Processing (MMSP) 2012
DOI: 10.1109/mmsp.2012.6343426

Incorporation of the ASR output in speaker segmentation and clustering within the task of speaker diarization of broadcast streams

citations
Cited by 15 publications
(11 citation statements)
references
References 7 publications
“…Standard metrics, such as the Word Error Rate (WER) and the Diarization Error Rate (DER) used in ASR and in diarization, respectively, are useful during modeling in order to have benchmarks and quantifiable areas of improvement. However, they do not necessarily reflect the transcript quality from a user's perspective (Silovsky, Zdansky, Nouza, Cerva, & Prazak, 2012) and they are not always representative of the performance with respect to semantics and to clinical impact (Miner et al, 2020). Qualitative surveys where experts share their opinions on the accuracy of the system output could assist highlighting specific areas of clinical importance on which the modeling efforts should focus.…”
Section: Metric
Citation type: mentioning
confidence: 99%
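For readers unfamiliar with the first metric named in this statement, the following is a minimal sketch of WER computed as word-level edit distance divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; they are not taken from the cited papers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()  # assumption: simple whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

Because the score only counts word edits, two transcripts with the same WER can differ considerably in how usable they are, which is the limitation the statement points to.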
“…The LVCSR system used for the transcription of archive documents employs a two-pass strategy. The output of the first decoder pass is used for a) segmentation to speech and non-speech parts, b) synchronization of speaker change detector with word and noise boundaries [10], c) speaker clustering [11], and d) speaker adaptation via the CMLLR technique [12]. The first pass is usually performed with a smaller lexicon to reduce computational costs and time.…”
Section: Speech Transcription System
Citation type: mentioning
confidence: 99%
“…A special label (COM) is assigned to those Czech and Slovak words that share the same orthography and pronunciation. For each speech segment (determined by the speaker change point detector [10]), we get the numbers of recognized Czech words (N_CZ), Slovak words (N_SK) and the common ones (N_COM). The utterance in the segment is identified as Czech or Slovak according to the higher of counts N_CZ and N_SK.…”
Section: LVCSR With Merged Lexicons and Language Models
Citation type: mentioning
confidence: 99%
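The counting rule quoted in this statement can be sketched as follows. This is an illustration under stated assumptions, not the citing authors' implementation: the function name `identify_segment_language`, the label strings `"CZ"` and `"SK"`, and the tie-breaking choice are hypothetical; only the `COM` label and the "higher count wins" rule come from the quoted text.

```python
from collections import Counter
from typing import Iterable, Tuple


def identify_segment_language(labels: Iterable[str]) -> Tuple[str, Tuple[int, int, int]]:
    """Decide Czech vs. Slovak for one speech segment.

    `labels` holds one language label per recognized word:
    "CZ", "SK" (assumed names) or "COM" for words common to both lexicons.
    """
    counts = Counter(labels)
    n_cz, n_sk, n_com = counts["CZ"], counts["SK"], counts["COM"]

    # Common words are counted but carry no evidence for either language.
    # Assumption: a tie goes to Czech; the quoted text does not say how
    # ties are resolved.
    language = "Czech" if n_cz >= n_sk else "Slovak"
    return language, (n_cz, n_sk, n_com)
```

For example, a segment labeled `["CZ", "COM", "SK", "SK"]` would be identified as Slovak, since the two Slovak-only words outnumber the single Czech-only word.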
“…For example, the transcription of the input audio provides a strong clue to estimate the utterance boundaries. Several works were proposed to combine the automatic speech recognition (ASR) with speaker diarization, such as using the word boundary information from ASR [18,19] or improving the speaker segmentation and clustering based on the information from ASR [20,21]. While these works showed promising results, the ASR and speaker diarization models were separately trained.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
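One simple way to picture the use of ASR word boundaries mentioned in these statements is to move each candidate speaker change point to the nearest word boundary, so that a speaker turn never starts in the middle of a word. The sketch below is only an assumption about how such synchronization could look; the function name `snap_to_boundary` and the nearest-boundary rule are hypothetical and are not described in the quoted text.

```python
from bisect import bisect_left
from typing import Sequence


def snap_to_boundary(change_point: float, boundaries: Sequence[float]) -> float:
    """Return the word boundary (in seconds) closest to a candidate change point.

    `boundaries` must be sorted in ascending order, e.g. word start and end
    times taken from a first-pass ASR output.
    """
    if not boundaries:
        raise ValueError("at least one word boundary is required")

    # Index of the first boundary that is >= change_point.
    i = bisect_left(boundaries, change_point)
    # Only the boundary just before and just after can be the nearest one.
    candidates = boundaries[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda b: abs(b - change_point))
```

With boundaries at `[0.0, 1.2, 2.5, 4.0]`, a candidate change point at 2.4 s would be moved to 2.5 s.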