Interspeech 2017
DOI: 10.21437/interspeech.2017-650
Improving Speech Recognizers by Refining Broadcast Data with Inaccurate Subtitle Timestamps

Abstract: This paper proposes an automatic method to refine broadcast data collected every week for efficient acoustic model training. For training acoustic models, we use only audio signals, subtitle texts, and subtitle timestamps accompanied by recorded broadcast programs. However, the subtitle timestamps are often inaccurate due to inherent characteristics of closed captioning. In the proposed method, we remove subtitle texts with low subtitle quality index, concatenate adjacent subtitle texts into a merged subtitle …

Cited by 8 publications (6 citation statements)
References 10 publications
“…The data refinement method is mainly studied for audiobook [8]- [11], or broadcast data [3]- [5], [7]. Here, the broadcast data is easily used for data refinement, because it contains a lot of conventional speech generated by various speakers in diverse environments, and occasionally provides subtitle texts and their timestamps.…”
Section: Previous Methods
confidence: 99%
“…In this paper, we set the minimum length of consecutive matching words to 5, because the same words or word sequences that disturb text alignment frequently appear in broadcast programs. The method of selecting an anchor among the anchor candidates is the difference from the previous paper [7] that simply selected the longest matching subsequence as the anchor. The previous method possibly set the anchor at the unfortunate point because the same words and word sequences frequently appear in the broadcast data.…”
Section: Text Alignment
confidence: 99%
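The statement above describes anchor candidates as runs of at least five consecutive matching words between subtitle text and a recognition hypothesis. A minimal sketch of that candidate extraction is below; the use of `difflib.SequenceMatcher` and the sample word sequences are illustrative assumptions, not the cited system's implementation, and the subsequent step of choosing one anchor among the candidates is not shown:

```python
from difflib import SequenceMatcher

MIN_ANCHOR_LEN = 5  # minimum run of consecutive matching words, per the cited setting


def anchor_candidates(subtitle_words, hypothesis_words, min_len=MIN_ANCHOR_LEN):
    """Return matching word runs of at least min_len words as
    (subtitle_index, hypothesis_index, length) tuples."""
    matcher = SequenceMatcher(a=subtitle_words, b=hypothesis_words, autojunk=False)
    return [(m.a, m.b, m.size)
            for m in matcher.get_matching_blocks()
            if m.size >= min_len]


# Hypothetical example: the 2-word run "lazy dog" also matches, but is
# discarded because short repeats can place an anchor at a bad point.
sub = "the quick brown fox jumps over the lazy dog again".split()
hyp = "uh the quick brown fox jumps over a lazy dog".split()
print(anchor_candidates(sub, hyp))  # → [(0, 1, 6)]
```

Filtering out short matches addresses the problem the citation raises: words and word sequences that recur frequently in broadcast data would otherwise produce spurious alignment points.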
“…We used about 1000 h of Korean broadcast data [25,31] for the subword unit derivation and the acoustic model training. This database was automatically constructed from broadcast audio and their subtitle text using a lightly supervised approach [32].…”
Section: Methods
confidence: 99%
“…Each segment is extracted using the forced alignment algorithm [24], which is commonly applied to acoustic model training. Here, we used an acoustic model with a deep neural network (DNN) structure, as applied in the previous work [25], and 40-dimensional log-Mel filter-bank (FBank) features are spliced over time, with a context size of 15 frames (±7 frames). The extracted segments, which total about 50 million, consist of different numbers of frames, even if they have the same phoneme label.…”
Section: Segment Extraction
confidence: 99%
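The feature setup quoted above splices 40-dimensional log-Mel FBank features over a context of 15 frames (±7 frames). A minimal NumPy sketch of such splicing follows; the edge-replication padding at utterance boundaries is an assumption for illustration, not necessarily how the cited work handles edges:

```python
import numpy as np


def splice_frames(feats, context=7):
    """Stack each frame with its +/- context neighbors, turning a (T, D)
    feature matrix into (T, (2*context + 1) * D). Boundary frames are
    padded by repeating the first/last frame (an assumed convention)."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    T = feats.shape[0]
    window = 2 * context + 1
    return np.stack([padded[t:t + window].reshape(-1) for t in range(T)])


fbank = np.random.randn(100, 40)   # 100 frames of 40-dim log-Mel FBank features
spliced = splice_frames(fbank)     # 15-frame context window (+/- 7 frames)
print(spliced.shape)               # → (100, 600)
```

Each spliced vector thus has 15 × 40 = 600 dimensions, matching the context size stated in the citation.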