Interspeech 2018
DOI: 10.21437/interspeech.2018-2057
Subword and Crossword Units for CTC Acoustic Models

Abstract: This paper proposes a novel approach to create a unit set for CTC-based speech recognition systems. By using Byte-Pair Encoding we learn a unit set of arbitrary size on a given training text. In contrast to using characters or words as units, this allows us to find a good trade-off between the size of our unit set and the available training data. We investigate both crossword units, which may span multiple words, and subword units. By evaluating these unit sets with decoding methods using a separate language model…
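As a rough illustration of the unit-learning step described in the abstract, the sketch below learns character-level BPE merges from a word-frequency table. This is a minimal, hypothetical example (the function name `learn_bpe` and the toy corpus are illustrative, not the authors' implementation); the crossword variant in the paper would additionally keep word-boundary symbols so that merges can span adjacent words.

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Learn BPE merge operations from a word -> frequency table.

    Words start as sequences of characters; each iteration merges the
    most frequent adjacent symbol pair into a new, longer unit.
    """
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break  # every word is already a single unit
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = count
        vocab = new_vocab
    return merges

# Toy example: the size of the resulting unit set is controlled by the number of merges.
toy_counts = {"lower": 5, "low": 7, "newest": 6, "widest": 3}
print(learn_bpe(toy_counts, 10))
```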

Cited by 28 publications (23 citation statements) · References 31 publications
“…The improvements from the addition of a LM, however, vanish as the number of units increases. This behavior, which we also observe in [12], is consistent in MTL models.…”
Section: Evaluation Of Multitask Learning Architectures (supporting)
confidence: 90%
“…On the other hand, bigger units are less flexible but usually easier and faster to decode. In [12] we observe that each type of target unit contributes differently to the final output. As a possible solution, Chan et al. and Liu et al. propose to learn the best possible decomposition of units [13,14].…”
Section: Related Work (mentioning)
confidence: 84%
“…However, these models have to be very deep (e.g., 17-19 convolutional layers on LibriSpeech [23]) to cover the same context (Table 1). While in theory a relatively local context could suffice for ASR, this is complicated by alphabets L which violate the conditional independence assumption of CTC (e.g., English characters [36]). Wide contexts also enable incorporation of noise/speaker contexts, as [27] suggest regarding the broad-context attention heads in the first layer of their self-attentional LAS model.…”
Section: Recurrent and Convolutional Models (mentioning)
confidence: 99%
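The depth-versus-context point in the excerpt above can be made concrete with a small receptive-field calculation. The sketch below assumes plain 1-D convolutions with a uniform kernel size and stride and no dilation, which the cited LibriSpeech models need not match; it is meant only to show why covering a wide acoustic context requires many layers.

```python
def receptive_field(num_layers, kernel_size=3, stride=1):
    """Receptive field (in input frames) of a stack of identical,
    non-dilated 1-D convolutional layers."""
    rf, jump = 1, 1
    for _ in range(num_layers):
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

# With kernel size 3 and stride 1, even 17 layers only see 35 frames,
# i.e. about 0.35 s at a 10 ms frame shift.
for layers in (5, 17, 19):
    print(layers, "layers ->", receptive_field(layers), "frames")
```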
“…The Enc-Dec+CTC model is comparable, taking almost a week on an older GPU (GTX 1080 Ti) to do its ∼12.5 full passes over the data. Finally, we trained the same model with BPE subwords as CTC targets, to get more context-independent units [36]. We did 300 merge operations (10k was unstable) and attained a CER of 7.4%.…”
Section: Model (mentioning)
confidence: 99%
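To obtain BPE-subword CTC targets as in the excerpt above, each transcript word is segmented by replaying the learned merge operations in order. The sketch below uses a hand-picked toy merge list purely for illustration (a real setup would use the roughly 300 merges learned from the training text); it is not the cited work's exact pipeline.

```python
def apply_bpe(word, merges):
    """Segment a word into subword units by replaying learned merges in order."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Toy merges (illustrative only); each transcript word becomes a sequence
# of subword units that serve as CTC targets.
toy_merges = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]
print(apply_bpe("lowest", toy_merges))  # ['low', 'est']
```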
“…Recently, notable progress has been made towards building direct A2W models using CTC [8,9,10,11], but these either require large training data [9,10,12] or a smaller vocabulary [1,8,11]. In this paper, we present one such approach using no more than 300 hours of training data, but with a S2S model instead of a CTC model.…”
Section: Introduction (mentioning)
confidence: 99%