Subword and Crossword Units for CTC Acoustic Models

Zenkel, Thomas; Sanabria, Ramon; Metze, Florian; Waibel, Alex

doi:10.21437/interspeech.2018-2057

Cited by 28 publications

(23 citation statements)

References 31 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…The improvements of the addition of a LM, however, banish as the number of units increases. This behavior that we also observe in [12], is consistent in MTL models.…”

Section: Evaluation Of Multitask Learning Architecturessupporting

confidence: 90%

See 1 more Smart Citation

Hierarchical Multitask Learning With CTC

Sanabria

Metze

2018

2018 IEEE Spoken Language Technology Workshop (SLT)

Self Cite

View full text Add to dashboard Cite

In Automatic Speech Recognition, it is still challenging to learn useful intermediate representations when using high-level (or abstract) target units such as words. For that reason, when only a few hundreds of hours of training data are available, character or phoneme-based systems tend to outperform word-based systems. In this paper, we show how Hierarchical Multitask Learning can encourage the formation of useful intermediate representations. We achieve this by performing Connectionist Temporal Classification at different levels of the network with targets of different granularity. Our model thus performs predictions in multiple scales for the same input. On the standard 300h Switchboard training setup, our hierarchical multitask architecture demonstrates improvements over singletask architectures with the same number of parameters. Our model obtains 14.0% Word Error Rate on the Switchboard subset of the Eval2000 test set without any decoder or language model, outperforming the current state-of-the-art on non-autoregressive Acousticto-Word models. Index Termshierarchical multitask learning, ASR, CTC BiLSTM … CTC Loss BiLSTM n < l a t e x i t s h a 1 _ b a s e 6 4 = " I q W 2 N 3 6 Q k t B w n t G Q a v L h E 1 D L r r E = " > A A A B 6 H i c b V B N S 8 N A E J 3 U r 1 q / q h 6 9 L B b B U 0 l E q M e i F 4 8 t 2 A 9 o Q 9 l s J + 3 a z S b s b o Q S + g u 8 e F D E q z / J m / / G b Z u D t j 4 Y e L w 3 w 8 y 8 I B F c G 9 f 9 d g o b m 1 v b O 8 X d 0 t 7 + w e A m J v O v y Z A r Z E Z M L a F M c X s r Y W O q K D M 2 m 5 I N w V t 9 e Z 2 0 r 6 q e W / W a 1 5 X 6 b R 5 H E c 7 g H C 7 B g x r U 4 R 4 a 0 A I G C M / w C m / O o / P i v D s f y 9 a C k 8 + c w h 8 4 n z / X E Y z y < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " I q W 2 N 3 6 Q k t B w n t G Q a v L h E 1 D L r r E = " > A A A B 6 H i c b V B N S 8 N A E J 3 U r 1 q / q h 6 9 L B b B U 0 l E q M e i F 4 8 t 2 A 9 o Q 9 l s J + 3 a z S b s b o Q S + g u 8 e F D E q z / J m / / G b Z u D t j 4 Y e L w 3 w 8 y 8 I B F c G 9 f 9 d g o b m 1 v b O 8 X d 0 t 7 + w e A m J v O v y Z A r Z E Z M L a F M c X s r Y W O q K D M 2 m 5 I N w V t 9 e Z 2 0 r 6 q e W / W a 1 5 X 6 b R 5 H E c 7 g H C 7 B g x r U 4 R 4 a 0 A I G C M / w C m / O o / P i v D s f y 9 a C k 8 + c w h 8 4 n z / X E Y z y < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = "n 3 5 n 2 s u q p 5 6 9 L O 4 I + 8 z x 8 4 x I o 4 < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " N t n m D L J 1 N g R z 1 8 Q x o C q I Y j L d 6 r c = " > A A A B 3 X i c b Z B L S w M x F I X v + K y 1 a n X r J l g E V 2 X G j S 4 F N y 5 b s A 9 o h 5 J J 7 7 S x m c y Q 3 B H K 0 F / g x o U i / i 1 3 / h v T x 0 J b D w Q + z k n I v S f K l L T k + 9 / e 1 v b O 7 t 5 + 6 a B 8 W D k 6 P q m e V t o 2 z Y 3 A l k h V a r o R t 6 i k x h Z J U t j N D P I k U t i J J v f z v P O M x s p U P 9 I 0 w z D h I y 1 j K T g 5 q 6 k H 1 Z p f 9 x d i m x C s o A Y r N Q b V r / 4 w F X m C m o T i 1 v Y C P 6 O w 4 I a k U D g r 9 7 v F w V P r J 0 m k b u Z c B r b ...

show abstract

“…The improvements of the addition of a LM, however, banish as the number of units increases. This behavior that we also observe in [12], is consistent in MTL models.…”

Section: Evaluation Of Multitask Learning Architecturessupporting

confidence: 90%

“…On the other hand, big units are less flexible and usually easier and faster to decode. In [12] we observe that each type of target unit contributes differently to the final output. As a possible solution, Chan et al and Liu et al propose to learn the best possible decomposition of units [13,14].…”

Section: Related Workmentioning

confidence: 84%

Hierarchical Multitask Learning With CTC

Sanabria

Metze

2018

2018 IEEE Spoken Language Technology Workshop (SLT)

Self Cite

View full text Add to dashboard Cite

show abstract

“…However, these models have to be very deep (e.g., 17-19 convolutional layers on LibriSpeech [23]) to cover the same context (Table 1). While in theory, a relatively local context could suffices for ASR, this is complicated by alphabets L which violate the conditional independence assumption of CTC (e.g., English characters [36]). Wide contexts also enable incorporation of noise/speaker contexts, as [27] suggest regarding the broad-context attention heads in the first layer of their self-attentional LAS model.…”

Section: Recurrent and Convolutional Modelsmentioning

confidence: 99%

“…2 The Enc-Dec+CTC model is comparable, taking almost a week on an older GPU (GTX 1080 Ti) to do its ∼12.5 full passes over the data. 3 Finally, we trained the same model with BPE subwords as CTC targets, to get more context-independent units [36]. We did 300 merge operations (10k was unstable) and attained a CER of 7.4%.…”

Section: Modelmentioning

confidence: 99%

Self-attention Networks for Connectionist Temporal Classification in Speech Recognition

Salazar

Kirchhoff

Huang

2019

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

125

View full text Add to dashboard Cite

The success of self-attention in NLP has led to recent applications in end-to-end encoder-decoder architectures for speech recognition. Separately, connectionist temporal classification (CTC) has matured as an alignment-free, non-autoregressive approach to sequence transduction, either by itself or in various multitask and decoding frameworks. We propose SAN-CTC, a deep, fully self-attentional network for CTC, and show it is tractable and competitive for end-toend speech recognition. SAN-CTC trains quickly and outperforms existing CTC models and most encoder-decoder models, with character error rates (CERs) of 4.7% in 1 day on WSJ eval92 and 2.8% in 1 week on LibriSpeech test-clean, with a fixed architecture and one GPU. Similar improvements hold for WERs after LM decoding. We motivate the architecture for speech, evaluate position and downsampling approaches, and explore how label alphabets (character, phoneme, subword) affect attention heads and performance.

show abstract

“…Recently, notable progress has been made towards building direct A2W models using CTC [8,9,10,11] but it either requires large training data [9,10,12] or smaller vocabulary [1, 8,11]. In this paper, we present one such approach using no more than 300 hours of training data but with a S2S model instead of a CTC model.…”

Section: Introductionmentioning

confidence: 99%

Acoustic-to-Word Recognition with Sequence-to-Sequence Models

Palaskar

Metze

2018

2018 IEEE Spoken Language Technology Workshop (SLT)

Self Cite

View full text Add to dashboard Cite

Acoustic-to-Word recognition provides a straightforward solution to end-to-end speech recognition without needing external decoding, language model re-scoring or lexicon. While character-based models offer a natural solution to the out-ofvocabulary problem, word models can be simpler to decode and may also be able to directly recognize semantically meaningful units. We present effective methods to train Sequenceto-Sequence models for direct word-level recognition (and character-level recognition) and show an absolute improvement of 4.4-5.0% in Word Error Rate on the Switchboard corpus compared to prior work. In addition to these promising results, word-based models are more interpretable than character models, which have to be composed into words using a separate decoding step. We analyze the encoder hidden states and the attention behavior, and show that locationaware attention naturally represents words as a single speechword-vector, despite spanning multiple frames in the input. We finally show that the Acoustic-to-Word model also learns to segment speech into words with a mean standard deviation of 3 frames as compared with human annotated forcedalignments for the Switchboard corpus.

show abstract

Subword and Crossword Units for CTC Acoustic Models

Cited by 28 publications

References 31 publications

Hierarchical Multitask Learning With CTC

Hierarchical Multitask Learning With CTC

Self-attention Networks for Connectionist Temporal Classification in Speech Recognition

Acoustic-to-Word Recognition with Sequence-to-Sequence Models

Contact Info

Product

Resources

About