Techniques for multi-lingual and cross-lingual speech recognition can help in low-resource scenarios, both to bootstrap systems and to enable the analysis of new languages and domains. End-to-end approaches, in particular sequence-based techniques, are attractive because of their simplicity and elegance. While it is possible to integrate traditional multi-lingual bottleneck feature extractors as front-ends, we show that end-to-end multi-lingual training of sequence models is effective on context-independent models trained using Connectionist Temporal Classification (CTC) loss. Our model improves performance on Babel languages by over 6% absolute in terms of word/phoneme error rate when compared to mono-lingual systems built in the same setting. We also show that the trained model can be adapted cross-lingually to an unseen language using just 25% of the target data. Training on multiple languages is important for very low-resource cross-lingual target scenarios, but not for multi-lingual testing scenarios, where it appears beneficial to include large, well-prepared datasets.
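To make the multi-lingual CTC setup concrete, the sketch below shows one plausible realization, assuming a shared BiLSTM encoder with one context-independent softmax head per language and PyTorch's CTC loss; the layer sizes, language names, and vocabulary sizes are illustrative placeholders, not the paper's exact configuration.

```python
# Minimal sketch of multi-lingual CTC training: a shared BiLSTM encoder with
# one softmax head per language, trained with torch.nn.CTCLoss.
# Languages, sizes, and batching below are assumptions for illustration.
import torch
import torch.nn as nn

class MultilingualCTC(nn.Module):
    def __init__(self, feat_dim, hidden, vocab_sizes):
        super().__init__()
        # Shared acoustic encoder across all languages.
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=4,
                               bidirectional=True, batch_first=True)
        # One context-independent output layer (incl. CTC blank) per language.
        self.heads = nn.ModuleDict({
            lang: nn.Linear(2 * hidden, v) for lang, v in vocab_sizes.items()
        })

    def forward(self, feats, lang):
        enc, _ = self.encoder(feats)                  # (B, T, 2*hidden)
        return self.heads[lang](enc).log_softmax(-1)  # (B, T, V_lang)

model = MultilingualCTC(feat_dim=40, hidden=320,
                        vocab_sizes={"tagalog": 50, "turkish": 46})
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

# One training step on a hypothetical Tagalog mini-batch.
feats = torch.randn(8, 200, 40)                      # (B, T, feat_dim)
targets = torch.randint(1, 50, (8, 30))              # phoneme IDs, 0 = blank
log_probs = model(feats, "tagalog").transpose(0, 1)  # CTCLoss expects (T, B, V)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((8,), 200),
           target_lengths=torch.full((8,), 30))
loss.backward()
```

Under this setup, cross-lingual adaptation to an unseen language would amount to attaching a new output head and fine-tuning on the small amount of target-language data.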
In Automatic Speech Recognition, it is still challenging to learn useful intermediate representations when using high-level (or abstract) target units such as words. For that reason, when only a few hundred hours of training data are available, character- or phoneme-based systems tend to outperform word-based systems. In this paper, we show how Hierarchical Multitask Learning can encourage the formation of useful intermediate representations. We achieve this by performing Connectionist Temporal Classification (CTC) at different levels of the network with targets of different granularity, so the model makes predictions at multiple scales for the same input. On the standard 300h Switchboard training setup, our hierarchical multitask architecture demonstrates improvements over single-task architectures with the same number of parameters. Our model obtains 14.0% Word Error Rate on the Switchboard subset of the Eval2000 test set without any decoder or language model, outperforming the current state-of-the-art among non-autoregressive Acoustic-to-Word models.
Index Terms: hierarchical multitask learning, ASR, CTC
[Figure: stacked BiLSTM encoder with CTC losses applied at different levels of the network.]
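A minimal sketch of hierarchical multitask CTC training follows, assuming phoneme targets supervise an intermediate layer while subword targets supervise the top of the stack; the layer split, unit inventories, and the 0.5 auxiliary-loss weight are assumptions for illustration.

```python
# Hierarchical multitask CTC sketch: fine-grained (phoneme) CTC on a lower
# BiLSTM block, coarse-grained (subword) CTC on the upper block; the total
# loss is a weighted sum of the two. Sizes and weights are placeholders.
import torch
import torch.nn as nn

class HierarchicalCTC(nn.Module):
    def __init__(self, feat_dim, hidden, n_phones, n_subwords):
        super().__init__()
        self.lower = nn.LSTM(feat_dim, hidden, num_layers=3,
                             bidirectional=True, batch_first=True)
        self.upper = nn.LSTM(2 * hidden, hidden, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.phone_head = nn.Linear(2 * hidden, n_phones)      # fine-grained
        self.subword_head = nn.Linear(2 * hidden, n_subwords)  # coarse-grained

    def forward(self, feats):
        low, _ = self.lower(feats)
        high, _ = self.upper(low)
        return (self.phone_head(low).log_softmax(-1),
                self.subword_head(high).log_softmax(-1))

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
model = HierarchicalCTC(feat_dim=40, hidden=320, n_phones=46, n_subwords=300)

feats = torch.randn(4, 300, 40)
phone_tgt = torch.randint(1, 46, (4, 60))
subword_tgt = torch.randint(1, 300, (4, 25))
p_logp, s_logp = model(feats)

in_len = torch.full((4,), 300)
loss = ctc(s_logp.transpose(0, 1), subword_tgt, in_len, torch.full((4,), 25)) \
     + 0.5 * ctc(p_logp.transpose(0, 1), phone_tgt, in_len, torch.full((4,), 60))
loss.backward()
```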
Connectionist Temporal Classification (CTC) has recently attracted a lot of interest as it offers an elegant approach to building acoustic models (AMs) for speech recognition. The CTC loss function maps an input sequence of observable feature vectors to an output sequence of symbols. Output symbols are conditionally independent of each other under CTC loss, so a language model (LM) can be incorporated conveniently during decoding, retaining the traditional separation of acoustic and linguistic components in ASR. For fixed vocabularies, Weighted Finite State Transducers (WFSTs) provide a strong baseline for efficient integration of CTC AMs with n-gram LMs. Character-based neural LMs provide a straightforward solution for open-vocabulary speech recognition and all-neural models, and can be decoded with beam search. Finally, sequence-to-sequence models can be used to translate a sequence of individual sounds into a word string. We compare the performance of these three approaches and analyze their error patterns, which provides insightful guidance for future research and development in this area.
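The character-LM route typically relies on prefix beam search with shallow fusion; the pure-Python sketch below illustrates the idea, assuming a generic lm(prefix, char) callback that returns log P(char | prefix). WFST composition and sequence-to-sequence decoding are not shown.

```python
# CTC prefix beam search with shallow fusion of an external character LM.
# A self-contained sketch; the LM interface and weight are assumptions.
import math
from collections import defaultdict

NEG_INF = -float("inf")

def logsumexp(*xs):
    m = max(xs)
    return NEG_INF if m == NEG_INF else m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_prefix_beam_search(log_probs, alphabet, blank=0, beam=8, lm=None, lm_wt=0.5):
    """log_probs: per-frame log posteriors (T x V) from the CTC acoustic model."""
    # prefix -> (log prob of paths ending in blank, log prob ending in non-blank)
    beams = {(): (0.0, NEG_INF)}
    for frame in log_probs:
        nxt = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beams.items():
            for c, p in enumerate(frame):
                if c == blank:
                    # Blank keeps the prefix unchanged.
                    b, nb = nxt[prefix]
                    nxt[prefix] = (logsumexp(b, p_b + p, p_nb + p), nb)
                    continue
                new = prefix + (c,)
                lm_s = lm_wt * lm(prefix, c) if lm else 0.0
                b, nb = nxt[new]
                if prefix and prefix[-1] == c:
                    # A repeated symbol only extends the prefix if a blank separated it ...
                    nxt[new] = (b, logsumexp(nb, p_b + p + lm_s))
                    # ... otherwise CTC collapses the repeat onto the same prefix.
                    sb, snb = nxt[prefix]
                    nxt[prefix] = (sb, logsumexp(snb, p_nb + p))
                else:
                    nxt[new] = (b, logsumexp(nb, p_b + p + lm_s, p_nb + p + lm_s))
        # Keep only the `beam` most probable prefixes.
        beams = dict(sorted(nxt.items(),
                            key=lambda kv: logsumexp(*kv[1]),
                            reverse=True)[:beam])
    best = max(beams.items(), key=lambda kv: logsumexp(*kv[1]))[0]
    return "".join(alphabet[i] for i in best)
```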
Transcription or sub-titling of open-domain videos remains challenging for Automatic Speech Recognition (ASR) due to challenging acoustics, variable signal processing, and an essentially unrestricted domain. In previous work, we have shown that the visual channel, specifically object and scene features, can help to adapt the acoustic model (AM) and language model (LM) of a recognizer, and we now extend this work to end-to-end approaches. In the case of a Connectionist Temporal Classification (CTC)-based approach, we retain the separation of AM and LM, while for a sequence-to-sequence (S2S) approach, both information sources are adapted together in a single model. This paper also analyzes the behavior of CTC and S2S models on noisy video data (the How-To corpus) and compares it to results on the clean Wall Street Journal (WSJ) corpus, providing insight into the robustness of both approaches.
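One simple way to condition an end-to-end model on visual information is to project a per-video object/scene embedding and concatenate it to every acoustic frame before the encoder, as sketched below; the dimensions and the concatenation scheme are illustrative assumptions rather than the paper's exact adaptation mechanism.

```python
# Sketch: broadcast an utterance-level visual embedding (e.g. CNN object/scene
# features) over time and concatenate it to the acoustic features before a
# BiLSTM encoder with a CTC output layer. All sizes are placeholders.
import torch
import torch.nn as nn

class VisuallyAdaptedEncoder(nn.Module):
    def __init__(self, feat_dim=40, vis_dim=2048, vis_hid=64, hidden=320, vocab=50):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, vis_hid)
        self.encoder = nn.LSTM(feat_dim + vis_hid, hidden, num_layers=4,
                               bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab)

    def forward(self, feats, vis):
        # Broadcast the per-video visual vector over all T acoustic frames.
        v = torch.tanh(self.vis_proj(vis)).unsqueeze(1).expand(-1, feats.size(1), -1)
        enc, _ = self.encoder(torch.cat([feats, v], dim=-1))
        return self.out(enc).log_softmax(-1)

model = VisuallyAdaptedEncoder()
log_probs = model(torch.randn(2, 500, 40), torch.randn(2, 2048))  # (2, 500, 50)
```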
This paper proposes a novel approach to creating a unit set for CTC-based speech recognition systems. Using Byte-Pair Encoding (BPE), we learn a unit set of arbitrary size on a given training text. In contrast to using characters or words as units, this allows us to find a good trade-off between the size of our unit set and the available training data. We investigate both cross-word units, which may span multiple words, and subword units. By evaluating these unit sets with decoding methods that use a separate language model, we are able to show improvements over a purely character-based unit set.
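The core of BPE is a simple merge-counting loop; the sketch below learns a unit inventory of arbitrary size from raw text, and keeping the word-boundary marker inside the symbol sequence allows merges to produce cross-word units. The toy corpus and merge count are placeholders, and the paper's exact handling of word boundaries may differ.

```python
# Minimal BPE learning sketch: repeatedly merge the most frequent adjacent
# symbol pair. "▁" stands in for spaces, so merges may cross word boundaries.
from collections import Counter

def learn_bpe(text, num_merges):
    seq = list(text.replace(" ", "▁"))   # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        # Apply the merge everywhere in the sequence.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return merges, sorted(set(seq))

merges, units = learn_bpe("the cat sat on the mat the cat sat", num_merges=10)
print(units)  # learned CTC output units, e.g. "the▁", "at", ...
```

Restricting merges so they never cross the boundary marker would instead yield purely subword units.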