This paper describes the Speech Technology Center (STC) system for the 5th CHiME challenge, which considers the problem of distant multi-microphone conversational speech recognition in everyday home environments. Our efforts were focused on the single-array track, but we participated in the multiple-array track as well. The system belongs to ranking A of the challenge: the acoustic models retain frame-level tied phonetic targets, and the lexicon and language model are unchanged from the conventional ASR baseline. Our system employs a combination of 4 acoustic models based on convolutional and recurrent neural networks. Its two major features are speaker adaptation with target-speaker masks and a multi-channel speaker-aware acoustic model with neural network beamforming. In addition, various techniques for improving the acoustic models are applied, including array synchronization, data cleanup, alignment transfer, mixup, speed-perturbation data augmentation, room simulation, and backstitch training. Our system scored 3rd in the single-array track with a Word Error Rate (WER) of 55.5% and 4th in the multiple-array track with a WER of 55.6% on the evaluation data, achieving a substantial improvement over the baseline system.
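Among the acoustic-model improvements listed above, mixup augmentation can be sketched compactly. The following is a minimal, hedged illustration (not the system's actual implementation): two training examples are interpolated with a Beta-distributed weight, producing mixed features and soft targets. The function name `mixup` and the parameter `alpha` are assumptions for illustration.

```python
import numpy as np

def mixup(features_a, labels_a, features_b, labels_b, alpha=0.2, rng=None):
    """Mixup augmentation sketch: linearly interpolate two examples.

    features_*: feature matrices of identical shape (e.g. frames x filterbanks)
    labels_*:   one-hot (or soft) target vectors of identical shape
    alpha:      Beta-distribution parameter controlling interpolation strength
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)  # mixing weight in (0, 1)
    mixed_features = lam * features_a + (1.0 - lam) * features_b
    mixed_labels = lam * labels_a + (1.0 - lam) * labels_b
    return mixed_features, mixed_labels
```

Because the targets are mixed with the same weight as the features, the resulting soft labels still sum to one when the inputs are one-hot, which is what lets a cross-entropy-trained acoustic model consume them directly.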
This paper is a description of the Speech Technology Center (STC) automatic speech recognition (ASR) system for the "VOiCES from a Distance Challenge 2019". We participated in the Fixed condition of the ASR task, which means that the only training data available was an 80-hour subset of the LibriSpeech corpus. The main difficulty of the challenge is the mismatch between clean training data and distant noisy development/evaluation data. To tackle this, we applied room acoustics simulation and weighted prediction error (WPE) dereverberation. We also utilized the well-known speaker adaptation using x-vector speaker embeddings, as well as a novel room acoustics adaptation with R-vector room impulse response (RIR) embeddings. The system used a lattice-level combination of 6 acoustic models based on different pronunciation dictionaries and input features. N-best hypotheses were rescored with 3 neural network language models (NNLMs) trained on both words and sub-word units. NNLMs were also explored for out-of-vocabulary (OOV) word handling by means of artificial text generation. The final system achieved a Word Error Rate (WER) of 14.7% on the evaluation data, which is the best result in the challenge.
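The room acoustics simulation mentioned above boils down to convolving clean training speech with a room impulse response (RIR). A minimal sketch, assuming the RIRs are already available as sample arrays (the function name `simulate_reverb` and the energy-matching step are illustrative assumptions, not the paper's exact pipeline):

```python
import numpy as np

def simulate_reverb(clean, rir):
    """Room simulation sketch: convolve clean speech with a room impulse
    response, trim to the original length, and rescale to match the clean
    signal's energy so downstream feature extraction sees a comparable level.
    """
    reverberant = np.convolve(clean, rir)[: len(clean)]
    scale = np.sqrt(np.sum(clean ** 2) / (np.sum(reverberant ** 2) + 1e-12))
    return reverberant * scale
```

In practice such simulated reverberant copies of the clean corpus are added to the training set, which is one way to reduce the clean/distant mismatch the challenge poses.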