Spoken language understanding (SLU) refers to the process of inferring the semantic information from audio signals.While the neural transformers consistently deliver the best performance among the state-of-the-art neural architectures in field of natural language processing (NLP), their merits in a closely related field, i.e., spoken language understanding (SLU) have not beed investigated. In this paper, we introduce an end-toend neural transformer-based SLU model that can predict the variable-length domain, intent, and slots vectors embedded in an audio signal with no intermediate token prediction architecture. This new architecture leverages the self-attention mechanism by which the audio signal is transformed to various subsubspaces allowing to extract the semantic context implied by an utterance. Our end-to-end transformer SLU predicts the domains, intents and slots in the Fluent Speech Commands dataset with accuracy equal to 98.1 %, 99.6 %, and 99.6 %, respectively and outperforms the SLU models that leverage a combination of recurrent and convolutional neural networks by 1.4 % while the size of our model is 25% smaller than that of these architectures. Additionally, due to independent sub-space projections in the self-attention layer, the model is highly parallelizable which makes it a good candidate for on-device SLU.
Multilingual ASR technology simplifies model training and deployment, but its accuracy is known to depend on the availability of language information at runtime. Since language identity is seldom known beforehand in real-world scenarios, it must be inferred on-the-fly with minimum latency. Furthermore, in voice-activated smart assistant systems, language identity is also required for downstream processing of ASR output. In this paper, we introduce streaming, end-to-end, bilingual systems that perform both ASR and language identification (LID) using the recurrent neural network transducer (RNN-T) architecture. On the input side, embeddings from pretrained acousticonly LID classifiers are used to guide RNN-T training and inference, while on the output side, language targets are jointly modeled with ASR targets. The proposed method is applied to two language pairs: English-Spanish as spoken in the United States, and English-Hindi as spoken in India. Experiments show that for English-Spanish, the bilingual joint ASR-LID architecture matches monolingual ASR and acoustic-only LID accuracies. For the more challenging (owing to within-utterance code switching) case of English-Hindi, English ASR and LID metrics show degradation. Overall, in scenarios where users switch dynamically between languages, the proposed architecture offers a promising simplification over running multiple monolingual ASR models and an LID classifier in parallel.
Conventional dynamic language switching enables seamless multilingual interactions by running several monolingual ASR systems in parallel and triggering the appropriate downstream components using a standalone language identification (LID) service. Since this solution is neither scalable nor cost-and memory-efficient, especially for on-device applications, we propose end-to-end, streaming, joint ASR-LID architectures based on the recurrent neural network transducer framework. Two key formulations are explored: (1) joint training using a unified output space for ASR and LID vocabularies, and (2) joint training viewed as multi-task optimization. We also evaluate the benefit of using auxiliary language information obtained on-thefly from an acoustic LID classifier. Experiments with the English-Hindi language pair show that: (a) multi-task architectures perform better overall, and (b) the best joint architecture surpasses monolingual ASR (6.4-9.2% word error rate reduction) and acoustic LID (53.9-56.1% error rate reduction) baselines while reducing the overall memory footprint by up to 46%.
No abstract
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.