Sri Harish Mallidi scite author profile

Sri Harish Mallidi

5 Publications

210 Citation Statements Received

80 Citation Statements Given

How they've been cited

240

204

How they cite others

Affiliations

Amazon (United States), Johns Hopkins University, International Institute of Information Technology, Hyderabad

Publications


Multilingual Sequence-to-Sequence Speech Recognition: Architecture, Transfer Learning, and Language Modeling

Cho, Baskar, et al. 2018

110


The sequence-to-sequence (seq2seq) approach to low-resource ASR is a relatively new direction in speech research. Its benefit is that models can be trained without a lexicon or alignments. However, this poses a new problem: it requires more data than conventional DNN-HMM systems. In this work, we use data from 10 BABEL languages to build a multilingual seq2seq model as a prior model, and then port it to 4 other BABEL languages using a transfer learning approach. We also explore different architectures for improving the prior multilingual seq2seq model. The paper also discusses the effect of integrating a recurrent neural network language model (RNNLM) with a seq2seq model during decoding. Experimental results show that transfer learning from the multilingual model yields substantial gains over monolingual models across all 4 BABEL languages. Incorporating an RNNLM also brings significant improvements in terms of %WER, and achieves recognition performance comparable to that of models trained with twice as much training data.


A Study for Improving Device-Directed Speech Detection Toward Frictionless Human-Machine Interaction

Huang, Maas, Mallidi, et al. 2019


Device-directed Utterance Detection

Mallidi, Maas, Goehner, et al. 2018


In this work, we propose a classifier for distinguishing device-directed queries from background speech in the context of interactions with voice assistants. Applications include rejection of false wake-ups or unintended interactions as well as enabling wake-word-free follow-up queries. Consider the example interaction: "Computer, play music", "Computer, reduce the volume". In this interaction, the user needs to repeat the wake-word (Computer) for the second query. To allow for more natural interactions, the device could immediately re-enter the listening state after the first query (without wake-word repetition) and accept or reject a potential follow-up as device-directed or background speech. The proposed model consists of two long short-term memory (LSTM) neural networks trained on acoustic features and automatic speech recognition (ASR) 1-best hypotheses, respectively. A feed-forward deep neural network (DNN) is then trained to combine the acoustic and 1-best embeddings, derived from the LSTMs, with features from the ASR decoder. Experimental results show that the ASR decoder features, acoustic embeddings, and 1-best embeddings yield equal error rates (EERs) of 9.3%, 10.9%, and 20.1%, respectively. Combining the features resulted in a 44% relative improvement and a final EER of 5.2%.
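
As a quick check on the figures quoted in this abstract, the relative improvement can be recomputed from the reported EERs. The sketch below assumes the 44% figure is measured against the best single feature set (the ASR decoder features at 9.3%); the abstract does not state the baseline explicitly. The function name `relative_improvement` and the dictionary keys are illustrative and do not come from the paper.

```python
def relative_improvement(baseline_eer: float, new_eer: float) -> float:
    """Return the reduction in error rate as a fraction of the baseline."""
    if baseline_eer <= 0:
        raise ValueError("baseline_eer must be positive")
    return (baseline_eer - new_eer) / baseline_eer


# EERs in percent, as quoted in the abstract above.
single_feature_eers = {"asr_decoder": 9.3, "acoustic": 10.9, "one_best": 20.1}
combined_eer = 5.2

# Assumption: the baseline is the best (lowest-EER) single feature set.
best_single = min(single_feature_eers.values())
gain = relative_improvement(best_single, combined_eer)
print(f"{gain:.1%}")  # prints 44.1%
```

Under that assumption the result rounds to the 44% relative improvement stated in the abstract.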


Robust Feature Extraction Using Modulation Filtering of Autoregressive Models

Ganapathy, Mallidi, Heřmanský 2014

IEEE/ACM Trans. Audio Speech Lang. Process.


Developing a speaker identification system for the DARPA RATS project

Plchot, Matsoukas, Matějka, et al. 2013


This paper describes the speaker identification (SID) system developed by the Patrol team for the first phase of the DARPA RATS (Robust Automatic Transcription of Speech) program, which seeks to advance state-of-the-art detection capabilities on audio from highly degraded communication channels. We present results using multiple SID systems differing mainly in the algorithm used for voice activity detection (VAD) and feature extraction. We show that (a) unsupervised VAD performs as well as supervised methods in terms of downstream SID performance, (b) noise-robust feature extraction methods such as CFCCs outperform MFCC front-ends on noisy audio, and (c) fusion of multiple systems provides a 24% relative improvement in EER compared to the single best system when using a novel SVM-based fusion algorithm that uses side information such as gender, language, and channel ID.


