Unified ASR system using LGM-based source separation, noise-robust feature extraction, and word hypothesis selection

Fujita, Yusuke; Takashima, Ryoichi; Homma, Takeshi; Ikeshita, Rintaro; Kawaguchi, Yohei; Sumiyoshi, Takashi; Endo, Takashi; Togami, Masahito

doi:10.1109/asru.2015.7404825

Cited by 13 publications

(14 citation statements)

References 17 publications


“…Again, these techniques operate by estimating the relative transfer function for the target speaker and the interfering sources from data. As expected, Bagchi et al (2015) and Fujita et al (2015) reported similar performance for these two techniques on real and simulated data. Single-channel enhancement based on nonnegative matrix factorization (NMF) of the power spectra of speech and noise has also been used and resulted in minor improvement on both real and simulated data (Bagchi et al, 2015; Vu et al, 2015).…”

Section: Source Separation (supporting)

confidence: 62%

“…This claim also holds true for system combination based on recognizer output voting error reduction (ROVER) (Fiscus, 1997), as reported by Fujita et al (2015). This comes as no surprise as these techniques are somewhat orthogonal to acoustic modeling and they are either trained on separate material or do not rely on training at all.…”

Section: Language Modeling and Rover Fusion (mentioning)

confidence: 61%

“…This comes as no surprise as these techniques are somewhat orthogonal to acoustic modeling and they are either trained on separate material or do not rely on training at all. Fujita et al (2015) also proposed a discriminative word selection method to estimate correct words from the composite word transition network created in the ROVER process. They trained this method on the real development set and found it to improve the ASR performance on real data but to degrade it on simulated data.…”

Section: Language Modeling and Rover Fusion (mentioning)

confidence: 99%
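The voting idea behind ROVER-style system combination, as mentioned in the two statements above, can be illustrated with a minimal sketch. It assumes the hypotheses from each recognizer have already been aligned into slots of equal length; real ROVER builds that alignment by dynamic programming into a word transition network and can also weight votes by confidence scores, neither of which is shown here. The function name `vote_aligned`, the `NULL` marker, and the example sentences are hypothetical, and this is not the discriminative word selection method of Fujita et al (2015).

```python
from collections import Counter

NULL = ""  # hypothetical marker meaning "no word in this slot"

def vote_aligned(hypotheses):
    """Pick the most frequent entry in each slot of pre-aligned hypotheses.

    Assumes every hypothesis has the same number of slots; the alignment
    step of real ROVER is deliberately left out of this sketch.
    """
    result = []
    for slot in zip(*hypotheses):
        word, _count = Counter(slot).most_common(1)[0]
        if word != NULL:  # a winning NULL means the slot is dropped
            result.append(word)
    return result

# Three hypothetical recognizer outputs, already aligned slot by slot.
hyps = [
    ["the", "cat", "sat", NULL],
    ["the", "cat", "sat", "down"],
    ["a", "cat", "sat", "down"],
]
combined = vote_aligned(hyps)
```

Plain majority voting is the simplest rule; a learned selector, such as the one the statement above attributes to Fujita et al (2015), would replace the `most_common` step with a trained classifier over the same slots.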


An analysis of environment, microphone and data simulation mismatches in robust speech recognition

Vincent, Watanabe, Nugraha, et al. 2017. Computer Speech & Language.



“…The simplest approach has been to apply utterance-based feature mean and variance normalization (Zhao et al, 2015; Fujita et al, 2015; Du et al, 2015; Wang et al, 2015). However, the two most effective techniques are transforming the DNN features using feature-space maximum likelihood linear regression (fMLLR) (Hori et al, 2015; Moritz et al, 2015; Vu et al, 2015; Sivasankaran et al, 2015; Tran et al, unpublished) or augmentation of the DNN features using either i-vectors, (e.g., Moritz et al, 2015; Zhuang et al, 2015), pitch-based features (Ma et al, 2015; Wang et al, 2015; Du et al, 2015) or bottleneck features (Tachioka et al, 2015), i.e., extracted from bottleneck layers in speaker classification DNNs.…”

Section: Feature Design (mentioning)

confidence: 99%
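Utterance-based mean and variance normalization, named in the statement above, can be sketched as follows. This is a minimal illustration, assuming an utterance is a list of frames and each frame is a list of feature values; the function name `utterance_cmvn`, the `eps` guard, and the toy data are hypothetical and not taken from any of the cited systems.

```python
import math

def utterance_cmvn(frames, eps=1e-8):
    """Scale each feature dimension to zero mean and unit variance.

    Statistics are computed over the frames of a single utterance.
    `eps` is an assumed guard against division by zero for constant
    dimensions.
    """
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    return [
        [(f[d] - means[d]) / (stds[d] + eps) for d in range(dim)]
        for f in frames
    ]

# Toy utterance: 3 frames, 2 feature dimensions on different scales.
utt = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = utterance_cmvn(utt)
```

After normalization both dimensions carry the same values, which is the point: per-utterance offsets and gains, such as those caused by channel or microphone differences, are removed before the features reach the acoustic model.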

“…In Fujita et al (2015) and Zhao et al (2015), DNN filterbank features have been supplemented by delta and delta-delta features. Zhao et al (2015) show that this provides a significant improvement over using the 11 frames of filterbank features alone.…”

Section: Feature Design (mentioning)

confidence: 99%
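Supplementing static features with delta and delta-delta features, as described in the statement above, is commonly done with a regression over neighbouring frames. The sketch below uses the standard formula d_t = Σ n (c_{t+n} − c_{t−n}) / (2 Σ n²) with edge frames replicated. The window size of 2, the function names `delta` and `add_deltas`, and the ramp input are assumptions for illustration; the statement does not say which settings the cited systems used.

```python
def delta(frames, window=2):
    """Regression-based time derivative of each feature dimension.

    Frames beyond the utterance edges are replaced by the first or
    last frame (edge replication), an assumed boundary rule.
    """
    n_frames = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(n * n for n in range(1, window + 1))
    out = []
    for t in range(n_frames):
        row = []
        for d in range(dim):
            acc = 0.0
            for n in range(1, window + 1):
                ahead = frames[min(t + n, n_frames - 1)][d]
                behind = frames[max(t - n, 0)][d]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out

def add_deltas(frames, window=2):
    """Concatenate static, delta, and delta-delta features per frame."""
    d1 = delta(frames, window)
    d2 = delta(d1, window)  # delta-delta is the delta of the deltas
    return [f + a + b for f, a, b in zip(frames, d1, d2)]

# A linear ramp: away from the edges the slope is 1 and the curvature is 0.
ramp = [[float(t)] for t in range(7)]
stacked = add_deltas(ramp)
```

Each output frame is three times as wide as the input frame, so a network that previously saw only static filterbank values also receives local slope and curvature information.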

The third ‘CHiME’ speech separation and recognition challenge: Analysis and outcomes

Barker, Marxer, Vincent, et al. 2017. Computer Speech & Language.


This paper presents the design and outcomes of the CHiME-3 challenge, the first open speech recognition evaluation designed to target the increasingly relevant multichannel, mobile-device speech recognition scenario. The paper serves two purposes. First, it provides a definitive reference for the challenge, including full descriptions of the task design, data capture and baseline systems along with a description and evaluation of the 26 systems that were submitted. The best systems re-engineered every stage of the baseline resulting in reductions in word error rate from 33.4% to as low as 5.8%. By comparing across systems, techniques that are essential for strong performance are identified. Second, the paper considers the problem of drawing conclusions from evaluations that use speech directly recorded in noisy environments. The degree of challenge presented by the resulting material is hard to control and hard to fully characterise. We attempt to dissect the various 'axes of difficulty' by correlating various estimated signal properties with typical system performance on a per session and per utterance basis. We find strong evidence of a dependence on signal-to-noise ratio and channel quality. Systems are less sensitive to variations in the degree of speaker motion. The paper concludes by discussing the outcomes of CHiME-3 in relation to the design of future mobile speech recognition evaluations.


The CHiME Challenges: Robust Speech Recognition in Everyday Environments

Barker, Marxer, Watanabe. 2017. New Era for Robust Speech Recognition.


The CHiME challenge series has been aiming to advance the development of robust automatic speech recognition for use in everyday environments by encouraging research at the interface of signal processing and statistical modelling. The series has been running since 2011 and is now entering its 4th iteration. This chapter provides an overview of the CHiME series including a description of the datasets that have been collected and the tasks that have been defined for each edition. In particular the chapter describes novel approaches that have been developed for producing simulated data for system training and evaluation, and conclusions about the validity of using simulated data for robust speech recognition development. We also provide a brief overview of the systems and specific techniques that have proved successful for each task. These systems have demonstrated the remarkable robustness that can be achieved through a combination of training data simulation and multicondition training, well-engineered multichannel enhancement and state-of-the-art discriminative acoustic and language modelling techniques.
