2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2018
DOI: 10.1109/icassp.2018.8461870

The Microsoft 2017 Conversational Speech Recognition System

Abstract: We describe the 2017 version of Microsoft's conversational speech recognition system, in which we update our 2016 system with recent developments in neural-network-based acoustic and language modeling to further advance the state of the art on the Switchboard speech recognition task. The system adds a CNN-BLSTM acoustic model to the set of model architectures we combined previously, and includes character-based and dialog session aware LSTM language models in rescoring. For system combination we adopt a twosta…

Citations: cited by 385 publications (246 citation statements)
References: 49 publications
“…Automatic Speech Recognition (ASR) is a key technology for the task of automatic analysis of any kind of spoken speech, e.g., phone calls or meetings. For scenarios of relatively clean speech, e.g., recordings of telephone speech or audio books, ASR technologies have improved drastically over the recent years [1]. More realistic scenarios like spontaneous speech or meetings with multiple participants in many cases require the ASR system to recognize the speech of multiple speakers simultaneously.…”
Section: Introduction (mentioning)
confidence: 99%
“…The later state-of-the-art model, DenseNet [10], also uses SC and BN. Besides success in computer vision, ResNet has also performed well in acoustic models for speech recognition [11,12].…”
Section: Introduction (mentioning)
confidence: 99%
“…Table 1 for descriptions. Improvements in speech recognition [28], dialogue generation [12,24], emotional speech synthesis [17,26] and computer graphics have made it possible to design more expressive and realistic conversational agents. However, there are still many uncertainties in how best to design embodied agents.…”
Section: Introduction (mentioning)
confidence: 99%