Interspeech 2020
DOI: 10.21437/interspeech.2020-2770

Improved Hybrid Streaming ASR with Transformer Language Models

Cited by 11 publications (17 citation statements)
References 15 publications
“…Afterwards, the mean is dynamically updated for every new frame. In previous work, we showed that an initial delay of two seconds is enough to achieve performance similar to FSN [27], [28]. Although a two-second delay may be reasonable in a continuous streaming setup, it may not be suitable for short utterances such as voice commands.…”
Section: Acoustic Feature Normalization for Streaming
confidence: 89%
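The excerpt above describes dynamic mean normalization with a short initial look-ahead. The following is a minimal sketch of that idea, assuming a 10 ms frame shift (so two seconds ≈ 200 frames); the function name, buffer size, and NumPy implementation are illustrative, not taken from the paper.

```python
import numpy as np

def stream_normalize(frames, init_delay_frames=200):
    """Streaming mean normalization with a short initial look-ahead.

    The mean is first estimated over an initial buffer (about two
    seconds, i.e. ~200 frames at a 10 ms frame shift) and then updated
    dynamically with every new frame, as the excerpt describes.
    """
    frames = np.asarray(frames, dtype=np.float64)
    n = min(init_delay_frames, len(frames))  # assumes non-empty input
    running_sum = frames[:n].sum(axis=0)
    count = n
    mean = running_sum / count               # mean over the initial buffer
    out = []
    for t, x in enumerate(frames):
        if t >= n:                           # past the initial delay:
            running_sum += x                 # update the mean dynamically
            count += 1
            mean = running_sum / count
        out.append(x - mean)
    return np.stack(out)
```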
“…Not surprisingly, empirical assessment of this extended architecture under strict streaming conditions showed it to be highly effective, keeping pace with non-streaming (offline) systems. The most recent refinement in this research line has been to replace streaming-adapted LSTM-RNN LMs with Transformer LMs [28]. Empirical results on the well-known LibriSpeech [29] and TED-LIUM [30] tasks have shown that this refinement leads to top, state-of-the-art recognition rates and latencies under streaming conditions.…”
Section: Introduction
confidence: 99%
“…The most successful end-to-end ASR systems are based on connectionist temporal classification (CTC) [18], recurrent neural network (RNN) transducer (RNN-T) [17], and attention-based encoder-decoder architectures [19]. Recently, hybrid model systems have shown significant improvements in accuracy for streaming ASR [20,21]. Transformer is a sequence-to-sequence architecture originally proposed for machine translation [22].…”
Section: Background ASR
confidence: 99%
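As a concrete illustration of the first of those end-to-end criteria, the sketch below sets up CTC training with PyTorch's built-in nn.CTCLoss; all shapes and the blank-index convention are illustrative assumptions, not details from the cited systems.

```python
import torch
import torch.nn as nn

# Hypothetical shapes: T = 50 encoder frames, N = 2 utterances,
# C = 29 output symbols (28 labels plus the CTC blank at index 0).
T, N, C = 50, 2, 29

# Stand-in for encoder outputs; CTCLoss expects (T, N, C) log-probabilities.
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=-1)

# Padded label sequences (no blanks), with their true lengths.
targets = torch.randint(1, C, (N, 20), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.tensor([20, 15], dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients flow back to the (stand-in) encoder outputs
```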
“…Directly modelling long-span word histories using conventional back-off n-gram models [1] generally leads to a severe data-sparsity issue [2]. To this end, over the past few decades there have been significant efforts in the speech technology community to develop artificial neural network based language modelling techniques [3]–[14]. Neural network language models (NNLMs), which represent longer-span history contexts in a continuous, lower-dimensional vector space, are used to improve generalization performance.…”
Section: Introduction
confidence: 99%
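To make the continuous-space idea in that excerpt concrete, here is a minimal feedforward NNLM sketch in PyTorch: an n-gram history is embedded, concatenated, and projected to a next-word distribution. The class name and all layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FeedforwardNNLM(nn.Module):
    """Minimal feedforward NNLM sketch: each word in an n-gram history
    is mapped to a continuous embedding; the embeddings are concatenated
    and projected to a distribution over the next word."""

    def __init__(self, vocab_size, embed_dim=128, context=3, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(context * embed_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, history):               # history: (batch, context) word ids
        e = self.embed(history)                # (batch, context, embed_dim)
        h = torch.tanh(self.hidden(e.flatten(1)))
        return self.out(h).log_softmax(-1)     # log P(w | history)
```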
“…With the rapid progress of deep neural network (DNN) based ASR technologies in recent decades, the underlying network architectures of NNLMs have evolved from feedforward structures [3]–[7] to more advanced variants represented by long short-term memory recurrent neural networks (LSTM-RNNs) [8]–[10], [15] and, more recently, neural Transformers [11]–[14], [16] designed for modelling longer-range contexts. In particular, Transformer based language models have in recent years defined state-of-the-art performance across a range of ASR task domains [11]–[14], [17]. These models [11]–[13], [17] are often constructed by deeply stacking multiple self-attention based neural building blocks [18]–[20], each of which also includes residual connections [21] and layer normalization modules [22].…”
Section: Introduction
confidence: 99%
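The building block described in that last sentence can be sketched directly in PyTorch: masked multi-head self-attention plus a position-wise feedforward sublayer, each wrapped in a residual connection and layer normalization. The class name and hyperparameters below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TransformerLMBlock(nn.Module):
    """One self-attention building block of a Transformer LM: masked
    multi-head self-attention and a position-wise feedforward sublayer,
    each followed by a residual connection and layer normalization."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, causal_mask):
        # Residual connection around masked self-attention.
        a, _ = self.attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + a)
        # Residual connection around the position-wise feedforward.
        return self.norm2(x + self.ff(x))

# Usage: a boolean causal mask keeps each position from attending ahead.
x = torch.randn(2, 16, 512)                              # (batch, seq, d_model)
mask = torch.triu(torch.ones(16, 16, dtype=torch.bool), diagonal=1)
y = TransformerLMBlock()(x, mask)                        # same shape as x
```

This sketch uses post-norm placement (normalization after the residual sum), as in the original Transformer; many LM implementations instead use pre-norm for training stability.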