ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9054715
Towards Fast and Accurate Streaming End-To-End ASR

Cited by 105 publications (74 citation statements)
References 13 publications
“…We will now measure the latency of our cascade ST system in a streaming scenario. Following (Li et al, 2020), we define accumulative chunk-level latencies at three points in the system, as the time elapsed between the last word of a chunk being spoken, and: 1) The moment the consolidated hypothesis for that chunk is provided by the ASR system; 2) The moment the segmenter defines that chunk on the ASR consolidated hypothesis; 3) The moment the MT system translates the chunk defined by the segmenter. These three latency figures, in terms of mean and standard deviation, are shown in Table 6.…”
Section: Latency Evaluation (citation type: mentioning; confidence: 99%)
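The accumulative chunk-level latencies quoted above can be computed directly from per-chunk event timestamps. The following Python sketch is only an illustration of that definition with made-up timestamps and variable names (spoken, asr_done, seg_done, and mt_done are hypothetical, not taken from Li et al., 2020): each latency is the stage completion time minus the time the last word of the chunk was spoken, reported as mean and standard deviation.

```python
# Minimal sketch of the accumulative chunk-level latency definition quoted above.
# All timestamps and variable names are hypothetical, for illustration only.
from statistics import mean, stdev

# Per-chunk event times in seconds:
#   spoken[i]   - last word of chunk i finishes being spoken
#   asr_done[i] - ASR consolidated hypothesis for chunk i is available
#   seg_done[i] - segmenter has defined chunk i on the ASR hypothesis
#   mt_done[i]  - MT system has translated chunk i
spoken   = [1.8, 4.1, 7.0]
asr_done = [2.6, 5.0, 8.2]
seg_done = [2.9, 5.4, 8.6]
mt_done  = [3.3, 5.9, 9.1]

def latency(stage_times, spoken_times):
    """Accumulative latency: stage completion time minus end-of-speech time, per chunk."""
    deltas = [s - w for s, w in zip(stage_times, spoken_times)]
    return mean(deltas), stdev(deltas)

for name, times in [("ASR", asr_done), ("segmenter", seg_done), ("MT", mt_done)]:
    m, sd = latency(times, spoken)
    print(f"{name:10s} latency: {m:.2f} s ± {sd:.2f} s")
```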
“…End-to-end (E2E) recurrent neural network transducer (RNN-T) [4] models have gained enormous popularity for streaming ASR applications, as they are naturally streamable [1,5,6,7,10,11,12,13]. However, naive training with a sequence transduction objective [4] to maximize the log-probability of target sequence is unregularized and these streaming models learn to predict better by using more context, causing significant emission delay (i.e., the delay between the user speaking and the text appearing).…”
Section: Introduction (citation type: mentioning; confidence: 99%)
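For context, the sequence transduction objective referred to in this statement is, in its standard form, the negative log-probability of the target sequence marginalized over all blank-augmented alignments (the notation below is generic, not taken from this paper); nothing in this loss constrains when labels are emitted, which is why the quoted emission delay arises.

$$
\mathcal{L}_{\text{RNN-T}} = -\log P(\mathbf{y}^{*} \mid \mathbf{x})
= -\log \sum_{\mathbf{a} \in \mathcal{B}^{-1}(\mathbf{y}^{*})} P(\mathbf{a} \mid \mathbf{x}),
$$

where $\mathcal{B}$ is the mapping that removes blank symbols from an alignment $\mathbf{a}$.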
“…Recently there are some approaches trying to regularize or penalize the emission delay. For example, Li et al [1] proposed Early and Late Penalties to enforce the prediction of </s> (end of sentence) within a reasonable time window given by a voice activity detector (VAD). Constrained Alignments [2,3] were also proposed by extending the penalty terms to each word, based on speech-text alignment information [14] generated from an existing speech model.…”
Section: Introduction (citation type: mentioning; confidence: 99%)
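The early-and-late penalty idea attributed to Li et al. [1] above can be sketched as a per-utterance term added to the training loss. The snippet below is a hypothetical simplification, not the paper's actual formulation: the function name, frame-based units, and weights are assumptions made for illustration. It penalizes emitting </s> before or after a small window around the VAD-reported end of speech.

```python
# Hypothetical sketch of an early/late emission penalty for </s>,
# inspired by the idea quoted above (not the paper's exact formulation).

def eos_penalty(t_eos: int, t_vad_end: int,
                t_buffer: int = 2, early_weight: float = 1.0,
                late_weight: float = 1.0) -> float:
    """Penalize emitting </s> outside a window around the VAD end-of-speech frame.

    t_eos     - frame at which the model emits </s>
    t_vad_end - end-of-speech frame reported by the voice activity detector
    t_buffer  - frames of slack allowed on either side of t_vad_end
    """
    early = max(0, (t_vad_end - t_buffer) - t_eos)   # emitted too early
    late  = max(0, t_eos - (t_vad_end + t_buffer))   # emitted too late
    return early_weight * early + late_weight * late

# During training this term would be added to the transducer loss, e.g.:
#   loss = rnnt_loss + eos_penalty(predicted_eos_frame, vad_end_frame)
print(eos_penalty(t_eos=48, t_vad_end=40))  # -> 6.0 (six frames too late)
```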