Interspeech 2020
DOI: 10.21437/interspeech.2020-2840
Scaling Up Online Speech Recognition Using ConvNets

Abstract: We design an online end-to-end speech recognition system based on Time-Depth Separable (TDS) convolutions and Connectionist Temporal Classification (CTC). We improve the core TDS architecture in order to limit the future context and hence reduce latency while maintaining accuracy. The system has almost three times the throughput of a well tuned hybrid ASR baseline while also having lower latency and a better word error rate. Also important to the efficiency of the recognizer is our highly optimized beam search…
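The latency-limiting idea in the abstract, restricting how far a convolution can look into future frames, can be sketched with asymmetric padding. This is an illustrative toy in NumPy, not the paper's actual TDS block; the function name and formulation are my own.

```python
import numpy as np

def limited_context_conv1d(x, kernel, future_frames=1):
    """1-D convolution whose output at time t depends on at most
    `future_frames` frames of future input, via asymmetric zero padding.

    Sketch of the latency-limiting idea only; not the TDS block itself.
    """
    k = len(kernel)
    past = k - 1 - future_frames          # left (past-side) padding
    assert past >= 0, "future_frames must be smaller than the kernel size"
    padded = np.pad(x, (past, future_frames))
    # Output frame t covers inputs x[t - past : t + future_frames + 1]
    return np.array([padded[t:t + k] @ kernel for t in range(len(x))])

x = np.arange(8, dtype=float)
y = limited_context_conv1d(x, np.ones(3) / 3, future_frames=1)
```

With `future_frames=1`, output frame t never reads input frames beyond t+1, which is what bounds the algorithmic lookahead latency.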

Cited by 33 publications (23 citation statements)
References 22 publications
“…Then, the training objective is defined by combining Eqs. (6) and (9): $\mathcal{L} := (1 - w)\,\mathcal{L}_{\text{CTC}} + w\,\mathcal{L}_{\text{InterCTC}}$ (10) with a hyper-parameter $w$.…”
Section: Intermediate CTC
confidence: 99%
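The combined objective in this citation is a simple weighted sum, which can be written as a one-line helper. Only the weighting L = (1 - w) * L_CTC + w * L_InterCTC comes from the quoted equation; averaging over intermediate layers and the default w are illustrative assumptions.

```python
def intermediate_ctc_objective(l_ctc, inter_losses, w=0.3):
    """Weighted training objective from the quoted equation:
    L = (1 - w) * L_CTC + w * L_InterCTC.

    Taking L_InterCTC as the mean over intermediate-layer CTC losses,
    and the default w, are assumptions for illustration only.
    """
    l_inter = sum(inter_losses) / len(inter_losses)
    return (1.0 - w) * l_ctc + w * l_inter
```

Setting w = 0 recovers the plain final-layer CTC loss, which is why w is treated as a tunable hyper-parameter.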
“…Connectionist temporal classification (CTC) [7] has been a widely used method for end-to-end ASR modeling [8,9,10,11,12], and it is an especially attractive method for on-device ASR. For low-end devices, CTC is suitable for lightweight modeling.…”
Section: Introduction
confidence: 99%
“…Decoder lag can be reduced by changing model architectures to reduce computation, as well as modifying the loss function to encourage the model to output tokens more promptly [23]. Unlike in previous works [18,24] inter alia, we examine emission delay of the first token, and the time required to finalize the ASR result without considering per-token emission delays.…”
Section: Measuring Latency Metrics
confidence: 99%
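The two latency quantities this citing work examines, first-token emission delay and the time to finalize the result after the audio ends, can be computed directly from emission timestamps. The function and argument names below are hypothetical, chosen only to illustrate the two definitions.

```python
def first_token_delay(token_emit_times_s):
    """Emission delay of the first token, measured from utterance start (s)."""
    return token_emit_times_s[0]

def finalization_lag(audio_end_s, final_result_time_s):
    """Time needed to finalize the ASR result after the audio stream ends (s)."""
    return final_result_time_s - audio_end_s
```

Note that neither quantity looks at per-token delays beyond the first token, matching the distinction the citation draws from prior work.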
“…Although recent advances in architectural design [7,8] and pre-training methods [9] have improved performance with CTC, it is usually weaker than encoder-decoder models, often credited to its strong conditional independence assumption, and closing this performance gap often requires external language models (LMs) and beam search algorithms [10,11], which demand extra computational cost and effectively make the model autoregressive. Therefore, it is important to improve CTC modeling to reduce overall computational overhead, ideally without the help of an LM or beam search.…”
Section: Introduction
confidence: 99%