2020
DOI: 10.1109/taslp.2020.2987752

Online Hybrid CTC/Attention End-to-End Automatic Speech Recognition Architecture

Abstract: Recently, there has been increasing progress in end-to-end automatic speech recognition (ASR) architecture, which transcribes speech to text without any pre-trained alignments. One popular end-to-end approach is the hybrid Connectionist Temporal Classification (CTC) and attention (CTC/attention) based ASR architecture, which utilizes the advantages of both CTC and attention. The hybrid CTC/attention ASR systems exhibit performance comparable to that of the conventional deep neural network (DNN) / hidden Markov …

Citations: cited by 50 publications (33 citation statements)
References: 38 publications
“…In [47], the authors proposed an online hybrid CTC/attention E2E ASR architecture that replaces all the offline components of a conventional CTC/attention ASR architecture with their corresponding streaming components by using LibriSpeech English and Mandarin tasks (from the Hong Kong University of Science and Technology, HKUST) to decode the speech in a low-latency and real-time manner. The researchers in [92] introduced a combined framework to integrate social signal detection (SSD) and ASR systems based on CTC, which is an end-to-end model.…”
Section: Signal Processing (mentioning)
confidence: 99%
“…MTA [19] aims to solve the training and decoding mismatch problem. Specifically, MoChA and sMoChA only take the context with a predefined chunk width w during decoding, but they receive the full historical information of input during training.…”
Section: Streaming Attention (mentioning)
confidence: 99%
“…7. For full details, please refer to [19]. We use both sMoChA and MTA techniques in the experiments section.…”
Section: Streaming Attention (mentioning)
confidence: 99%
“…This not only speeds up the model training, but also expands the scope of attention to all the encoding timesteps before the current truncating point. In addition, MTA has so far given the best ASR performance among various hard attention mechanisms according to [28].…”
Section: Transformer for Online ASR (mentioning)
confidence: 99%