2020
DOI: 10.1109/taslp.2019.2959721

Multi-Stream End-to-End Speech Recognition

Abstract: Attention-based methods and Connectionist Temporal Classification (CTC) networks have been promising research directions for end-to-end (E2E) Automatic Speech Recognition (ASR). The joint CTC/Attention model has achieved great success by utilizing both architectures during multi-task training and joint decoding. In this work, we present a multi-stream framework based on joint CTC/Attention E2E ASR, with parallel streams represented by separate encoders aiming to capture diverse information. On top of the regular…
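The joint CTC/attention training mentioned in the abstract corresponds to the standard multi-task objective from the hybrid CTC/attention literature. Below is a minimal PyTorch sketch of that objective; the tensor names, shapes, and the lam=0.3 default are illustrative assumptions, not the paper's actual code.

```python
import torch
import torch.nn.functional as F

# A minimal sketch of the joint CTC/attention multi-task objective,
#   L = lam * L_ctc + (1 - lam) * L_att.
# Tensor names and shapes are assumptions, not the paper's code.
def joint_ctc_attention_loss(ctc_log_probs,   # (T, B, V) log-softmax CTC outputs
                             att_logits,      # (B, U, V) attention-decoder logits
                             targets,         # (B, U) label ids, padded with -1
                             input_lengths,   # (B,) encoder frame counts
                             target_lengths,  # (B,) true label lengths
                             lam=0.3):
    # CTC reads only the first target_lengths[b] labels per utterance;
    # clamp the -1 padding so every entry is a valid index.
    l_ctc = F.ctc_loss(ctc_log_probs, targets.clamp(min=0),
                       input_lengths, target_lengths,
                       blank=0, zero_infinity=True)
    # Attention branch: token-level cross-entropy, ignoring padded positions.
    l_att = F.cross_entropy(att_logits.transpose(1, 2), targets,
                            ignore_index=-1)
    return lam * l_ctc + (1.0 - lam) * l_att
```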

Cited by 16 publications (15 citation statements)
References 50 publications
“…An end-to-end ASR model addressing the general multi-stream setting was introduced in [23]. As one representative framework, MEM-Array concentrates on the case of far-field microphone arrays to handle the differing dynamics of the streams.…”
Section: MEM-Array Model
confidence: 99%
“…In [23], we proposed a novel multi-stream model based on a joint CTC/Attention E2E scheme, where each stream is characterized by a separate encoder and CTC network. A Hierarchical Attention Network (HAN) [24,25] acts as a fusion component that dynamically assigns higher weights to streams carrying more discriminative information for prediction.…”
Section: Introduction
confidence: 99%
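As an illustration of the HAN-style fusion this statement describes, here is a minimal PyTorch sketch in which per-stream context vectors are weighted by attention computed from the current decoder state. The module name, dimensions, and additive-attention scoring are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Stream-level fusion in the spirit of a Hierarchical Attention Network:
# one context vector per stream is combined using weights derived from
# the decoder state, so more informative streams receive higher weight.
class StreamFusion(nn.Module):
    def __init__(self, ctx_dim, dec_dim, att_dim=128):
        super().__init__()
        self.w_ctx = nn.Linear(ctx_dim, att_dim, bias=False)
        self.w_dec = nn.Linear(dec_dim, att_dim, bias=False)
        self.score = nn.Linear(att_dim, 1, bias=False)

    def forward(self, stream_ctx, dec_state):
        # stream_ctx: (B, S, ctx_dim) one context vector per stream
        # dec_state:  (B, dec_dim)    current decoder hidden state
        energy = self.score(torch.tanh(
            self.w_ctx(stream_ctx) + self.w_dec(dec_state).unsqueeze(1)))
        weights = torch.softmax(energy, dim=1)        # (B, S, 1)
        fused = (weights * stream_ctx).sum(dim=1)     # (B, ctx_dim)
        return fused, weights.squeeze(-1)
```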
“…In other words, the model is expected to use auxiliary information to determine possible vowel-reduction positions and, at the same time, to detect whether vowel reduction occurs in the acoustic information. Several methods exist for fusing multi-stream input, such as the direct merging of high-dimensional feature vectors and the hierarchical-attention dynamic fusion used in [20,21]. However, these methods treat the different inputs as independent streams.…”
Section: Introduction
confidence: 99%
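For contrast with the attention-based fusion above, the "direct merging" baseline this statement mentions amounts to plain feature concatenation. A toy sketch, with assumed shapes:

```python
import torch

# Direct merging of multi-stream input: stack the streams along the
# feature axis and treat the result as one input, with no per-stream
# weighting. Shapes are assumptions for illustration.
def concat_fusion(stream_feats):
    # stream_feats: list of (B, T, D_i) tensors, one per input stream
    return torch.cat(stream_feats, dim=-1)  # (B, T, sum(D_i))
```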
“…The second kind aims to conduct channel selection and channel fusion to optimize ASR performance directly [8,9,12,13]. Early methods [8,9] chose the channel whose decoded ASR output had the highest likelihood.…”
Section: Introduction
confidence: 99%
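The likelihood-based channel selection described in that early line of work reduces to decoding each channel independently and keeping the highest-scoring hypothesis. A toy sketch; `asr_decode` is a hypothetical stand-in for any decoder that returns a hypothesis and its log-likelihood:

```python
# Decode every channel and keep the hypothesis with the highest
# ASR log-likelihood, as in the early channel-selection methods.
def select_best_channel(channels, asr_decode):
    best_hyp, best_score = None, float("-inf")
    for feats in channels:            # one feature matrix per channel
        hyp, log_lik = asr_decode(feats)
        if log_lik > best_score:
            best_hyp, best_score = hyp, log_lik
    return best_hyp, best_score
```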