Interspeech 2021
DOI: 10.21437/interspeech.2021-555

Multi-Encoder Learning and Stream Fusion for Transformer-Based End-to-End Automatic Speech Recognition

Cited by 12 publications (4 citation statements)
References: 0 publications
“…In this section, the proposed method is discussed. We adapt and extend the middle fusion approach presented in [15] to produce a process that allows us to combine auditory inputs with partially available visual features, during training as well as inference. Section 2.1 goes into the attention-based neural networks which are employed as base components of the considered systems.…”
Section: Methods (mentioning)
confidence: 99%
“…In this section, we discuss the proposed multimodal multi-encoder learning framework. To this end, we first review the original middle fusion algorithm described in [15]. Afterwards, we explain the adjustments that have to be made to ensure the resulting approach is […] Figure 4 shows a simplified diagram of the middle fusion multi-encoder learning framework.…”
Section: Multimodal Multi-encoder Learning Framework (mentioning)
confidence: 99%