Multi-Encoder Learning and Stream Fusion for Transformer-Based End-to-End Automatic Speech Recognition

Lohrenz, Timo; Li, Zhengyang; Fingscheidt, Tim

doi:10.21437/interspeech.2021-555

Cited by 12 publications

(4 citation statements)

References 0 publications

“…In this section, the proposed method is discussed. We adapt and extend the middle fusion approach presented in [15] to produce a process that allows us to combine auditory inputs with partially available visual features, during training as well as inference. Section 2.1 goes into the attention-based neural networks which are employed as base components of the considered systems.…”

Section: Methods (mentioning)

confidence: 99%

“…In this section, we discuss the proposed multimodal multi-encoder learning framework. To this end, we first review the original middle fusion algorithm described in [15]. Afterwards, we explain the adjustments that have to be made to ensure the resulting approach is Figure 4 shows a simplified diagram of the middle fusion multi-encoder learning framework.…”

Section: Multimodal Multi-encoder Learning Framework (mentioning)

confidence: 99%

“…In [15], this principle is applied in the context of automatic speech recognition: Features representing the spectral magnitude and phase components of the used auditory data are combined using this multi-encoder approach to enhance the performance of the decoder, which is charged with the task of transforming input characters into output token probabilities. Fixed interpolation weights are used for calculating the relevant convex sums, biased towards the (generally) more salient magnitude representations.…”

Section: Multimodal Multi-encoder Learning Framework (mentioning)

confidence: 99%
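
The fixed-weight convex sum described in the statement above can be illustrated with a minimal sketch. It assumes each stream has already produced a same-sized output vector. The function name `convex_fuse`, the parameter `alpha`, and the example values are hypothetical and not taken from the cited paper; the sketch only shows what a convex combination biased towards the magnitude stream looks like.

```python
from typing import List

Vector = List[float]


def convex_fuse(magnitude_ctx: Vector, phase_ctx: Vector, alpha: float = 0.7) -> Vector:
    """Fixed-weight convex combination of two same-sized stream outputs.

    alpha is the (hypothetical) weight of the magnitude stream; the phase
    stream receives 1 - alpha, so the two weights always sum to one.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1] for a convex combination")
    if len(magnitude_ctx) != len(phase_ctx):
        raise ValueError("both streams must have the same dimensionality")
    return [alpha * m + (1.0 - alpha) * p for m, p in zip(magnitude_ctx, phase_ctx)]


# Illustrative values only: alpha > 0.5 biases the result towards magnitude.
fused = convex_fuse([1.0, 2.0], [3.0, 0.0], alpha=0.75)
print(fused)
```

Setting `alpha` to 1.0 reduces the sketch to the magnitude stream alone, and any value above 0.5 expresses the bias towards magnitude that the statement mentions.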

“…Instead, we propose an approach based on dynamically weighted fusion of intermediary auditory and visual features. This is done by adapting the multi-encoder framework presented in [15], which was originally utilized to achieve more robust predictions for speech recognition, without necessarily increasing time and/or memory complexity.…”

Section: Introduction (mentioning)

confidence: 99%
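
The statement above does not say how the dynamic weights are obtained. As a rough sketch only, one hypothetical option is to normalize two per-stream scores with a softmax and to fall back to the auditory stream when visual features are unavailable. The function `dynamic_fuse`, its score arguments, and the fallback rule are all assumptions made for illustration, not the citing paper's method.

```python
import math
from typing import List, Optional


def dynamic_fuse(
    audio: List[float],
    visual: Optional[List[float]],
    audio_score: float,
    visual_score: float,
) -> List[float]:
    """Fuse two feature vectors with softmax-normalized weights.

    The scores stand in for values a trained model would produce
    (an assumption). If the visual features are missing, the audio
    features are returned unchanged.
    """
    if visual is None:
        return list(audio)
    if len(audio) != len(visual):
        raise ValueError("both streams must have the same dimensionality")
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(audio_score, visual_score)
    exp_audio = math.exp(audio_score - top)
    exp_visual = math.exp(visual_score - top)
    w_audio = exp_audio / (exp_audio + exp_visual)
    return [w_audio * a + (1.0 - w_audio) * v for a, v in zip(audio, visual)]


# Equal scores give equal weights; a missing visual stream leaves audio as-is.
print(dynamic_fuse([2.0, 4.0], [0.0, 0.0], 0.0, 0.0))
print(dynamic_fuse([2.0, 4.0], None, 0.0, 0.0))
```

Because the weights are non-negative and sum to one, the result is still a convex combination; only the source of the weights differs from the fixed-weight case.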

Multi-encoder attention-based architectures for sound recognition with partial visual assistance

Boes; Van hamme (2022), J AUDIO SPEECH MUSIC PROC.

Large-scale sound recognition data sets typically consist of acoustic recordings obtained from multimedia libraries. As a consequence, modalities other than audio can often be exploited to improve the outputs of models designed for associated tasks. Frequently, however, not all contents are available for all samples of such a collection: For example, the original material may have been removed from the source platform at some point, and therefore, non-auditory features can no longer be acquired. We demonstrate that a multi-encoder framework can be employed to deal with this issue by applying this method to attention-based deep learning systems, which are currently part of the state of the art in the domain of sound recognition. More specifically, we show that the proposed model extension can successfully be utilized to incorporate partially available visual information into the operational procedures of such networks, which normally only use auditory features during training and inference. Experimentally, we verify that the considered approach leads to improved predictions in a number of evaluation scenarios pertaining to audio tagging and sound event detection. Additionally, we scrutinize some properties and limitations of the presented technique.

Knowledge-enhanced semantic communication system with OFDM transmissions

Xiong; Wang; et al. (2023), Sci. China Inf. Sci.

Multi-Encoder Transformer for Korean Abstractive Text Summarization

Shin (2023), IEEE Access

In this paper, we propose a Korean abstractive text summarization approach that uses a multi-encoder transformer. Recently, in many natural language processing (NLP) tasks, the use of pre-trained language models (PLMs) for transfer learning has achieved remarkable performance. In particular, transformer-based models such as Bidirectional Encoder Representations from Transformers (BERT) are used for pre-training and applied to downstream tasks, showing state-of-the-art performance including abstractive text summarization. However, existing text summarization models usually use one pre-trained model per model architecture, meaning that it becomes necessary to choose one PLM at a time. For PLMs applicable to Korean abstractive text summarization, there are publicly available BERT-based pre-trained Korean models that offer different advantages, such as Multilingual BERT, KoBERT, HanBERT, and KorBERT. We assume that if these PLMs could be leveraged simultaneously, better performance would be obtained. We propose a model that uses multiple encoders which are capable of leveraging multiple pre-trained models to create an abstractive summary. We evaluate our method using three benchmark Korean abstractive summarization datasets, named Law (AI-Hub), News (AI-Hub), and News (NIKL). Experimental results show that the proposed multi-encoder model variations outperform single-encoder models. We find the empirically best summarization model by determining the optimal input combination when leveraging multiple PLMs with the multi-encoder method.

Index terms: natural language processing, abstractive text summarization, bidirectional encoder representations from transformer, neural networks, natural language generation.
