End-to-End Neural Transformer Based Spoken Language Understanding

Radfar, Martin; Mouchtaris, Athanasios; Kunzmann, Siegfried

doi:10.21437/interspeech.2020-1963

Cited by 46 publications

(31 citation statements)

References 28 publications

“…Set 2 [Light pretraining] -Experiments (5)(6)(7)(8). In this category encoder and decoder have their components initialized with models trained on the ATIS dataset.…”

Section: Methods (mentioning)

confidence: 99%

End-to-End Spoken Language Understanding Using Transformer Networks and Self-Supervised Pre-Trained Features

Morais, Kuo, Thomas et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Transformer networks and self-supervised pre-training have consistently delivered state-of-art results in the field of natural language processing (NLP); however, their merits in the field of spoken language understanding (SLU) still need further investigation. In this paper we introduce a modular End-to-End (E2E) SLU transformer network based architecture which allows the use of self-supervised pretrained acoustic features, pre-trained model initialization and multi-task training. Several SLU experiments for predicting intent and entity labels/values using the ATIS dataset are performed. These experiments investigate the interaction of pre-trained model initialization and multi-task training with either traditional filterbank or self-supervised pre-trained acoustic features. Results show not only that self-supervised pre-trained acoustic features outperform filterbank features in almost all the experiments, but also that when these features are used in combination with multi-task training, they almost eliminate the necessity of pre-trained model initialization.

“…More recently, other RNN based seq2seq models have been proposed by [12], highlighting the importance of model pre-training. The first Transformer based seq2seq model for E2E SLU was introduced in [6]; however, the authors used an architecture which supports neither multi-task learning nor model pre-training.…”

Section: Introduction (mentioning)

confidence: 99%

“…SOTA transformer: A state-of-art end-to-end SLU model fully based on transformers [20] which was evaluated on the FluentSpeech Commands dataset. We compare to the best results in [20] from its classification-based model. SOTA RNN: A state-of-art bidirectional RNN encoder based end-to-end model presented in [18] designed for the FluentSpeech Commands dataset.…”

Section: Sincnet/dfsmn-transformer (mentioning)

confidence: 99%

A Light Transformer For Speech-To-Intent Applications

Wang, Van hamme 2021

2021 IEEE Spoken Language Technology Workshop (SLT)

Spoken language understanding (SLU) systems can make life more agreeable, safer (e.g. in a car) or can increase the independence of physically challenged users. However, due to the many sources of variation in speech, a well-trained system is hard to transfer to other conditions like a different language or to speech-impaired users. A remedy is to design a user-taught SLU system that can learn fully from scratch from users' demonstrations, which in turn requires that the system's model quickly converges after only a few training samples. In this paper, we propose a light transformer structure by using a simplified relative position encoding with the goal to reduce the model size and improve efficiency. The light transformer works as an alternative speech encoder for an existing user-taught multitask SLU system. Experimental results on three datasets with challenging speech conditions show that our approach outperforms the existing system and other state-of-the-art models with half of the original model size and training time.

“…Transformers [21] are powerful neural architectures that lately have been used in ASR [22][23][24], SLU [25], and other audio-visual applications [26] with great success, mainly due to their attention mechanism. Only until recently, the attention concept has also been applied to beamforming, specifically for speech and noise mask estimations [9,27].…”

Section: Introduction (mentioning)

confidence: 99%

End-to-End Multi-Channel Transformer for Speech Recognition

Chang, Radfar, Mouchtaris et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

Transformers are powerful neural architectures that allow integrating different modalities using attention mechanisms. In this paper, we leverage the neural transformer architectures for multi-channel speech recognition systems, where the spectral and spatial information collected from different microphones are integrated using attention layers. Our multi-channel transformer network mainly consists of three parts: channel-wise self attention layers (CSA), cross-channel attention layers (CCA), and multi-channel encoder-decoder attention layers (EDA). The CSA and CCA layers encode the contextual relationship "within" and "between" channels and across time, respectively. The channel-attended outputs from CSA and CCA are then fed into the EDA layers to help decode the next token given the preceding ones. The experiments show that in a far-field in-house dataset, our method outperforms the baseline single-channel transformer, as well as the super-directive and neural beamformers cascaded with the transformers.
