End-to-End Neural Transformer Based Spoken Language Understanding

Preprint, 2020. DOI: 10.48550/arxiv.2008.10984

Cited by 5 publications (8 citation statements, published 2020–2023). References 0 publications.
“…E2E ASR is implemented in ESPnet, where it has 12 Transformer encoder layers and 6 decoder layers. The choice of the Transformer is similar to [16]. E2E ASR is optimized with hybrid CTC/attention losses [30] with label smoothing.…”
Section: Methods
confidence: 99%
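For orientation, the topology cited above (12 Transformer encoder layers, 6 decoder layers) can be sketched in a few lines of PyTorch. This is a minimal illustration, not the ESPnet recipe the statement refers to; the model dimension, head count, and vocabulary size are assumptions.

```python
import torch
import torch.nn as nn

# Illustrative hyperparameters; only the layer counts come from the citation.
D_MODEL, N_HEADS, VOCAB = 256, 4, 5000

transformer = nn.Transformer(
    d_model=D_MODEL,
    nhead=N_HEADS,
    num_encoder_layers=12,  # encoder depth from the citation
    num_decoder_layers=6,   # decoder depth from the citation
    batch_first=True,
)
proj = nn.Linear(D_MODEL, VOCAB)  # map decoder states to token logits

# Dummy acoustic features (batch, frames, dim) and target embeddings.
src = torch.randn(2, 100, D_MODEL)
tgt = torch.randn(2, 20, D_MODEL)
logits = proj(transformer(src, tgt))  # shape: (2, 20, VOCAB)
```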
“…However, these NLU works [10,12,13] usually ignore ASR or require an off-the-shelf ASR during testing. A line of E2E SLU work does take speech as input, yet it frames slots as intents and therefore their SLU models are really designed for IC only [8,9,14,15,16]. Another line of E2E SLU work jointly predicts text and IC/SL from speech, yet it either requires large amounts of in-house data, or restricts the pretraining scheme to ASR subword prediction [7,17,18,19].…”
Section: Introduction
confidence: 99%
“…E2E ASR is implemented in ESPnet [65], where it has 12 Transformer encoder layers and 6 decoder layers. The choice of the Transformer architecture [60] is due to its empirical successes in [33] and concurrent SLU work [48]. The E2E ASR is trained with hybrid CTC/attention loss [64] (CTC weight is 0.3, attention weight is 0.7) with label smoothing.…”
Section: Methods
confidence: 99%
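The hybrid CTC/attention objective mentioned here (CTC weight 0.3, attention weight 0.7, with label smoothing) amounts to a weighted sum of two losses. A minimal PyTorch sketch, assuming illustrative shapes and dummy model outputs rather than the actual ESPnet implementation:

```python
import torch
import torch.nn as nn

# Weights from the citation; vocabulary size and blank index are assumptions.
CTC_W, ATT_W = 0.3, 0.7
VOCAB, BLANK = 5000, 0

ctc_loss_fn = nn.CTCLoss(blank=BLANK, zero_infinity=True)
att_loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)  # smoothed attention loss

# Dummy outputs: (frames, batch, vocab) log-probs for CTC over encoder states,
# (batch * tokens, vocab) logits from the attention decoder.
T, B, U = 100, 2, 20
encoder_logp = torch.randn(T, B, VOCAB).log_softmax(-1)
decoder_logits = torch.randn(B * U, VOCAB)
targets = torch.randint(1, VOCAB, (B, U))  # labels exclude the blank index

loss_ctc = ctc_loss_fn(
    encoder_logp, targets,
    input_lengths=torch.full((B,), T, dtype=torch.long),
    target_lengths=torch.full((B,), U, dtype=torch.long),
)
loss_att = att_loss_fn(decoder_logits, targets.reshape(-1))
loss = CTC_W * loss_ctc + ATT_W * loss_att  # hybrid objective: 0.3 / 0.7
```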
“…A common practice is to convert normalized token sequence in spoken form produced by ASR into a written form better suited to processing by downstream components in dialog systems [15]. This written form is then used to extract structured information in the form of intent and slot-values to continue a dialog [16].…”
Section: Related Work
confidence: 99%
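A toy sketch of the two-stage practice this statement describes: spoken-form ASR output is first converted to written form, and the written form is then used for intent and slot-value extraction. Both functions below are hypothetical stand-ins for illustration, not an API from the cited works.

```python
from typing import Dict, Tuple

def to_written_form(spoken: str) -> str:
    """Toy inverse text normalization: map a few spoken tokens to written form."""
    rules = {"three": "3", "p m": "pm"}
    out = spoken
    for spoken_tok, written_tok in rules.items():
        out = out.replace(spoken_tok, written_tok)
    return out

def extract_intent_slots(written: str) -> Tuple[str, Dict[str, str]]:
    """Toy slot filler; a real system would use a trained NLU model here."""
    slots = {}
    words = written.split()
    if words and words[-1] in ("am", "pm"):
        slots["time"] = " ".join(words[-2:])
    return "set_alarm", slots

spoken = "set an alarm for three p m"
written = to_written_form(spoken)       # "set an alarm for 3 pm"
intent, slots = extract_intent_slots(written)
print(intent, slots)                    # set_alarm {'time': '3 pm'}
```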