Interspeech 2021
DOI: 10.21437/interspeech.2021-1720

A Study into Pre-Training Strategies for Spoken Language Understanding on Dysarthric Speech

Abstract: End-to-end (E2E) spoken language understanding (SLU) systems avoid an intermediate textual representation by mapping speech directly into intents with slot values. This approach requires considerable domain-specific training data. In low-resource scenarios this is a major concern, e.g., in the present study dealing with SLU for dysarthric speech. Pre-training part of the SLU model for automatic speech recognition targets helps, but no research has shown to what extent SLU on dysarthric speech benefits from knowl…

Cited by 8 publications (6 citation statements). References 12 publications.
“…Inspired by these, we explore pre-training of a TDNN acoustic model on a publicly available dysarthric speech corpus with ASR targets and then extract layer activations of the well-trained TDNN model as BNFs for dysarthric end-user utterances. The TDNN-based acoustic model is evaluated with capsule networks in [36] to predict SLU intent labels. To verify that we are learning something from the TDNN-based acoustic model other than just the idiosyncrasies of a particular decoding model, we will extensively evaluate it with three types of SLU decoders for semantic inference: the NMF model, the multilayer capsule network model, and the LSTM model.…”
Section: Contributions
confidence: 99%
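The pipeline quoted above, pre-training a TDNN on ASR targets and reusing an internal layer's activations as bottleneck features (BNFs) for downstream SLU decoders, can be sketched as follows. This is a toy PyTorch model with made-up layer sizes, not the cited architecture:

```python
import torch
import torch.nn as nn

class TinyTDNN(nn.Module):
    """Toy TDNN acoustic model: stacked dilated 1-D convolutions over
    frame-level features (a stand-in for a pre-trained ASR model)."""
    def __init__(self, feat_dim=40, hidden=64, bottleneck=32, n_targets=100):
        super().__init__()
        self.layer1 = nn.Conv1d(feat_dim, hidden, kernel_size=5, padding=2)
        self.layer2 = nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2, padding=2)
        self.bneck = nn.Conv1d(hidden, bottleneck, kernel_size=1)
        self.out = nn.Conv1d(bottleneck, n_targets, kernel_size=1)

    def forward(self, x):                 # x: (batch, feat_dim, frames)
        h = torch.relu(self.layer1(x))
        h = torch.relu(self.layer2(h))
        bnf = self.bneck(h)               # internal activations reused as BNFs
        return self.out(bnf), bnf

model = TinyTDNN().eval()
utt = torch.randn(1, 40, 120)             # one utterance, 120 feature frames
with torch.no_grad():
    logits, bnf = model(utt)
print(bnf.shape)                          # → torch.Size([1, 32, 120])
```

After ASR pre-training, the `bnf` tensor (one vector per frame) would be handed to any of the three SLU decoders mentioned in the quote (NMF, capsule network, or LSTM), which is what makes the comparison decoder-agnostic.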
“…The mode of meaning included the specific speech inputs into the proposed models. Studies that solely used speech features [15, 24-35] were heavily speaker-dependent in their approach and tended to lean more toward intelligibility of the dysarthric speaker than their comprehensibility.…”
Section: Mode Of Meaning Extraction Used
confidence: 99%
“…Finally, studies that used a hybrid approach [23,33,34,45,48,49] combined MFCC with variations of vector encoding. These studies were characterized by variations of models, such as adversarial networks, support vector machines, gated recurrent unit and convolutional neural networks, and hidden Markov models.…”
Section: Nature Of Speech Representations Used
confidence: 99%
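The "hybrid" representations described in the quote combine MFCCs with some form of vector encoding. One common way to do this, sketched below with NumPy and entirely hypothetical dimensions, is to concatenate an utterance-level encoding onto every MFCC frame:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: 13-dim MFCC frames for one utterance, plus a 32-dim
# utterance-level vector encoding (e.g. from some acoustic encoder).
mfcc = rng.standard_normal((120, 13))     # (frames, mfcc_dim)
encoding = rng.standard_normal(32)        # (embed_dim,)

# Hybrid representation: tile the utterance-level vector and append it
# to every MFCC frame before feeding a downstream classifier.
hybrid = np.concatenate(
    [mfcc, np.tile(encoding, (mfcc.shape[0], 1))], axis=1
)
print(hybrid.shape)                       # → (120, 45)
```

The cited studies pair such features with various classifiers (adversarial networks, SVMs, GRU/CNN models, HMMs); this sketch only illustrates the feature-combination step, not any particular model.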
“…To adapt to the Dutch Domotica data, we pre-train this model accompanied with CTC loss using Dutch Copas disordered speech [30]. Kaldi is a TDNN-F-based model proposed in [31], pre-trained on the Dutch Copas data as well. All compared models are combined with the same capsule network decoder for intent classification as described in Section 2.4.…”
Section: Domotica Dataset
confidence: 99%
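The CTC-loss pre-training step mentioned in the quote can be sketched in PyTorch. `torch.nn.CTCLoss` is the standard API; the toy encoder, vocabulary size, and sequence lengths below are illustrative, not the cited setup:

```python
import torch
import torch.nn as nn

# Toy acoustic encoder emitting per-frame log-probabilities over
# ASR targets (blank = index 0), as CTC requires.
vocab, frames, batch = 30, 50, 2
encoder = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, vocab))

feats = torch.randn(batch, frames, 40)                       # (N, T, feat)
log_probs = encoder(feats).log_softmax(-1).transpose(0, 1)   # (T, N, C)

targets = torch.randint(1, vocab, (batch, 12))   # label sequences, no blanks
input_lengths = torch.full((batch,), frames)
target_lengths = torch.full((batch,), 12)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()            # an optimizer step would follow during pre-training
print(loss.item() > 0)     # → True
```

After this pre-training converges, the encoder's representations would be passed to the shared capsule network decoder for intent classification, so that only the pre-training strategy differs between the compared systems.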
“…All compared models are combined with the same capsule network decoder for intent classification as described in Section 2.4. [7]: 0.931 (27M + 0.7M); Kaldi (pre-training) [31]: 0.939 (19M + 0.7M)…”
Section: Domotica Dataset
confidence: 99%