2020
DOI: 10.48550/arxiv.2012.14788
Preprint

Detection of Lexical Stress Errors in Non-Native (L2) English with Data Augmentation and Attention

Abstract: This paper describes two novel complementary techniques that improve the detection of lexical stress errors in non-native (L2) English speech: attention-based feature extraction and data augmentation based on Neural Text-To-Speech (TTS). In a classical approach, audio features are usually extracted from fixed regions of speech such as the syllable nucleus. We propose an attention-based deep learning model that automatically derives an optimal syllable-level representation from frame-level and phoneme-level audio feat…

Cited by 1 publication (2 citation statements)
References 16 publications
“…In the future, we will experiment with discrete representation of the latent phoneme space such as Vector-Quantized Variational-Auto-Encoder (VQ-VAE) [36,37], which should fit better to discrete nature of phonemes. We plan to generate synthetic mispronounced speech, which is motivated by our recent work on using speech synthesis for generating speech errors in the related task of lexical stress error detection [34].…”
Section: Discussion (mentioning)
confidence: 99%
“…Then, for each utterance, we replace phonemes with random phonemes with a probability of 0.2. In [34] we found that generating incorrectly stressed speech using Text-To-Speech (TTS) improves the accuracy of detecting lexical stress errors in L2 speech. Although, as opposed to using TTS, we create pronunciation errors by perturbing the text, we expect this simpler approach should still help recognizing word-level pronunciation errors.…”
Section: Speech Corpora and Metrics (mentioning)
confidence: 99%
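
The text-perturbation step quoted in the citation statement above (replacing phonemes with random phonemes with a probability of 0.2) could look roughly like the sketch below. This is only an illustration, not the citing authors' code: the phoneme inventory `PHONEMES`, the function name `perturb_phonemes`, and the per-phoneme labels are hypothetical, and the choice to always pick a phoneme different from the original is an assumption the quoted text does not confirm.

```python
import random

# Hypothetical phoneme inventory (small ARPAbet-like subset), for illustration only.
PHONEMES = ["AA", "AE", "AH", "B", "D", "IY", "K", "L", "M", "N", "S", "T"]


def perturb_phonemes(phonemes, prob=0.2, inventory=PHONEMES, rng=None):
    """Replace each phoneme with a random one with probability `prob`.

    Returns (perturbed, labels), where labels[i] is 1 if position i was
    replaced and 0 otherwise.
    """
    if rng is None:
        rng = random.Random()
    perturbed, labels = [], []
    for p in phonemes:
        if rng.random() < prob:
            # Assumption: the replacement always differs from the original,
            # so every perturbed position is a genuine error.
            candidates = [q for q in inventory if q != p]
            perturbed.append(rng.choice(candidates))
            labels.append(1)
        else:
            perturbed.append(p)
            labels.append(0)
    return perturbed, labels


# Usage: perturb the phoneme sequence of a single utterance.
utterance = ["K", "AE", "T", "S", "AE", "T"]
noisy, error_labels = perturb_phonemes(utterance, prob=0.2, rng=random.Random(0))
print(noisy, error_labels)
```

Returning the labels alongside the perturbed sequence is a design choice made here so that the synthetic errors can serve as training targets for a word- or phoneme-level error detector.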