Detection of Lexical Stress Errors in Non-Native (L2) English with Data Augmentation and Attention

Korzekwa, Daniel; Barra-Chicote, Roberto; Zaporowski, Szymon; Beringer, Grzegorz; Lorenzo-Trueba, Jaime; Serafinowicz, Alicja; Droppo, Jasha; Drugman, Thomas; Kostek, Bożena

doi:10.48550/arxiv.2012.14788

Cited by 1 publication

(2 citation statements)

References 16 publications


“…In the future, we will experiment with discrete representation of the latent phoneme space such as Vector-Quantized Variational-Auto-Encoder (VQ-VAE) [36,37], which should fit better to discrete nature of phonemes. We plan to generate synthetic mispronounced speech, which is motivated by our recent work on using speech synthesis for generating speech errors in the related task of lexical stress error detection [34].…”

Section: Discussion (mentioning)

confidence: 99%

“…Then, for each utterance, we replace phonemes with random phonemes with a probability of 0.2. In [34] we found that generating incorrectly stressed speech using Text-To-Speech (TTS) improves the accuracy of detecting lexical stress errors in L2 speech. Although, as opposed to using TTS, we create pronunciation errors by perturbing the text, we expect this simpler approach should still help recognizing word-level pronunciation errors.…”

Section: Speech Corpora and Metrics (mentioning)

confidence: 99%
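The text-perturbation step described in the statement above (replacing phonemes with random phonemes with a probability of 0.2) could look roughly like the following. This is only a minimal sketch, not the authors' implementation: the phoneme inventory, the function name `perturb_phonemes`, and the choice to always pick a replacement that differs from the original phoneme are assumptions made for illustration.

```python
import random

# Hypothetical toy phoneme inventory (ARPAbet-style symbols).
# The inventory actually used by the authors is not given in the quoted text.
PHONEMES = ["AA", "AE", "AH", "B", "D", "IY", "K", "L", "M", "N", "S", "T"]


def perturb_phonemes(phonemes, p=0.2, inventory=PHONEMES, rng=None):
    """Replace each phoneme with a random one with probability p.

    Returns the perturbed sequence and a list of booleans marking
    which positions were replaced. Assumption: a replacement always
    differs from the original phoneme.
    """
    rng = rng or random.Random()
    perturbed, changed = [], []
    for ph in phonemes:
        if rng.random() < p:
            # Draw a replacement from the inventory, excluding the original.
            candidates = [c for c in inventory if c != ph]
            perturbed.append(rng.choice(candidates))
            changed.append(True)
        else:
            perturbed.append(ph)
            changed.append(False)
    return perturbed, changed


# Example: perturb the phoneme sequence of one utterance.
utterance = ["K", "AE", "T", "S", "AA", "T"]
new_seq, flags = perturb_phonemes(utterance, p=0.2, rng=random.Random(0))
```

The returned flags could then be aggregated to the word level to mark synthetic mispronunciations, although how the authors derive word-level labels is not stated in the quoted passage.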


Weakly-supervised word-level pronunciation error detection in non-native English speech

Korzekwa, Lorenzo-Trueba, Drugman et al. 2021

Preprint

Self Cite


We propose a weakly-supervised model for word-level mispronunciation detection in non-native (L2) English speech. To train this model, phonetically transcribed L2 speech is not required and we only need to mark mispronounced words. The lack of phonetic transcriptions for L2 speech means that the model has to learn only from a weak signal of word-level mispronunciations. Because of that and due to the limited amount of mispronounced L2 speech, the model is more likely to overfit. To limit this risk, we train it in a multi-task setup. In the first task, we estimate the probabilities of word-level mispronunciation. For the second task, we use a phoneme recognizer trained on phonetically transcribed L1 speech that is easily accessible and can be automatically annotated. Compared to state-of-the-art approaches, we improve the accuracy of detecting word-level pronunciation errors in AUC metric by 30% on the GUT Isle Corpus of L2 Polish speakers, and by 21.5% on the Isle Corpus of L2 German and Italian speakers.
