2022 · Preprint
DOI: 10.48550/arxiv.2204.03863
Automatic Pronunciation Assessment using Self-Supervised Speech Representation Learning

Abstract: Self-supervised learning (SSL) approaches such as wav2vec 2.0 and HuBERT models have shown promising results in various downstream tasks in the speech community. In particular, speech representations learned by SSL models have been shown to be effective for encoding various speech-related characteristics. In this context, we propose a novel automatic pronunciation assessment method based on SSL models. First, the proposed method fine-tunes the pre-trained SSL models with connectionist temporal classification t…
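As a rough illustration of the recipe the abstract outlines (CTC fine-tuning a pre-trained SSL model, then reusing its hidden representations for score prediction), here is a minimal sketch assuming the HuggingFace transformers API. The checkpoint name, mean-pooling, and regression head are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch: CTC fine-tuning of a pre-trained SSL model, then reusing its
# hidden states as features for a pronunciation-score regressor.
# Assumes the HuggingFace `transformers` API; checkpoint and head are illustrative.
import torch
import torch.nn as nn
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

def ctc_finetune_step(waveform, transcript, optimizer):
    """One CTC fine-tuning step on an (audio, transcript) pair."""
    inputs = processor(waveform, sampling_rate=16_000, return_tensors="pt")
    labels = processor(text=transcript, return_tensors="pt").input_ids
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

class PronunciationScorer(nn.Module):
    """Regress an utterance-level score from pooled SSL hidden states."""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(hidden_size, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, waveform):
        inputs = processor(waveform, sampling_rate=16_000, return_tensors="pt")
        # Reuse the fine-tuned encoder's frame-level representations.
        hidden = model.wav2vec2(inputs.input_values).last_hidden_state  # (1, T, H)
        return self.head(hidden.mean(dim=1)).squeeze(-1)  # mean-pool over time
```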

Cited by 5 publications (6 citation statements) · References 29 publications
“…As can be seen in Table 3, the system with the "SSL + ASR-based" setting, i.e., SSL features with forced alignment, achieves the best performance, followed by the proposed approach and then the other two approaches. The recent work "SSL & Text" [28] developed a scorer by modeling the SSL speech representations and the prompt text information as two separate streams. Note that different sizes and types of SSL models were tried, and the best reported performance comes from the non-native-ASR fine-tuned HuBERT Large.…”
Section: Experimental Results in the "Read Aloud" Scenario (mentioning)
confidence: 99%
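To make the "SSL & Text" two-stream idea mentioned in the statement above concrete, here is a hedged sketch: one stream pools frame-level SSL speech features, the other encodes the prompt text, and the fused vector is regressed to a score. All dimensions, the GRU text encoder, and the fusion scheme are illustrative assumptions, not the cited system's actual architecture.

```python
# Hedged sketch of a two-stream scorer in the spirit of the "SSL & Text" system:
# stream 1 carries SSL speech features, stream 2 the read-aloud prompt text.
# All sizes and the fusion are illustrative stand-ins, not the cited model.
import torch
import torch.nn as nn

class TwoStreamScorer(nn.Module):
    def __init__(self, ssl_dim=768, vocab_size=10_000, text_dim=256):
        super().__init__()
        self.speech_proj = nn.Linear(ssl_dim, 256)            # stream 1: SSL features
        self.text_embed = nn.Embedding(vocab_size, text_dim)  # stream 2: prompt text
        self.text_enc = nn.GRU(text_dim, 128, batch_first=True, bidirectional=True)
        self.head = nn.Linear(256 + 256, 1)                   # fuse pooled streams

    def forward(self, ssl_feats, prompt_ids):
        # ssl_feats: (B, T, ssl_dim) frame-level SSL representations
        # prompt_ids: (B, L) token ids of the prompt text
        speech = self.speech_proj(ssl_feats).mean(dim=1)       # (B, 256)
        text, _ = self.text_enc(self.text_embed(prompt_ids))   # (B, L, 256)
        fused = torch.cat([speech, text.mean(dim=1)], dim=-1)  # (B, 512)
        return self.head(fused).squeeze(-1)                    # utterance score
```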
“…Table 4 compares the proposed approach with various other approaches on different datasets such as Speechocean762, TIMIT, LibriSpeech, and more. Recent models such as HuBERT [35] and Wav2Vec2 [36] were also compared. It should be noted that our model was not fine-tuned, whereas all the other models were fine-tuned before producing these results.…”
Section: Results (mentioning)
confidence: 99%
“…Jiatong Shi et al. proposed the context-dependent CaGOP algorithm, which predicts the duration of each phoneme by feeding the reference text into a self-attentive text-based encoder during GOP calculation, and uses the difference between the expected duration and the actual duration of the phoneme obtained by forced alignment as a penalty factor in the GOP calculation [12]. In addition to studies that use GOP algorithms for spoken language evaluation, there are also studies that evaluate spoken language without GOP algorithms, such as wav2vec2.0-based [13], [14], [15] and deep-learning-feature-based methods [16]; however, because of the limited speech data available from L2 speakers, training such methods usually requires pre-trained models and transfer learning. Spoken language assessment studies are usually divided into automatic mispronunciation detection and automatic pronunciation quality assessment according to the task's objectives.…”
Section: Related Work (mentioning)
confidence: 99%
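The duration-penalty idea described for CaGOP can be sketched as follows: a standard GOP score (mean log posterior of the canonical phone over its aligned frames) is reduced by a term that grows with the gap between the forced-aligned duration and the duration predicted from the reference text. This is an illustrative reading, assuming a simple additive fusion with weight alpha; the cited paper's exact formulation may differ.

```python
# Illustrative sketch of a duration-penalised GOP, per the description above.
# Variable names and the additive fusion are hypothetical, not the cited paper's.
import numpy as np

def gop(posteriors, phone_idx):
    """Classic GOP: mean log posterior of the canonical phone over its frames.

    posteriors: (T, num_phones) frame-level phone posteriors for the
    segment obtained from forced alignment.
    """
    return float(np.mean(np.log(posteriors[:, phone_idx] + 1e-10)))

def duration_penalty(aligned_dur, predicted_dur):
    """Relative gap between the forced-aligned duration and the duration
    predicted from the reference text by the text encoder."""
    return abs(aligned_dur - predicted_dur) / max(predicted_dur, 1e-6)

def penalized_gop(posteriors, phone_idx, aligned_dur, predicted_dur, alpha=1.0):
    # Hypothetical additive fusion; alpha weights the duration penalty.
    return gop(posteriors, phone_idx) - alpha * duration_penalty(
        aligned_dur, predicted_dur
    )
```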