2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
DOI: 10.23919/apsipaasc55919.2022.9979979
3M: An Effective Multi-view, Multi-granularity, and Multi-aspect Modeling Approach to English Pronunciation Assessment

Cited by 12 publications (15 citation statements)
References 29 publications
“…Table 4 compares the proposed approach with various other approaches using different datasets such as Speechocean762, TIMIT, LibriSpeech, and more. Recent models such as HuBERT [35] and Wav2Vec2 [36] were also compared. It should be noted that our model was not fine-tuned, whereas all other models were fine-tuned before obtaining these results.…”
Section: Results
confidence: 99%
“…All this emphasizes its high potential for extracting important characteristics. Among the successes reported are encouraging results in the assessment of English pronunciation [14]. The same holds for the successful use of prosody features in identifying correct recitation of the Qur'an [15].…”
Section: Related Work
confidence: 93%
“…A follow-up work attempted to further improve the system by using multi-view representations that comprise additional prosodic and SSL speech features [15]. In [16], Bi-directional Long Short-Term Memory (BLSTM) is adopted to predict fluency scores from a sequence of phone-level deep features (usually extracted from the bottleneck layer of a DNN acoustic model) and its corresponding phone duration features.…” (* This work was done during an internship at ByteDance AI Lab.)
Section: Introduction
confidence: 99%