2022
DOI: 10.1016/j.wocn.2022.101137

Neural representations for modeling variation in speech

Cited by 13 publications (15 citation statements)
References 36 publications

“…We compute embeddings from the hidden Transformer layers of three fine-tuned deep acoustic wav2vec 2.0 large models, and subsequently determine pronunciation differences using dynamic time warping (DTW) with these embeddings (Müller, 2007). We use fine-tuned acoustic models in this study as their hidden representations were found to show the closest match with human perceptual judgements of pronunciation variation (Bartelds et al., 2022). For the transcription-based approach, we apply a (phonetically sensitive) Levenshtein distance algorithm to the available corresponding phonetic transcriptions of the 10 words in all locations.…”
Section: Methods
confidence: 99%
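
The excerpt above describes a two-step comparison: per-frame embeddings are taken from a hidden Transformer layer of a fine-tuned wav2vec 2.0 large model, then aligned with dynamic time warping to yield a pronunciation difference. The sketch below illustrates that pipeline, assuming the HuggingFace transformers library; the checkpoint name and layer index are illustrative assumptions, not the exact choices of the cited work.

```python
# Sketch: pronunciation distance between two recordings of the same word,
# via hidden-layer wav2vec 2.0 embeddings aligned with DTW.
# MODEL_NAME and LAYER are illustrative assumptions.
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

MODEL_NAME = "facebook/wav2vec2-large-960h"  # a fine-tuned (CTC) large model
LAYER = 10                                   # hidden Transformer layer to use (assumption)

extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_NAME)
model = Wav2Vec2Model.from_pretrained(MODEL_NAME, output_hidden_states=True).eval()

def embed(waveform: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Return per-frame embeddings from one hidden Transformer layer."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states[0] is the Transformer input; indices 1..24 are the layers
    return out.hidden_states[LAYER].squeeze(0).numpy()

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Length-normalised DTW distance between two (frames, dim) matrices."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # Euclidean frame distance
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m]) / (n + m)

# pronunciation_difference = dtw_distance(embed(audio_a), embed(audio_b))
```

The transcription-based comparison mentioned in the same excerpt is a (possibly phonetically weighted) Levenshtein distance over phone strings, for which standard implementations can be substituted.
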
“…Recently, Bartelds et al (2022) found that representations from the hidden layers of pre-trained and fine-tuned wav2vec 2.0 (large) models are suitable to represent language variation. They showed that these representations capture linguistic information that is not represented by phonetic transcriptions, while being less sensitive to non-linguistic variation in the speech signal.…”
Section: Introduction
confidence: 99%
“…For BNF, we used the BUT/Phonexia feature extractor (https://docs.cognitive-ml.fr/shennong/), which returns for each time frame 80 activation values from a bottleneck layer originally trained for phone classification on the 17 languages of the IARPA Babel dataset [21]. For w2v2 features, we adapted the feature extraction code from [17], which can extract outputs from the Encoder CNN (E), the Quantiser module (Q), or any one of the 24 layers of the Transformer network (T01-T24). Thus, in total we extracted features using 28 different methods for each of the 10 datasets.…”
Section: Methods
confidence: 99%
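
This excerpt names three wav2vec 2.0 extraction points: the CNN encoder output (E), the quantiser output (Q), and the 24 Transformer layers (T01-T24). A sketch of how E and the Transformer layers can be enumerated with HuggingFace transformers follows; the checkpoint name is an illustrative assumption, the quantiser output is only exposed by the pre-training variant of the model and is omitted, and the BUT/Phonexia bottleneck features come from a separate toolkit not shown here.

```python
# Sketch: enumerate per-layer wav2vec 2.0 features for one utterance,
# covering E (CNN encoder output) and T01-T24 (Transformer layers).
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

MODEL_NAME = "facebook/wav2vec2-large-xlsr-53"  # any 24-layer (large) checkpoint (assumption)

extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_NAME)
model = Wav2Vec2Model.from_pretrained(MODEL_NAME, output_hidden_states=True).eval()

def all_w2v2_features(waveform, sr=16000):
    """Return a dict mapping method name -> (frames, dim) feature matrix."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    feats = {"E": out.extract_features.squeeze(0).numpy()}      # CNN encoder output
    for i, layer in enumerate(out.hidden_states[1:], start=1):  # Transformer layers
        feats[f"T{i:02d}"] = layer.squeeze(0).numpy()
    return feats  # 25 extraction points here; Q and the other feature types account for the rest
```
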
“…Typically, representations extracted from the final layers of a Transformer network tend to be more suited to the original training task than its middle layers, which are better suited for downstream tasks [16]. Experiments in [17] investigated which of the 24 w2v2 Transformer layers may be best suited for automatic pronunciation scoring of non-native English speech. Similar to QbE-STD, the task of pronunciation scoring is a 2-stage process: features are extracted from native and non-native speech samples of the same read text, and then a DTW-based distance is calculated (lower distance indicates closer pronunciations).…”
Section: Related Work
confidence: 99%
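
As a rough illustration of the two-stage pronunciation-scoring setup described above, the sketch below computes a DTW distance between matched native and non-native feature sequences for each candidate layer and keeps the layer whose distances track human pronunciation ratings most closely. The function names, data layout, and Pearson-correlation selection criterion are assumptions made for illustration, not the exact procedure of [17].

```python
# Sketch: pick the Transformer layer whose DTW distances best reflect
# human pronunciation ratings (lower distance = closer pronunciation).
import numpy as np
from scipy.stats import pearsonr

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Length-normalised DTW distance between two (frames, dim) matrices."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m]) / (n + m)

def best_layer(native_feats, learner_feats, human_scores):
    """native_feats / learner_feats: per-utterance dicts of layer -> features.
    human_scores: per-utterance pronunciation ratings.
    Returns the layer whose distances correlate most strongly with the ratings."""
    best, best_r = None, 0.0
    for layer in native_feats[0]:
        dists = [dtw_distance(nat[layer], lrn[layer])
                 for nat, lrn in zip(native_feats, learner_feats)]
        r, _ = pearsonr(dists, human_scores)
        if abs(r) > abs(best_r):
            best, best_r = layer, r
    return best, best_r
```
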