Using Speech Production Knowledge for Raw Waveform Modelling Based Styrian Dialect Identification

Dubagunta, S. Pavankumar; Magimai-Doss, Mathew

doi:10.21437/interspeech.2019-2398

“…As shown in this figure, the input to the system consists of pairs of reference and test representations of utterances. We follow the same procedure as in [22] to extract AP features for the representations of utterances (cf. Section 3.2).…”

Section: Technical Approach; citation type: mentioning

confidence: 99%

“…AP representations are extracted as in [22], where frame-level posteriors of four articulatory categories are computed, i.e., manner of articulation (e.g., degree of constriction), place of constriction, height of the tongue, and vowel. Posteriors for each category are estimated using CNNs trained on healthy speech data from the AMI corpus [25] based on acoustic phoneme-to-articulatory feature mappings [21].…”

Section: Articulatory Posterior Representation; citation type: mentioning

confidence: 99%


Automatic Dysarthric Speech Detection Exploiting Pairwise Distance-Based Convolutional Neural Networks

Janbakhshi, Kodrasi, Bourlard

2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


Automatic dysarthric speech detection can provide reliable and cost-effective computer-aided tools to assist the clinical diagnosis and management of dysarthria. In this paper we propose a novel automatic dysarthric speech detection approach based on analyses of pairwise distance matrices using convolutional neural networks (CNNs). We represent utterances through articulatory posteriors and consider pairs of phonetically-balanced representations, with one representation from a healthy speaker (i.e., the reference representation) and the other representation from the test speaker (i.e., test representation). Given such pairs of reference and test representations, features are first extracted using a feature extraction front-end, a frame-level distance matrix is computed, and the obtained distance matrix is considered as an image by a CNN-based binary classifier. The feature extraction, distance matrix computation, and CNN-based classifier are jointly optimized in an end-to-end framework. Experimental results on two databases of healthy and dysarthric speakers for different languages and pathologies show that the proposed approach yields a high dysarthric speech detection performance, outperforming other CNN-based baseline approaches.
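The frame-level distance matrix mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the distance measure, so Euclidean distance is an assumption here, and the function name `pairwise_distance_matrix` and the toy two-dimensional frames are hypothetical. In the paper the features come from a learned front-end that is optimized jointly with the classifier, which this sketch omits.

```python
import math


def pairwise_distance_matrix(reference, test_frames):
    """Return D where D[i][j] is the distance between reference frame i
    and test frame j. Euclidean distance is an assumption of this sketch."""
    if not reference or not test_frames:
        raise ValueError("both representations need at least one frame")
    dim = len(reference[0])
    for frame in list(reference) + list(test_frames):
        if len(frame) != dim:
            raise ValueError("all frames must have the same dimension")
    # One row per reference frame, one column per test frame.
    return [[math.dist(r, t) for t in test_frames] for r in reference]


# Toy example: 2 reference frames and 3 test frames of dimension 2.
reference = [[1.0, 0.0], [0.0, 1.0]]
test_frames = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
D = pairwise_distance_matrix(reference, test_frames)
```

The resulting matrix has one row per reference frame and one column per test frame, which is the shape a CNN would receive if the matrix is treated as a single-channel image.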


“…As shown in this figure, the input to the system consists of pairs of reference and test representations of utterances. We follow the same procedure as in [20] to extract AP features for the representations of utterances (cf. Section 3.2).…”

Section: Technical Approach; citation type: mentioning

confidence: 99%

Automatic dysarthric speech detection exploiting pairwise distance-based convolutional neural networks

Janbakhshi, Kodrasi, Bourlard

2020

Preprint




“…Currently, some applications of speech explored learning directly from raw waveform such as speech recognition [23]-[26], speaker verification [27], emotion recognition [28], and environment sound recognition [29]. In [30], raw waveform modeling approaches are used in Styrian dialect identification which performed better than the baseline methods. Inspired by this, we focus on analyzing the CNN filters trained on raw waveform for accent classification.…”

Section: Introduction; citation type: mentioning

confidence: 99%

Learning Filterbanks from Raw Waveform for Accent Classification

Kethireddy, Kadiri, Gangashetty

2020

2020 International Joint Conference on Neural Networks (IJCNN)


Most of the applications in speech use mel-frequency spectral coefficients (MFSC) as features as they match the human perceptual mechanism, where the emphasis is given to vocal tract characteristics. But in accent classification, mel-scale distribution of filters may not always be the best representations, e.g., pitch accented languages where the emphasis should be on vocal source information too. Motivated by this, we use end-to-end classification of accents directly from waveforms which will reduce the effort of designing features specific to each corpus. The convolution neural network (CNN) model architecture is designed in such a way that the initial layers exhibit similar operation as in MFSC by initializing the weights using time approximate of MFSC. The entire network along with initial layers is trained to learn accent classification. We observed that learning directly from waveform improved the performance of accent classification when compared to CNN trained on hand-engineered features by 10.94% UAR on the test dataset of common voice corpus. Analyzing the filters after learning, we observed changes in distribution and bandwidths of center frequencies. We further observed the importance of appropriately initializing CNN filters.
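As a rough illustration of the "mel-scale distribution of filters" that this abstract refers to, the sketch below computes mel-spaced center frequencies of the kind a filterbank initialization might start from. It is not the paper's method: the abstract gives neither the mel formula nor the number of filters, so the common 2595·log10(1 + f/700) mapping, the 40 filters, and the 0 to 8000 Hz range are all assumptions, and the function names are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    """Map frequency in Hz to mels (one common convention, assumed here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_min_hz, f_max_hz):
    """Return n_filters center frequencies (Hz) equally spaced on the mel
    scale, strictly between f_min_hz and f_max_hz."""
    if n_filters < 1 or not 0.0 <= f_min_hz < f_max_hz:
        raise ValueError("need n_filters >= 1 and 0 <= f_min_hz < f_max_hz")
    lo, hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (hi - lo) / (n_filters + 1)
    # Interior points only: the two band edges are not centers.
    return [mel_to_hz(lo + step * (k + 1)) for k in range(n_filters)]


# Assumed configuration: 40 filters covering 0 to 8000 Hz.
centers = mel_center_frequencies(40, 0.0, 8000.0)
```

Because the spacing is uniform in mels, the gaps between neighbouring centers grow with frequency in Hz, so low frequencies are covered more densely. The abstract reports that the distribution and bandwidths change once the filters are trained.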



Cited by 5 publications

References 14 publications
