Learning Acoustic Word Embeddings With Dynamic Time Warping Triplet Networks

Shitov, Denis; Pirogova, Elena; Wysocki, Tadeusz A.; Lech, Margaret

doi:10.1109/access.2020.2999055

Cited by 4 publications

(4 citation statements)

References 24 publications

“…Additionally, the study proposes to extract acoustic features using the acoustic word embedding (AWE) model [22]. This model was trained to discriminate between different words and allows for compact encoding of acoustics while preserving contextual information of the input.…”

Section: Contributions (mentioning)

confidence: 99%

“…To address the problem of inter and intra-speaker variability of speech representation, the acoustic word embedding (AWE) model [22] was employed to obtain embeddings from MFCCs. These embeddings were then used as an acoustic representation of speech in the learning algorithm.…”

Section: Acoustic Representation (mentioning)

confidence: 99%

“…Here, the AWE model was trained on the speech commands dataset in accordance with the original work [22] followed by fine-tuning on the synthetic speech dataset used as references in the vowel-to-vowel imitation task. Details of the reference dataset are presented in Section 3.3.…”

Section: Acoustic Representation (mentioning)

confidence: 99%

Deep Reinforcement Learning for Articulatory Synthesis in a Vowel-to-Vowel Imitation Task

Shitov, Pirogova, Wysocki, et al. (2023), Sensors

Self Cite

Articulatory synthesis is one of the approaches used for modeling human speech production. In this study, we propose a model-based algorithm for learning the policy to control the vocal tract of the articulatory synthesizer in a vowel-to-vowel imitation task. Our method does not require external training data, since the policy is learned through interactions with the vocal tract model. To improve the sample efficiency of the learning, we trained the model of speech production dynamics simultaneously with the policy. The policy was trained in a supervised way using predictions of the model of speech production dynamics. To stabilize the training, early stopping was incorporated into the algorithm. Additionally, we extracted acoustic features using an acoustic word embedding (AWE) model. This model was trained to discriminate between different words and to enable compact encoding of acoustics while preserving contextual information of the input. Our preliminary experiments showed that introducing this AWE model was crucial to guide the policy toward a near-optimal solution. The acoustic embeddings, obtained using the proposed approach, were revealed to be useful when applied as inputs to the policy and the model of speech production dynamics.

“…We used deep network embeddings as inputs to the unsupervised clustering module. Network embeddings extracted from pre-trained CNN models have been shown to provide excellent performance in the unsupervised classification of natural images [41] and speech synthesis [42], outperforming classical image features.…”

Section: Feature Extraction (mentioning)

confidence: 99%

Adversarial Learning Approach to Unsupervised Labeling of Fine Art Paintings

2021

Self Cite

An automatic classification of fine art images is limited by the scarcity of high-quality labels made by art experts. This study aims to provide meaningful automatic labeling of fine art paintings (machine labeling) without the need for human annotation. A new unsupervised Adversarial Clustering System (ACS) is proposed. The ACS is an adversarial learning approach comprising an unsupervised clustering module generating machine labels and a supervised classification module classifying the data based on the machine labels. Both modules are linked through an optimization algorithm iteratively improving the unsupervised clusters. The objective function driving the improvement consists of the within-cluster sum of squares (WCSS) error and the supervised classification accuracy. The proposed method was tested on three different fine-art datasets, including two sets of paintings previously categorized by art experts and one never categorized collection of Australian Aboriginal paintings. The unsupervised clusters were analyzed using standard unsupervised clustering metrics and a reliability measure between machine and human labeling. The ACS showed higher reliability compared to the classical k-means clustering method. The content analysis of unsupervised clusters indicated grouping based on scene composition, type, and shape of the object, edge sharpness and direction, and color palette.
