Zhiyuan Peng scite author profile

Zhiyuan Peng

5 Publications

56 Citation Statements Received

77 Citation Statements Given


Affiliations

Santa Clara University, Chinese University of Hong Kong, China Automotive Engineering Research Institute

Publications


Combining Adversarial Training and Disentangled Speech Representation for Robust Zero-Resource Subword Modeling

Feng, Lee, Peng, 2019


This study addresses the problem of unsupervised subword unit discovery from untranscribed speech. It forms the basis of the ultimate goal of ZeroSpeech 2019, building text-to-speech systems without text labels. In this work, unit discovery is formulated as a pipeline of phonetically discriminative feature learning and unit inference. One major difficulty in robust unsupervised feature learning is dealing with speaker variation. Here the robustness towards speaker variation is achieved by applying adversarial training and FHVAE-based disentangled speech representation learning. A comparison of the two approaches as well as their combination is studied in a DNN-bottleneck feature (DNN-BNF) architecture. Experiments are conducted on ZeroSpeech 2019 and 2017. Experimental results on ZeroSpeech 2017 show that both approaches are effective while the latter is more prominent, and that their combination brings further marginal improvement in across-speaker condition. Results on ZeroSpeech 2019 show that in the ABX discriminability task, our approaches significantly outperform the official baseline, and are competitive with or even outperform the official topline. The proposed unit sequence smoothing algorithm improves synthesis quality, at a cost of slight decrease in ABX discriminability.


Child Speech Disorder Detection with Siamese Recurrent Network Using Speech Attribute Features

Wang, Qin, Peng et al. 2019


Acoustics-based automatic assessment is a highly desirable approach to detecting speech sound disorder (SSD) in children. The performance of an automatic speech assessment system depends greatly on the availability of a good amount of properly annotated disordered speech, which is a critical problem particularly for child speech. This paper presents a novel design for a child speech disorder detection system that requires only normal speech for model training. The system is based on a Siamese recurrent network, which is trained to learn the similarity and discrepancy of pronunciations between a pair of phones in the embedding space. For detection of speech sound disorder, the trained network measures a distance that contrasts the test phone with the desired phone, and the distance is used to train a binary classifier. Speech attribute features are incorporated to measure the pronunciation quality and provide diagnostic feedback. Experimental results show that a Siamese recurrent network with a combination of speech attribute features and phone posterior features could attain an optimal detection accuracy of 0.941.


Adversarial Multi-task Deep Features and Unsupervised Back-end Adaptation for Language Recognition

Peng, Feng, Lee, 2019


Mixture Factorized Auto-Encoder for Unsupervised Hierarchical Deep Factorization of Speech Signal

Peng, Feng, Lee, 2020


Speech signal is constituted and contributed by various informative factors, such as linguistic content and speaker characteristic. There have been notable recent studies attempting to factorize speech signal into these individual factors without requiring any annotation. These studies typically assume continuous representation for linguistic content, which is not in accordance with general linguistic knowledge and may make the extraction of speaker information less successful. This paper proposes the mixture factorized auto-encoder (mFAE) for unsupervised deep factorization. The encoder part of mFAE comprises a frame tokenizer and an utterance embedder. The frame tokenizer models linguistic content of input speech with a discrete categorical distribution. It performs frame clustering by assigning each frame a soft mixture label. The utterance embedder generates an utterance-level vector representation. A frame decoder serves to reconstruct speech features from the encoders' outputs. The mFAE is evaluated on speaker verification (SV) task and unsupervised subword modeling (USM) task. The SV experiments on VoxCeleb 1 show that the utterance embedder is capable of extracting speaker-discriminative embeddings with performance comparable to an x-vector baseline. The USM experiments on ZeroSpeech 2017 dataset verify that the frame tokenizer is able to capture linguistic content and the utterance embedder can acquire speaker-related information.

Index Terms: unsupervised deep factorization, mixture factorized auto-encoder, speaker verification, unsupervised subword modeling


The CUHK-TUDELFT System for The SLT 2021 Children Speech Recognition Challenge

Ng, Liu, Peng et al. 2020

Preprint
