Speaker Identification Using Discriminative Features and Sparse Representation

Chin, Yu-Hao; Wang, Jia-Ching; Huang, Chien‐Lin; Wang, Kuang-Yao; Wu, Chung-Hsien

doi:10.1109/tifs.2017.2678458

Cited by 18 publications

(10 citation statements)

References 52 publications

“…Then, the trained model is used to convert the noisy speech signals into the clean speech signals. Notable machine-learning-based SE methods include compressive sensing [26], sparse coding [27], [28], non-negative matrix factorization [29], and robust principal component analysis [30], [31].…”

Section: Introduction (mentioning)

confidence: 99%

Improved Lite Audio-Visual Speech Enhancement

Chuang, Wang, Tsao. 2022. IEEE/ACM Trans. Audio Speech Lang. Process.

Numerous studies have investigated the effectiveness of audio-visual multimodal learning for speech enhancement (AVSE) tasks, seeking a solution that uses visual data as auxiliary and complementary input to reduce the noise of noisy speech signals. Recently, we proposed a lite audio-visual speech enhancement (LAVSE) algorithm for a car-driving scenario. Compared to conventional AVSE systems, LAVSE requires less online computation and to some extent solves the user privacy problem on facial data. In this study, we extend LAVSE to improve its ability to address three practical issues often encountered in implementing AVSE systems, namely, the additional cost of processing visual data, audio-visual asynchronization, and low-quality visual data. The proposed system is termed improved LAVSE (iLAVSE), which uses a convolutional recurrent neural network architecture as the core AVSE model. We evaluate iLAVSE on the Taiwan Mandarin speech with video dataset. Experimental results confirm that compared to conventional AVSE systems, iLAVSE can effectively overcome the aforementioned three practical issues and can improve enhancement performance. The results also confirm that iLAVSE is suitable for real-world scenarios, where high-quality audio-visual sensors may not always be available.

“…SE methods in the second class are based on machine-learning algorithms; these methods typically prepare a model for noisy-to-clean transformation in a data-driven manner. Notable SE methods belonging to this class include hidden Markov models [35], non-negative matrix factorization [36]-[38], compressive sensing [39], sparse coding [40], and robust principal component analysis [41]. In addition, artificial neural networks (ANNs), as a successful machine-learning model, have been used for SE because of their powerful nonlinear transformation capability.…”

Section: Introduction (mentioning)

confidence: 99%

CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application

et al. 2022

This study presents a deep learning-based speech signal-processing mobile application known as CITISEN. The CITISEN can perform three functions: speech enhancement (SE), model adaptation (MA), and background noise conversion (BNC), which allow CITISEN to be used as a platform for utilizing and evaluating SE models and flexibly extend the models to address various noise environments and users. For SE, CITISEN downloads pretrained SE models on the cloud server and then uses these models to effectively reduce noise components from instant or saved recordings provided by users. When it encounters noisy speech signals with unknown speakers or noise types, the MA function allows CITISEN to improve the SE performance effectively. A few audio files of unseen speakers or noise types are recorded and uploaded to the cloud server and then used to adapt the pretrained SE model. Finally, for the BNC, CITISEN removes the original background noise using an SE model, and then mixes the processed speech signal with new background noise. The novel BNC function can evaluate SE performance under specific conditions, cover people's tracks, and provide entertainment. The experimental results confirmed the effectiveness of SE, MA, and BNC functions. Compared with the noisy speech signals, the enhanced speech signals achieved about 6% and 33% of improvements, respectively, in terms of short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ). With MA, the STOI and PESQ could be further improved by approximately 6% and 11%, respectively. Note that the SE model and MA method are not limited to the ones described in this study and can be replaced with any SE model and MA method. Finally, the BNC experiment results indicated that the speech signals converted from noisy and silent backgrounds have a close scene identification accuracy and similar embeddings in an acoustic scene classification model. 
Therefore, the proposed BNC can effectively convert the background noise of a speech signal and be a data augmentation method when clean speech signals are unavailable.

Index Terms: speech enhancement, model adaptation, background noise conversion, deep learning, mobile application.

“…The prepared model is used to transform noisy speech signals to clean speech signals. Well-known machine learning-based models include non-negative matrix factorization [21], [22], [23], compressive sensing [24], sparse coding [25], [26], and robust principal component analysis (RPCA) [27]. Deep learning models have drawn great interest due to their outstanding nonlinear mapping capabilities.…”

Section: Introduction (mentioning)

confidence: 99%

Speech Enhancement Based on Denoising Autoencoder With Multi-Branched Encoders

Zezario, Wang, et al. 2020. IEEE/ACM Trans. Audio Speech Lang. Process.

Deep learning-based models have greatly advanced the performance of speech enhancement (SE) systems. However, two problems remain unsolved, which are closely related to model generalizability to noisy conditions: (1) mismatched noisy condition during testing, i.e., the performance is generally suboptimal when models are tested with unseen noise types that are not involved in the training data; (2) local focus on specific noisy conditions, i.e., models trained using multiple types of noises cannot optimally remove a specific noise type even though the noise type has been involved in the training data. These problems are common in real applications. In this paper, we propose a novel denoising autoencoder with a multi-branched encoder (termed DAEME) model to deal with these two problems. In the DAEME model, two stages are involved: training and testing. In the training stage, we build multiple component models to form a multi-branched encoder based on a decision tree (DSDT). The DSDT is built based on prior knowledge of speech and noisy conditions (the speaker, environment, and signal factors are considered in this paper), where each component of the multi-branched encoder performs a particular mapping from noisy to clean speech along the branch in the DSDT. Finally, a decoder is trained on top of the multi-branched encoder. In the testing stage, noisy speech is first processed by each component model. The multiple outputs from these models are then integrated into the decoder to determine the final enhanced speech. Experimental results show that DAEME is superior to several baseline models in terms of objective evaluation metrics, automatic speech recognition results, and quality in subjective human listening tests.
