An Efficient Reference Free Adaptive Learning Process for Speech Enhancement Applications

Jyoshna, Girika; Rahman, Md. Zıa Ur; KoteswaraRao, L.

doi:10.32604/cmc.2022.020160

Cited by 5 publications

(3 citation statements)

References 26 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…This research trained 5 triphone models. The first-order difference and secondorder difference information of the acoustic features were added and recorded as the tri1 model; linear discriminant analysis (LDA) and maximum likelihood linear regression (MLLR) were used as two linear transformations for the features to create the tri2 model; speaker adaptation was added to the tri3 model (using LDA, MLLT + SAT); the tri4 model built a larger SAT model by adjusting parameters; and finally, the "quick fast training" script in Kaldi was used to train a larger scale GMM model, denoted the tri5p model [21][22][23].…”

Section: Phone Set and Corpusmentioning

confidence: 99%

Research on Tibetan Speech Recognition Based on the Am-do Dialect

Khysru¹,

Wei²,

Dang³

2022

Computers, Materials &Amp; Continua

View full text Add to dashboard Cite

In China, Tibetan is usually divided into three major dialects: the Am-do, Khams and Lhasa dialects. The Am-do dialect evolved from ancient Tibetan and is a local variant of modern Tibetan. Although this dialect has its own specific historical and social conditions and development, there have been different degrees of communication with other ethnic groups, but all the abovementioned dialects developed from the same language: Tibetan. This paper uses the particularity of Tibetan suffixes in pronunciation and proposes a lexicon for the Am-do language, which optimizes the problems existing in previous research. Audio data of the Am-do dialect are expanded by data augmentation technology combining noise and reverberation, and the morphological characteristics and characteristics of the Tibetan language are further considered. According to the particularity of Tibetan grammar, grammatical features are used to optimize grammatical relationships and are combined with a language model, and the Am-do dialect is scored and rescored. Experimental results show that compared with the baseline, our proposed new lexicon and data augmentation technology yields a relative increase of approximately 3% in character error rates (CERs) and a relative increase of 3%-19% in the recognition rate of acoustic models and language models.

show abstract

Section: Phone Set and Corpusmentioning

confidence: 99%

Research on Tibetan Speech Recognition Based on the Am-do Dialect

Khysru¹,

Wei²,

Dang³

2022

Computers, Materials &Amp; Continua

View full text Add to dashboard Cite

show abstract

“…Speech enhancement is a fundamental task in speech signal processing, which is widely used in various scenarios, e.g., mobile phone, intelligent vehicles [1] and medical devices [2,3]. It is performed as a front-end signal procedure for automatic speech recognition (ASR), speaker identification, hearingaid devices and cochlear implant.…”

Section: Introductionmentioning

confidence: 99%

Speech Enhancement via Mask-Mapping Based Residual Dense Network

Lin¹,

Chen²,

Wu³

et al. 2023

Computers, Materials &Amp; Continua

View full text Add to dashboard Cite

Masking-based and spectrum mapping-based methods are the two main algorithms of speech enhancement with deep neural network (DNN). But the mapping-based methods only utilizes the phase of noisy speech, which limits the upper bound of speech enhancement performance. Maskingbased methods need to accurately estimate the masking which is still the key problem. Combining the advantages of above two types of methods, this paper proposes the speech enhancement algorithm MM-RDN (maskingmapping residual dense network) based on masking-mapping (MM) and residual dense network (RDN). Using the logarithmic power spectrogram (LPS) of consecutive frames, MM estimates the ideal ratio masking (IRM) matrix of consecutive frames. RDN can make full use of feature maps of all layers. Meanwhile, using the global residual learning to combine the shallow features and deep features, RDN obtains the global dense features from the LPS, thereby improves estimated accuracy of the IRM matrix. Simulations show that the proposed method achieves attractive speech enhancement performance in various acoustic environments. Specifically, in the untrained acoustic test with limited priors, e.g., unmatched signal-to-noise ratio (SNR) and unmatched noise category, MM-RDN can still outperform the existing convolutional recurrent network (CRN) method in the measures of perceptual evaluation of speech quality (PESQ) and other evaluation indexes. It indicates that the proposed algorithm is more generalized in untrained conditions.

show abstract

“…In practical application scenarios, speech signals will inevitably be disturbed by many interference factors such as noise, echo, and reverberation. Therefore, speech enhancement technology has been widely used in household appliances, communications, speech recognition, automotive electronics, hearing aids, and other fields [1][2][3]. Traditional speech enhancement methods, based on signal processing and statistical modeling, have good performance for stationary noise.…”

Section: Introductionmentioning

confidence: 99%

Using Hybrid Penalty and Gated Linear Units to Improve Wasserstein Generative Adversarial Networks for Single-Channel Speech Enhancement

Zhu¹,

Huang²

2023

Computer Modeling in Engineering &Amp; Sciences

View full text Add to dashboard Cite

Recently, speech enhancement methods based on Generative Adversarial Networks have achieved good performance in time-domain noisy signals. However, the training of Generative Adversarial Networks has such problems as convergence difficulty, model collapse, etc. In this work, an end-to-end speech enhancement model based on Wasserstein Generative Adversarial Networks is proposed, and some improvements have been made in order to get faster convergence speed and better generated speech quality. Specifically, in the generator coding part, each convolution layer adopts different convolution kernel sizes to conduct convolution operations for obtaining speech coding information from multiple scales; a gated linear unit is introduced to alleviate the vanishing gradient problem with the increase of network depth; the gradient penalty of the discriminator is replaced with spectral normalization to accelerate the convergence rate of the model; a hybrid penalty term composed of L1 regularization and a scale-invariant signal-to-distortion ratio is introduced into the loss function of the generator to improve the quality of generated speech. The experimental results on both TIMIT corpus and Tibetan corpus show that the proposed model improves the speech quality significantly and accelerates the convergence speed of the model.

show abstract

An Efficient Reference Free Adaptive Learning Process for Speech Enhancement Applications

Cited by 5 publications

References 26 publications

Research on Tibetan Speech Recognition Based on the Am-do Dialect

Research on Tibetan Speech Recognition Based on the Am-do Dialect

Speech Enhancement via Mask-Mapping Based Residual Dense Network

Using Hybrid Penalty and Gated Linear Units to Improve Wasserstein Generative Adversarial Networks for Single-Channel Speech Enhancement

Contact Info

Product

Resources

About