Speech Gender Classification Using Bidirectional Long Short Term Memory

Alamsyah, Rangga Dwi; Suyanto, Suyanto

doi:10.1109/isriti51436.2020.9315380

Cited by 13 publications

(5 citation statements)

References 33 publications


“…Initially, the mixed audio (mixture) undergoes STFT to obtain the time frequency spectrogram. Then, a neural network consisting of four layers of bidirectional long short-term memory (BLSTM) [21] and fully connected layers is used to record the mapping from the time frequency spectrogram to an embedding space. In this embedding space, the mapping of the spectrogram is represented using an embedding matrix V, and k represents the dimensionality of the embedding space.…”

Section: Time Frequency Domain-based Model (mentioning)

confidence: 99%
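The first step in the statement above, turning audio into a time-frequency spectrogram with an STFT, can be sketched as follows. This is a minimal illustration using only the Python standard library, not the citing paper's implementation: the function name `stft_magnitude`, the Hann window, and the frame and hop sizes are all assumptions chosen for readability.

```python
import cmath
import math


def stft_magnitude(signal, frame_len, hop):
    """Return a magnitude spectrogram as a list of frames.

    Each frame holds frame_len // 2 + 1 magnitudes (non-negative frequencies).
    A periodic Hann window is assumed; the source does not name a window.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        # Cut one frame out of the signal and taper it with the window.
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):
            # Naive DFT of the windowed frame at frequency bin k.
            coeff = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n in range(frame_len))
            spectrum.append(abs(coeff))
        frames.append(spectrum)
    return frames


# Toy input: a cosine that completes two cycles per 8-sample frame.
tone = [math.cos(2 * math.pi * 2 * n / 8) for n in range(16)]
spectrogram = stft_magnitude(tone, frame_len=8, hop=4)
peak_bin = max(range(len(spectrogram[0])), key=lambda k: spectrogram[0][k])
```

In a separation pipeline like the one described, a matrix of this kind (frames by frequency bins) would be the input that the BLSTM layers map into the embedding space. A practical system would use an FFT routine rather than the naive DFT shown here.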

Target Speaker Extraction Using Attention-Enhanced Temporal Convolutional Network

Wang, Lai, Tai et al. 2024, Electronics


When recording conversations, there may be multiple people talking at once. While our human ears can filter out unwanted sounds, this can be challenging for automatic speech recognition (ASR) systems, leading to reduced accuracy. To address this issue, preprocessing mechanisms such as speech separation and targeted speaker extraction are necessary to separate each person’s speech. With the development of deep learning, the quality of separated speech has improved significantly. Our objective is to focus on speaker extraction, which entails implementing a primary system for speech extraction and a secondary subsystem for delivering target information. To accomplish this, we have chosen a temporal convolutional network (TCN) architecture as the foundation of our speech extraction model. A TCN enables convolutional neural networks (CNNs) to manage time series modeling, and it can be constructed in various model lengths. Furthermore, we have integrated attention enhancement into the secondary subsystem to provide the speech extraction model with comprehensive and effective target information, which helps to improve the model’s ability to estimate masks. As a result, the quality of the target speaker extraction will be greatly enhanced with a more precise mask.
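The abstract says a TCN lets convolutional networks model time series and can be built at various lengths. The sketch below shows one common TCN ingredient, a dilated causal 1-D convolution, and how stacking such layers grows the receptive field. It is an illustration under assumptions: the abstract does not state the kernel size, dilation schedule, or whether the convolutions are causal, and the names `causal_dilated_conv1d` and `receptive_field` are hypothetical.

```python
def causal_dilated_conv1d(signal, kernel, dilation=1):
    """Compute y[t] = sum_k kernel[k] * x[t - k * dilation].

    Samples before the start of the signal are treated as zero, so each
    output depends only on the current and past inputs.
    """
    output = []
    for t in range(len(signal)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                acc += weight * signal[idx]
        output.append(acc)
    return output


def receptive_field(kernel_size, dilations):
    """Number of input samples seen by one output of a stack of layers,
    one layer per entry in dilations, all with the same kernel size."""
    return 1 + (kernel_size - 1) * sum(dilations)


# One layer with a 2-tap kernel and dilation 2.
layer_out = causal_dilated_conv1d([1, 2, 3, 4], [1, 1], dilation=2)

# Four layers with kernel size 3 and doubling dilations (assumed schedule).
field = receptive_field(3, [1, 2, 4, 8])
```

Doubling the dilation at each layer is one way a TCN reaches long temporal context with few layers, which is one reading of the "various model lengths" remark.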



“…The performance of MFCC with other extraction methods has been tested [11] and the results show MFCC outperforms other methods. From [1,5,19] and several other studies also use MFCC as a feature extraction method. The stages of MFCC [2,6] can be seen in Figure 2.…”

Section: Feature Extraction (mentioning)

confidence: 99%

“…This reshaping technique gives 0.4% improvement with 1D to 3D signal preprocessing as CNN input. Another deep learning architecture is Bidirectional Long Short-Term Memory (BLSTM) with a division of training and testing datasets of 80:20 resulting in the highest accuracy of 90.5% [19].…”

Section: Introduction (mentioning)

confidence: 99%

A Robust Gender Recognition System using Convolutional Neural Network on Indonesian Speaker

Switrayana, Hadi, Sulistianingsih 2024, SISTEMASI


Voice is one of the biometrics that humans have. Humans can be recognized by the sounds produced by their vocal cords and vocal tracts. One of the uses of voice is to recognize gender. Despite extensive research, gender recognition using machine learning remains unsatisfactory due to the complexity of voice features and the limitations of conventional algorithms. In this research, voice-based gender recognition is performed by applying deep learning. The deep learning model used is the Convolutional Neural Network (CNN). The input of CNN is the result of feature extraction from the Mel-Frequency Cepstral Coefficients (MFCC) method. MFCC produces Mel-Spectrograms which are important features of sound. The dataset used is Indonesian speech. In the research, there are imbalanced and balanced dataset scenarios to see the performance of the model. To produce a balanced dataset, random undersampling is performed on the majority class. In addition, the effect of dividing training and testing data with a composition of 70:30, 80:20, and 90:10 was observed. The results show that the model has 100% accuracy for all imbalanced dataset scenarios. Then the highest accuracy is 99.65% for the balanced dataset scenario with 70:30 splitting. In summary, it can be concluded that CNN performs very well in identifying gender from voice features overall, although its performance decreases when random undersampling is applied to the dataset.
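The balancing and splitting procedure described in this abstract can be sketched in a few lines. This is a hedged illustration, not the authors' code: the function names, the fixed seed, and the toy class counts (70 versus 30) are assumptions, and the abstract does not say whether the split was stratified.

```python
import random


def random_undersample(samples, labels, seed=0):
    """Randomly drop items from larger classes until every class has as
    many items as the smallest one. Returns shuffled (sample, label) pairs."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    smallest = min(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        kept = rng.sample(items, smallest)  # sampling without replacement
        balanced.extend((sample, label) for sample in kept)
    rng.shuffle(balanced)
    return balanced


def train_test_split(pairs, train_fraction):
    """Split already-shuffled pairs into a training part and a testing part."""
    cut = int(round(len(pairs) * train_fraction))
    return pairs[:cut], pairs[cut:]


# Toy data: 70 utterance ids of one class and 30 of the other (made-up counts).
ids = list(range(100))
genders = ["male"] * 70 + ["female"] * 30
balanced = random_undersample(ids, genders)

# The three compositions named in the abstract: 70:30, 80:20, and 90:10.
splits = {f: train_test_split(balanced, f) for f in (0.7, 0.8, 0.9)}
```

Undersampling discards majority-class data, so the balanced set here has 60 items instead of 100. Less training data is one possible reason for the accuracy drop the abstract reports, although the abstract itself does not give a cause.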


“…BLSTM has several applications in voice recognition. Examples include speech gender classification [28], speech emotion recognition [29], and native language identification in brief speech utterances [30].…”

Section: Introduction (mentioning)

confidence: 99%

Combined Bidirectional Long Short-Term Memory with Mel-Frequency Cepstral Coefficients Using Autoencoder for Speaker Recognition

et al. 2023


Recently, neural network technology has shown remarkable progress in speech recognition, including word classification, emotion recognition, and identity recognition. This paper introduces three novel speaker recognition methods to improve accuracy. The first method, called long short-term memory with mel-frequency cepstral coefficients for triplet loss (LSTM-MFCC-TL), utilizes MFCC as input features for the LSTM model and incorporates triplet loss and cluster training for effective training. The second method, bidirectional long short-term memory with mel-frequency cepstral coefficients for triplet loss (BLSTM-MFCC-TL), enhances speaker recognition accuracy by employing a bidirectional LSTM model. The third method, bidirectional long short-term memory with mel-frequency cepstral coefficients and autoencoder features for triplet loss (BLSTM-MFCCAE-TL), utilizes an autoencoder to extract additional AE features, which are then concatenated with MFCC and fed into the BLSTM model. The results showed that the performance of the BLSTM model was superior to the LSTM model, and the method of adding AE features achieved the best learning effect. Moreover, the proposed methods exhibit faster computation times compared to the reference GMM-HMM model. Therefore, utilizing pre-trained autoencoders for speaker encoding and obtaining AE features can significantly enhance the learning performance of speaker recognition. Additionally, it also offers faster computation time compared to traditional methods.
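All three methods in this abstract train with a triplet loss. A minimal sketch of that loss on plain Python lists follows. The Euclidean distance, the margin value of 1.0, and the function names are assumptions: the abstract does not specify the distance measure or margin, and some formulations use squared distances instead.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss.

    The loss is zero once the anchor is closer to the positive (same
    speaker) than to the negative (different speaker) by at least margin.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


# Toy 2-D embeddings; a real model would produce these from MFCC inputs.
anchor = [0.0, 0.0]
same_speaker = [0.0, 1.0]
other_speaker = [3.0, 4.0]

well_separated = triplet_loss(anchor, same_speaker, other_speaker)
badly_separated = triplet_loss(anchor, other_speaker, same_speaker)
```

Minimizing this quantity over many triplets pulls embeddings of the same speaker together and pushes different speakers apart, which is the property a speaker recognition model relies on at test time.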
