Identification of Weakly Pitch-Shifted Voice Based on Convolutional Neural Network

Ye, Yongchao; Lao, Lingjie; Yan, Diqun; Wang, Rangding

doi:10.1155/2020/8927031

Cited by 5 publications

(6 citation statements)

References 18 publications


“…Finally, according to the number of Mel filters selected, the Mel filter bank matrix H_m(k) with size of 513 × 128 was generated and multiplied by the results of STFT to obtain a power spectrum of Mel scale, i.e., a 59 × 128 real matrix X_m(t). The calculation process is shown in Equations (3) and (4) [63]:

where X_m(t) denotes the power spectrum matrix, L denotes the number of data frames in the direction of time axis, M denotes the number of Mel filter banks, S_t(k) denotes the spectrum obtained by STFT, H_m(k) is the expression of Mel filter banks, where o(m), c(m), and h(m) are frequency points, and their interval determines the type of filter banks.…”
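The matrix product described in this statement can be sketched with NumPy. This is a minimal illustration, not the cited authors' implementation: the 16 kHz sample rate, the common 2595·log10(1 + f/700) Mel formula, the triangular filter shape, the reading of o(m), c(m), and h(m) as the lower, center, and upper edge of each filter, and the random placeholder STFT are all assumptions. Only the 513 × 128 and 59 × 128 shapes come from the quoted text.

```python
import numpy as np

def hz_to_mel(f):
    # Common Mel-scale formula (assumed; the quoted text does not specify one).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_bins=513, n_mels=128, sample_rate=16000):
    """Return an (n_bins, n_mels) matrix of triangular filters."""
    # Center frequency of each STFT bin, from 0 Hz to Nyquist.
    freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    # n_mels + 2 edge frequencies, equally spaced on the Mel scale.
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    edges = mel_to_hz(mel_edges)
    bank = np.zeros((n_bins, n_mels))
    for m in range(n_mels):
        # Assumed meaning: o = lower edge, c = center, h = upper edge.
        o, c, h = edges[m], edges[m + 1], edges[m + 2]
        rising = (freqs - o) / (c - o)
        falling = (h - freqs) / (h - c)
        bank[:, m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

# Placeholder STFT: 59 frames x 513 frequency bins of random complex values.
rng = np.random.default_rng(0)
stft = rng.standard_normal((59, 513)) + 1j * rng.standard_normal((59, 513))

power = np.abs(stft) ** 2   # power spectrum per frame
H = mel_filter_bank()       # 513 x 128 filter bank
X = power @ H               # Mel-scale power spectrum
print(X.shape)              # (59, 128)
```

With 513 bins (a 1024-point FFT) and 128 filters, the lowest filters can be narrower than one bin, so a practical pipeline might use fewer filters or a normalized bank from an audio library.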

Section: Methods (mentioning)

confidence: 99%

A Parallel Classification Model for Marine Mammal Sounds Based on Multi-Dimensional Feature Extraction and Data Augmentation

Cai, Zhu, Zhang, et al. 2022. Sensors.

Due to the poor visibility of the deep-sea environment, acoustic signals are often collected and analyzed to explore the behavior of marine species. With the progress of underwater signal-acquisition technology, the amount of acoustic data obtained from the ocean has exceeded the limit that humans can process manually, so designing efficient marine-mammal classification algorithms has become a research hotspot. In this paper, we design a classification model based on a multi-channel parallel structure, which can process multi-dimensional acoustic features extracted from audio samples, and fuse the prediction results of different channels through a trainable fully connected layer. It uses transfer learning to obtain faster convergence speed, and introduces data augmentation to improve the classification accuracy. The k-fold cross-validation method was used to segment the data set to comprehensively evaluate the prediction accuracy and robustness of the model. The evaluation results showed that the model can achieve a mean accuracy of 95.21% while maintaining a standard deviation of 0.65%. There was excellent consistency in performance over multiple tests.


“…Wu et al [20] considered electronic means of pitch disguise and classified the speech into disguised or original speech using MFCC static and correlation features and a novel classification algorithm using support vector machines (SVM). Identification of weakly pitch shifted voice is performed in view of the forensic scenario [16]. We have also considered electronic pitch disguise in our work and extended the problem addressed in the study [20] for the classification of disguised voice into high pitch or low pitch voices.…”

Section: Pitch Disguise (mentioning)

confidence: 99%

“…Identifying whether a given test speech is disguised or original is the first step in ASR from disguised voices. In some works, deep features and neural network classifiers are used for this classification [15–18]. This classification is done in literature using both prosodic and cepstral features [16, 18–21].…”

Section: Introduction (mentioning)

confidence: 99%

“…In some works, deep features and neural network classifiers are used for this classification [15–18]. This classification is done in literature using both prosodic and cepstral features [16, 18–21]. Specific types of disguises are considered in most of the works like pitch disguised voices [16, 18–20], creaky voices [9, 17], mimicked voices [15, 21] etc.…”

Section: Introduction (mentioning)

confidence: 99%

Classification of Pitch and Gender of Speakers for Forensic Speaker Recognition from Disguised Voices Using Novel Features Learned by Deep Convolutional Neural Networks

Nair, Savithri. 2021.

Voice disguise is a major concern in forensic automatic speaker recognition (FASR). Classifying the type of disguise is very important for speaker recognition. Pitch disguise is a very common type of disguise that criminals attempt. Among the different types of disguises, high pitch and low pitch voices show more distortion. The features that are robust for high pitch and low pitch voices are different. Moreover, the effect of disguise on male and female voices is also different. In this work, we classified high pitch and low pitch disguised voices for male and female voices using a novel set of features. We arranged Mel frequency cepstral coefficients (MFCC), ΔMFCC, and ΔΔMFCC features as three-dimensional features, and these are given as the RGB equivalent spectrogram inputs to pretrained AlexNet deep convolutional neural network (DCNN). We fused the AlexNet output features with corresponding MFCC correlation features. These fused features are the proposed novel features for disguise classification. Classification using neural network (NN) and support vector machine (SVM) classifiers is performed. Simulation results show that classification with SVM classifier using these novel features gives improved accuracy of 98.89% compared to 95.99% accuracy obtained by using DCNN output features using traditional spectrogram inputs.
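The arrangement of MFCC, ΔMFCC, and ΔΔMFCC as an RGB-like three-channel input can be sketched as follows. This is only an illustration under stated assumptions: the 13 × 100 MFCC matrix is random placeholder data, the deltas use a plain central difference (`np.gradient`) rather than whatever delta formula the authors used, and the per-channel min-max scaling is a hypothetical normalization choice. Resizing to AlexNet's input size and the fusion with correlation features are not shown.

```python
import numpy as np

def delta(features):
    # Simple central-difference derivative along the time axis (axis=1).
    # Assumption: stands in for the regression-based delta often used in speech.
    return np.gradient(features, axis=1)

def to_unit_range(x):
    # Hypothetical per-channel min-max scaling to [0, 1].
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x)

# Placeholder MFCC matrix: 13 coefficients x 100 frames.
rng = np.random.default_rng(0)
mfcc = rng.standard_normal((13, 100))

d1 = delta(mfcc)   # delta MFCC
d2 = delta(d1)     # delta-delta MFCC

# Stack the three feature maps as the channels of one image-like array.
image = np.stack([to_unit_range(c) for c in (mfcc, d1, d2)], axis=-1)
print(image.shape)  # (13, 100, 3)
```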

“…In another investigation, scientists employed an auditory DA strategy to achieve an 82.6 percent accuracy for Mandarin-English code flipping ( Long et al, 2020 ). As presented in Ye et al (2020) pitch shifting is frequently utilized in DA and achieved 90% accuracy. In addition, Damskägg & Välimäki (2017) employed the time-stretched data augmentation approach when performing DA-based fuzzy identification on various audio signals.…”

Section: Introduction (mentioning)

confidence: 99%

Data augmentation and deep neural networks for the classification of Pakistani racial speakers recognition

Amjad, Khan, Chang. 2022. PeerJ Computer Science.

Speech emotion recognition (SER) systems have evolved into an important method for recognizing a person in several applications, including e-commerce, everyday interactions, law enforcement, and forensics. The SER system’s efficiency depends on the length of the audio samples used for testing and training. However, the different suggested models successfully obtained relatively high accuracy in this study. Moreover, the degree of SER efficiency is not yet optimum due to the limited database, resulting in overfitting and skewing samples. Therefore, the proposed approach presents a data augmentation method that shifts the pitch, uses multiple window sizes, stretches the time, and adds white noise to the original audio. In addition, a deep model is further evaluated to generate a new paradigm for SER. The data augmentation approach increased the limited amount of data from the Pakistani racial speaker speech dataset in the proposed system. The seven-layer framework was employed to provide the most optimal performance in terms of accuracy compared to other multilayer approaches. The seven-layer method is used in existing works to achieve a very high level of accuracy. The suggested system achieved 97.32% accuracy with a 0.032% loss in the 75%:25% splitting ratio. In addition, more than 500 augmentation data samples were added. Therefore, the proposed approach results show that deep neural networks with data augmentation can enhance the SER performance on the Pakistani racial speech dataset.
