2020
DOI: 10.1155/2020/8927031

Identification of Weakly Pitch-Shifted Voice Based on Convolutional Neural Network

Abstract: Pitch shifting is a common voice editing technique in which the original pitch of a digital voice is raised or lowered. It can be abused by malicious attackers to conceal their true identity. Existing forensic detection methods are no longer effective for weakly pitch-shifted voice. In this paper, we propose a convolutional neural network (CNN) to detect not only strongly pitch-shifted voice but also weakly pitch-shifted voice, for which the shifting factor is less than ±4 semitones. Specifically,…
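
For reference, a shift of n semitones in twelve-tone equal temperament scales every frequency by 2^(n/12). The short sketch below is not taken from the paper: the function names `semitone_ratio` and `is_weak_shift` are hypothetical, and treating "weak" as a nonzero shift of magnitude below 4 semitones simply follows the wording of the abstract.

```python
def semitone_ratio(n_semitones: float) -> float:
    """Frequency scaling factor for a shift of n semitones (equal temperament)."""
    return 2.0 ** (n_semitones / 12.0)


def is_weak_shift(n_semitones: float, threshold: float = 4.0) -> bool:
    """Assumption: 'weak' means a nonzero shift whose magnitude is below the threshold."""
    return 0 < abs(n_semitones) < threshold


if __name__ == "__main__":
    # Compare a few strong and weak shifts in both directions.
    for n in (-8, -4, -2, 2, 4, 8):
        print(f"{n:+d} semitones -> ratio {semitone_ratio(n):.4f}, weak: {is_weak_shift(n)}")
```

A weak shift changes the pitch by only a small ratio, which is presumably why it is harder to detect than a strong one.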

citations
Cited by 5 publications
(6 citation statements)
references
References 18 publications
“…Finally, according to the number of Mel filters selected, the Mel filter bank matrix H_m(k) with size of 513 × 128 was generated and multiplied by the results of STFT to obtain a power spectrum of Mel scale, i.e., a 59 × 128 real matrix X_m(t). The calculation process is shown in Equations (3) and (4) [63]: where X_m(t) denotes the power spectrum matrix, L denotes the number of data frames in the direction of time axis, M denotes the number of Mel filter banks, S_t(k) denotes the spectrum obtained by STFT, H_m(k) is the expression of Mel filter banks, where o(m), c(m), and h(m) are frequency points, and their interval determines the type of filter banks.…”
Section: Methods (citation type: mentioning)
confidence: 99%
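
The matrix product described in the statement above can be sketched as follows. This is an illustration under stated assumptions, not the citing paper's Equations (3) and (4), which are not reproduced in this report: the 16 kHz sample rate, the HTK-style mel formula, the triangular filter shape, and the names `mel_filter_bank` and `mel_power_spectrum` are all assumptions. Here o(m), c(m), and h(m) are taken to be the lower, center, and upper frequencies of filter m. Only the standard library is used so that the sketch runs anywhere.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    # HTK-style mel formula (an assumption; the source does not name a variant).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_bank(n_bins: int = 513, n_mels: int = 128, sample_rate: int = 16000):
    """Return H as a list of n_bins rows by n_mels columns of triangular weights."""
    nyquist = sample_rate / 2.0
    mel_max = hz_to_mel(nyquist)
    # n_mels + 2 edge frequencies, equally spaced on the mel scale.
    edges = [mel_to_hz(mel_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    # Center frequency in Hz of each STFT bin k.
    bin_hz = [nyquist * k / (n_bins - 1) for k in range(n_bins)]
    H = [[0.0] * n_mels for _ in range(n_bins)]
    for m in range(n_mels):
        o, c, h = edges[m], edges[m + 1], edges[m + 2]  # lower, center, upper
        for k, f in enumerate(bin_hz):
            if o < f <= c:
                H[k][m] = (f - o) / (c - o)  # rising slope
            elif c < f < h:
                H[k][m] = (h - f) / (h - c)  # falling slope
    return H


def mel_power_spectrum(power_frames, H):
    """Multiply an L x n_bins power spectrum by the n_bins x n_mels bank."""
    columns = list(zip(*H))  # one tuple of n_bins weights per mel filter
    return [[sum(p * w for p, w in zip(frame, col)) for col in columns]
            for frame in power_frames]


if __name__ == "__main__":
    H = mel_filter_bank()
    frames = [[1.0] * 513 for _ in range(59)]  # placeholder for |STFT|^2
    X = mel_power_spectrum(frames, H)
    print(len(X), "x", len(X[0]))  # L frames by M mel bands
```

With 128 filters over 513 bins, the lowest filters are narrower than one STFT bin at this assumed sample rate, so a few columns of H may be entirely zero; practical implementations typically warn about or normalize for this.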
“…Wu et al. [20] considered electronic means of pitch disguise and classified the speech into disguised or original speech using MFCC static and correlation features and a novel classification algorithm using support vector machines (SVM). Identification of weakly pitch-shifted voice is performed in view of the forensic scenario [16]. We have also considered electronic pitch disguise in our work and extended the problem addressed in the study [20] for the classification of disguised voice into high pitch or low pitch voices.…”
Section: Pitch Disguise (citation type: mentioning)
confidence: 99%
“…Identifying whether a given test speech is disguised or original is the first step in ASR from disguised voices. In some works, deep features and neural network classifiers are used for this classification [15–18]. This classification is done in literature using both prosodic and cepstral features [16, 18–21].…”
Section: Introduction (citation type: mentioning)
confidence: 99%
“…In another investigation, scientists employed an auditory DA strategy to achieve an 82.6 percent accuracy for Mandarin-English code flipping (Long et al., 2020). As presented in Ye et al. (2020), pitch shifting is frequently utilized in DA and achieved 90% accuracy. In addition, Damskägg & Välimäki (2017) employed the time-stretched data augmentation approach when performing DA-based fuzzy identification on various audio signals.…”
Section: Introduction (citation type: mentioning)
confidence: 99%