2021
DOI: 10.1016/j.asoc.2021.107141

CASA-based speaker identification using cascaded GMM-CNN classifier in noisy and emotional talking conditions

Cited by 57 publications (20 citation statements)
References 40 publications
“…In the proposed work, the extracted features for the speech signal optimum representation are the Mel-frequency cepstral coefficients (MFCC) [2]. MFCC is a fundamental feature that is utilized in speaker and emotion recognition by virtue of the advanced representation of human auditory perception it provides [31][32][33]. MFCC is based on human hearing perceptions, which means that it relies on human listening features that cannot perceive frequencies over 1000 Hz.…”
Section: Feature Extraction | mentioning
confidence: 99%
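The statement above ties MFCC to human auditory perception. A minimal sketch of the mel scale underlying MFCC may help make that concrete. It assumes the widely used conversion m = 2595 log10(1 + f/700), which the excerpt does not specify; the function names are illustrative and not taken from the cited work. The mel scale is commonly described as approximately linear below about 1000 Hz and logarithmic above it, so mel-spaced filters grow wider in Hz as frequency rises.

```python
import numpy as np


def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (assumed formula: 2595*log10(1 + f/700))."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_min_hz, f_max_hz):
    """Center frequencies (Hz) of n_filters bands spaced evenly on the mel scale."""
    mels = np.linspace(hz_to_mel(f_min_hz), hz_to_mel(f_max_hz), n_filters + 2)
    # Drop the two edge points; the interior points are the filter centers.
    return mel_to_hz(mels)[1:-1]


# Hypothetical setup: 10 filters between 0 Hz and 8 kHz.
centers = mel_center_frequencies(10, 0.0, 8000.0)
```

The spacing between consecutive entries of `centers` increases with frequency, which is how the filterbank allocates finer resolution to the low frequencies.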
“…The major motives for employing dimensionality reduction in machine learning are to enhance each of the prediction performance and the learning efficiency, to deliver faster prediction demanding less information on the original data, to decrease complexity and time of the learning outcomes and allow well understanding of the underlying procedure. This is very important when the input vector is large such as speech processing related problems [9], [10]. Lower data dimensions lead to less computing time and complexity with much less storage.…”
Section: Figure 1 Dimensionality Reduction Taxonomy | mentioning
confidence: 99%
“…The t-SNE transforms high-dimensional Euclidean distances into conditional probabilities showing data similarity for each set using Stochastic Neighbor Embedding (SNE) [21]. The conditional probability $p_{a|b}$, defined in the equation below, exemplifies the resemblance of data $x_a$ to data $x_b$ [20]: (10) Equation (10) calculates the distance between two data points $x_a$ and $x_b$ using a Gaussian distribution over $x_b$ and a given variance of $\sigma^2$, where it differs for each data set and is chosen so that data from dense areas have smaller variance than data from sparse areas [20]. Then, a "Student t-distribution" is utilized as a substitute of utilizing the Gaussian distribution with one degree of freedom, close to the Cauchy distribution, is used to get the second set of probabilities ($Q_{a|b}$) in the low dimension space [22].…”
Section: T-distributed Stochastic Neighbor Embedding (t-SNE) | mentioning
confidence: 99%
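Equation (10) itself is not reproduced in the excerpt above. The sketch below shows how the two sets of probabilities it describes are usually computed in t-SNE: a Gaussian centered on each point in the high-dimensional space, and a Student t-distribution with one degree of freedom in the low-dimensional space. It follows the standard formulation rather than the citing paper's exact equation; the function names, the per-point bandwidths `sigmas`, and the choice to normalize the low-dimensional similarities over all pairs are assumptions.

```python
import numpy as np


def squared_distances(Z):
    """Pairwise squared Euclidean distances between the rows of Z."""
    sq = np.sum(Z * Z, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * (Z @ Z.T)
    return np.maximum(D, 0.0)  # clip tiny negatives caused by round-off


def gaussian_conditional_probabilities(X, sigmas):
    """Return P with P[b, a] = p_{a|b}: a Gaussian centered on x_b, bandwidth sigmas[b]."""
    D = squared_distances(X)
    sigmas = np.asarray(sigmas, dtype=float)
    logits = -D / (2.0 * sigmas[:, None] ** 2)
    np.fill_diagonal(logits, -np.inf)            # a point is not its own neighbor
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)      # each row sums to 1


def student_t_similarities(Y):
    """Low-dimensional similarities from a Student t kernel with one degree of freedom."""
    W = 1.0 / (1.0 + squared_distances(Y))
    np.fill_diagonal(W, 0.0)
    return W / W.sum()                           # normalized over all pairs


# Hypothetical data: 6 points in 20 dimensions and a random 2-D embedding.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 20))
Y = rng.normal(size=(6, 2))
P = gaussian_conditional_probabilities(X, sigmas=np.ones(6))
Q = student_t_similarities(Y)
```

In a full t-SNE implementation each entry of `sigmas` would be tuned per point (typically by a search on perplexity), which is what yields smaller variances in dense regions; here they are fixed to 1 only to keep the sketch short.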
“…Most works test the corruption of signals by noise only in an additive manner [12][13][14][15]. However, simply adding noise does not represent real environments, because the noise is also affected by room reverberation.…”
Section: Introduction | unclassified
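The distinction drawn in this statement can be sketched as follows: purely additive corruption mixes clean speech with noise at a target signal-to-noise ratio, whereas a room simulation first convolves both speech and noise with a room impulse response (RIR). This is an illustrative sketch, not the citing paper's method; the function names and the synthetic exponentially decaying RIR are assumptions, and a real experiment would use measured or simulated room responses.

```python
import numpy as np


def add_noise_at_snr(speech, noise, snr_db):
    """Additive corruption: scale the noise so speech/noise power matches snr_db."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    gain = np.sqrt(target_noise_power / noise_power)
    return speech + gain * noise


def reverberate(signal, rir):
    """Convolve a signal with a room impulse response, keeping the original length."""
    return np.convolve(signal, rir)[: len(signal)]


def synthetic_rir(length, decay, rng):
    """Hypothetical RIR: direct path plus exponentially decaying noise, unit energy."""
    rir = rng.normal(size=length) * np.exp(-decay * np.arange(length))
    rir[0] = 1.0
    return rir / np.sqrt(np.sum(rir ** 2))


def simulate_room(speech, noise, rir_speech, rir_noise, snr_db):
    """Reverberate speech and noise separately, then mix at the target SNR."""
    return add_noise_at_snr(
        reverberate(speech, rir_speech), reverberate(noise, rir_noise), snr_db
    )


# Hypothetical signals standing in for one second of 16 kHz audio.
rng = np.random.default_rng(0)
speech = rng.normal(size=16000)
noise = rng.normal(size=16000)
additive_only = add_noise_at_snr(speech, noise, snr_db=10.0)
with_room = simulate_room(
    speech, noise, synthetic_rir(800, 0.01, rng), synthetic_rir(800, 0.01, rng), 10.0
)
```

Using separate impulse responses for speech and noise reflects the fact that the two sources usually sit at different positions in the room.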
“…Currently, works that address real situations use databases that already provide data recorded under real-environment conditions [12][13][14][15], such as SITW [16] and NIST 2010 retransmitted [17]. As a result, models achieve error rates below 10% in their experiments, yet they can perform considerably worse when voice-based systems are actually used [18].…”
Section: Introduction | unclassified