2021
DOI: 10.1007/s00521-021-06083-7

End-to-end recurrent denoising autoencoder embeddings for speaker identification

Abstract: Speech 'in-the-wild' is a handicap for speaker recognition systems due to the variability induced by real-life conditions, such as environmental noise and emotions in the speaker. Taking advantage of representation learning, in this paper we aim to design a recurrent denoising autoencoder that extracts robust speaker embeddings from noisy spectrograms to perform speaker identification. The end-to-end proposed architecture uses a feedback loop to encode information regarding the speaker into low-dimensional rep…
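The idea in the abstract — a recurrent encoder that compresses noisy spectrogram frames into a low-dimensional speaker embedding, trained by reconstructing the clean signal — can be sketched at toy scale in plain NumPy. All dimensions, weights, and the single-layer tanh cell below are illustrative assumptions; the paper's actual architecture and its feedback loop are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_encoder(spectrogram, Wx, Wh, b):
    """Tanh RNN over spectrogram frames; the final hidden state serves
    as the speaker embedding (a stand-in for the paper's recurrent
    encoder, not its exact cell)."""
    h = np.zeros(Wh.shape[0])
    for frame in spectrogram:            # iterate over time frames
        h = np.tanh(Wx @ frame + Wh @ h + b)
    return h

def decoder(embedding, n_frames, Wy, by):
    """Linear decoder reconstructing n_frames clean frames from the embedding."""
    return np.stack([Wy @ embedding + by for _ in range(n_frames)])

# Hypothetical sizes: 40 mel bins, 100 frames, 16-dim embedding.
n_mels, n_frames, emb_dim = 40, 100, 16
Wx = rng.normal(scale=0.1, size=(emb_dim, n_mels))
Wh = rng.normal(scale=0.1, size=(emb_dim, emb_dim))
b  = np.zeros(emb_dim)
Wy = rng.normal(scale=0.1, size=(n_mels, emb_dim))
by = np.zeros(n_mels)

clean = rng.normal(size=(n_frames, n_mels))
noisy = clean + 0.3 * rng.normal(size=(n_frames, n_mels))  # additive noise

emb   = rnn_encoder(noisy, Wx, Wh, b)     # low-dimensional speaker embedding
recon = decoder(emb, n_frames, Wy, by)
loss  = np.mean((recon - clean) ** 2)     # denoising reconstruction objective
```

In a trained model the weights would be fit by minimizing `loss` over many utterances, and `emb` would then feed a speaker-identification classifier.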

Citations: Cited by 6 publications (1 citation statement)
References: 35 publications
“…This can improve the accuracy and efficiency of speech recognition systems, especially in noisy or variable acoustic environments (Sayed et al. 2023; Wubet and Lian 2022). Additionally, AEs can be used for speaker identification, where the AE can learn to distinguish between different speakers based on their speech patterns (Liao et al. 2022; Rituerto-González and Peláez-Moreno 2021). A popular approach uses a CNN as the encoder to extract local features from the audio signal and an RNN as the decoder to capture the temporal dependencies in the speech signal, with the output of the RNN decoder able to transcribe the speech signal (Palaz and Collobert 2015; Rusnac and Grigore 2022).…”
Section: Speech Processing
Confidence: 99%
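The CNN-plus-RNN pipeline the citation statement describes — convolutional filters extracting local features from the raw signal, then a recurrent layer modeling their temporal dependencies — can be sketched as follows. The kernel count, stride, and hidden size are hypothetical, and this toy NumPy version illustrates only the feature flow, not any cited model.

```python
import numpy as np

rng = np.random.default_rng(1)

def conv1d_encoder(signal, kernels, stride=4):
    """1-D convolutional encoder: each kernel slides over the raw
    waveform and extracts local features, with a ReLU nonlinearity."""
    k = kernels.shape[1]
    n_out = (len(signal) - k) // stride + 1
    feats = np.empty((n_out, kernels.shape[0]))
    for i in range(n_out):
        window = signal[i * stride : i * stride + k]
        feats[i] = np.maximum(kernels @ window, 0.0)
    return feats

def rnn_over_features(feats, Wx, Wh, b):
    """Tanh RNN capturing temporal dependencies across feature frames;
    returns the full hidden-state sequence."""
    h = np.zeros(Wh.shape[0])
    states = []
    for f in feats:
        h = np.tanh(Wx @ f + Wh @ h + b)
        states.append(h)
    return np.stack(states)

# Hypothetical sizes: 8 conv kernels of width 16, 12-dim hidden state.
n_kernels, k, hid = 8, 16, 12
kernels = rng.normal(scale=0.1, size=(n_kernels, k))
Wx = rng.normal(scale=0.1, size=(hid, n_kernels))
Wh = rng.normal(scale=0.1, size=(hid, hid))
b  = np.zeros(hid)

wave   = rng.normal(size=400)                 # toy raw audio signal
feats  = conv1d_encoder(wave, kernels)        # local features (time x kernels)
states = rnn_over_features(feats, Wx, Wh, b)  # temporal context per frame
```

In a transcription system the per-frame states would feed an output layer predicting character or phoneme probabilities; here they simply demonstrate the encoder/decoder division of labor.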