ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021
DOI: 10.1109/icassp39728.2021.9414722

Siamese Capsule Network for End-to-End Speaker Recognition in the Wild

Abstract: We propose an end-to-end deep model for speaker verification in the wild. Our model uses thin-ResNet to extract speaker embeddings from utterances, and a Siamese capsule network with dynamic routing as the back-end to calculate a similarity score between the embeddings. We conduct a series of experiments comparing our model against baseline solutions, showing that our model outperforms the benchmarks while using substantially less training data. We also perform additional experiments to study the…

Cited by 15 publications (7 citation statements)
References 28 publications
“…Their uses include granting access to individuals [1] who intend to use the products/services provided by the smart devices or customize the provided services by personalizing the experience towards each user [2]. Recently, deep neural networks (DNN) have become the predominant mechanism used in SR systems [3,4,5,6,7,8]. This is mainly due to factors such as improved performance in comparison to traditional SR techniques as a result of state-of-the-art neural architectures and loss functions used in training [9].…”
Section: Introduction (mentioning)
confidence: 99%
“…Other research works on in-the-wild scenarios include those of Hajavi and Etemad [23] on speaker-recognition tasks and Nguyen et al [24] for animal-recognition tasks. Hajavi and Etemad 2021 [23] proposed a Siamese network architecture using capsules and dynamic routing for speaker verification in the wild. Their experimental results on the VoxCeleb dataset gave an error rate (EER) of 3.14%.…”
Section: Related Work On Multimodal-sensor Architectures For Speech R... (mentioning)
confidence: 99%
“…Deep audio representation learning has recently attracted significant interest, specially in applications such as speaker recognition (SR) [9], [10], [43], [31], [11] and speech emotion recognition (SER) [1], [18], [25]. The goal in deep audio representation learning is to learn embeddings from audio or visual signals, which could be used in retrieving…”
Section: Introduction (mentioning)
confidence: 99%
“…information such as identity or the emotional state of the speaker. This goal is generally best achieved when multimodal audio-visual inputs are used [31], [19], [2] as opposed to when only a single modality of audio or video is used [9], [10], [43], [23], [24], [11]. Nonetheless, in many real-world scenarios, both modalities may not be simultaneously available at inference, resulting in the inability of the model to perform effectively.…”
Section: Introduction (mentioning)
confidence: 99%