2018 IEEE International Conference on Multimedia and Expo (ICME)
DOI: 10.1109/icme.2018.8486441
Text-Independent Speaker Verification Using 3D Convolutional Neural Networks

Abstract: In this paper, a novel method using a 3D Convolutional Neural Network (3D-CNN) architecture is proposed for speaker verification in the text-independent setting. One of the main challenges is the creation of speaker models. Most previously reported approaches create speaker models by averaging the features extracted from a speaker's utterances, which is known as the d-vector system. In our paper, we propose adaptive feature learning, utilizing 3D-CNNs for direct speaker model c…
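As a point of reference, the d-vector baseline that the abstract contrasts with can be sketched as follows. This is a minimal illustration (not code from the paper): a speaker model is built by averaging per-utterance embeddings, and verification is a cosine-similarity test against that average. The function names and the threshold value are illustrative assumptions.

```python
import numpy as np

def dvector_speaker_model(utterance_embeddings):
    """Average per-utterance d-vectors into a single speaker model.

    Sketch of the d-vector baseline: the speaker model is simply the
    mean of the embeddings extracted from the enrollment utterances.
    """
    return np.mean(np.stack(utterance_embeddings, axis=0), axis=0)

def verify(test_embedding, speaker_model, threshold=0.7):
    """Accept the identity claim if cosine similarity exceeds a tuned threshold."""
    cos = np.dot(test_embedding, speaker_model) / (
        np.linalg.norm(test_embedding) * np.linalg.norm(speaker_model)
    )
    return cos >= threshold

# Toy example: three enrollment utterances, each a 4-dim embedding.
enroll = [np.array([1.0, 0.0, 0.0, 0.0]),
          np.array([0.9, 0.1, 0.0, 0.0]),
          np.array([1.1, -0.1, 0.0, 0.0])]
model = dvector_speaker_model(enroll)
print(verify(np.array([1.0, 0.05, 0.0, 0.0]), model))  # similar embedding -> True
```

The paper's point is that this averaging step discards within-speaker variation; its 3D-CNN instead learns the speaker model directly from stacked utterance features.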

Cited by 58 publications (37 citation statements)
References 20 publications
“…To address the problem, we propose to use the 3D Convolutional Neural Network models that have recently been employed for action recognition, scene understanding, and speaker verification, and have demonstrated promising results [16]-[18]. 3D CNNs concurrently extract features from both spatial and temporal dimensions, so motion information across adjacent frames is captured.…”
Section: Introduction
confidence: 99%
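The spatio-temporal extraction described in the statement above can be illustrated with a naive 3D convolution. This is a generic sketch, not the architecture from any of the cited papers: a single 3D filter slides jointly over the temporal axis and the two spatial/feature axes, so every output value mixes information from adjacent frames. Shapes and the "valid" padding choice are illustrative assumptions.

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive 'valid' 3D convolution (implemented as cross-correlation).

    volume: (T, H, W) stack of T frames; kernel: (t, h, w) 3D filter.
    The filter spans several consecutive frames at once, which is how
    3D CNNs capture temporal (motion) information alongside spatial cues.
    """
    T, H, W = volume.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(volume[i:i + t, j:j + h, k:k + w] * kernel)
    return out

# E.g. 20 frames of a 40x40 feature map convolved with a 3x3x3 filter.
x = np.random.randn(20, 40, 40)
k = np.random.randn(3, 3, 3)
print(conv3d_valid(x, k).shape)  # (18, 38, 38)
```

A 2D CNN applied frame-by-frame would collapse the temporal axis immediately; keeping a kernel extent greater than 1 along T is what distinguishes the 3D variant.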
“…Continuous integration, for instant error checking and validation of changes, has been deployed for SpeechPy. Moreover, prior to the latest official release of SpeechPy, the package had already been successfully used for research purposes [19,20].…”
Section: Overview
confidence: 99%
“…The coupling of deep neural networks with large-scale labelled training datasets has produced a number of notable successes, yielding improved performance in speech related tasks such as ASR [1] and speaker verification [2,3]. However, the considerable cost of manually producing such labels ultimately limits the potential of fully supervised approaches.…”
Section: Introduction
confidence: 99%
“…In this work, we make the following contributions: (1) We propose a novel framework for learning speech representations capturing information at different time scales in the speech signal, including in particular the identity of the speaker; (2) we show that we can learn these representations from a large, unlabelled collection of "talking faces" in videos as a source of free supervision, without the need for any manual annotation; (3) we show that sharing a trunk architecture for two different tasks (content and speaker identity) and adding an explicit disentanglement objective between the two improves performance; and, (4) we evaluate the performance of our self-supervised embeddings on the popular VoxCeleb1 speaker recognition benchmark and compare to fully supervised methods. All data, code and models will be released.…”
Section: Introduction
confidence: 99%