2021
DOI: 10.1109/access.2021.3137190
Bootstrap Equilibrium and Probabilistic Speaker Representation Learning for Self-Supervised Speaker Verification

Abstract: In this paper, we propose self-supervised speaker representation learning strategies, which comprise bootstrap equilibrium speaker representation learning in the front-end and uncertainty-aware probabilistic speaker embedding training in the back-end. In the front-end stage, we learn the speaker representations via a bootstrap training scheme with a uniformity regularization term. In the back-end stage, the probabilistic speaker embeddings are estimated by maximizing the mutual likelihood score between…
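The abstract names the key ingredients (a bootstrap, teacher-student objective with a uniformity regularizer in the front-end, and a mutual likelihood score over probabilistic embeddings in the back-end) without giving formulas. Below is a minimal PyTorch sketch of common formulations consistent with that terminology: a BYOL-style bootstrap loss, the uniformity term of Wang and Isola (2020), and the mutual likelihood score used for probabilistic embeddings (Shi and Jain, 2019). This is an orientation aid under those assumptions, not an implementation verified against the paper; the batch size, embedding dimension, and loss weight are illustrative.

    import math
    import torch
    import torch.nn.functional as F

    def uniformity_loss(z: torch.Tensor, t: float = 2.0) -> torch.Tensor:
        # Uniformity regularizer (Wang & Isola, 2020):
        # log E[exp(-t * ||z_i - z_j||^2)] over all pairs of L2-normalized embeddings.
        z = F.normalize(z, dim=-1)
        sq_dists = torch.pdist(z, p=2).pow(2)  # pairwise squared distances
        return sq_dists.mul(-t).exp().mean().log()

    def bootstrap_loss(online_pred: torch.Tensor, target_proj: torch.Tensor) -> torch.Tensor:
        # BYOL-style bootstrap objective: align the online network's prediction
        # with the stop-gradient output of an EMA-updated target network.
        p = F.normalize(online_pred, dim=-1)
        y = F.normalize(target_proj.detach(), dim=-1)
        return (2.0 - 2.0 * (p * y).sum(dim=-1)).mean()

    def mutual_likelihood_score(mu1, logvar1, mu2, logvar2):
        # Mutual likelihood score between diagonal-Gaussian embeddings
        # N(mu1, var1) and N(mu2, var2); higher means more likely the same speaker.
        var = logvar1.exp() + logvar2.exp()
        d = mu1.shape[-1]
        return -0.5 * (((mu1 - mu2) ** 2 / var + var.log()).sum(dim=-1)
                       + d * math.log(2.0 * math.pi))

    # Illustrative front-end step on two augmented views of the same utterances.
    pred, tgt = torch.randn(32, 192), torch.randn(32, 192)
    front_end = bootstrap_loss(pred, tgt) + 0.1 * uniformity_loss(pred)  # weight assumed

In probabilistic-embedding approaches of this kind, maximizing the mutual likelihood score plays the role that cosine scoring plays for deterministic embeddings, while the per-dimension variances let uncertain utterances contribute less to the verification decision.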

Cited by 5 publications (10 citation statements) | References 31 publications
“…Considering that state-of-the-art methods (Rao et al. 2020; Chen et al. 2021; Mun et al. 2022) […], where y_t = 1 represents the boundary and y_t = 0 otherwise. To this end, we first introduce a context-aware Transformer encoder to model the contextual information, and then propose a self-supervised learning scheme with shot-to-scene pretext tasks to learn discriminative shot representations for segmentation.…”
Section: Proposed Methods
Mentioning confidence: 99%
“…For example, Chen et al. (2021) presented a self-supervised shot-embedding approach that learns a shot representation by maximizing the similarity between nearby shots relative to randomly selected shots. Mun et al. (2022) pre-trained a Transformer encoder with pseudo-boundaries and then fine-tuned the encoder with labeled data. Nevertheless, these methods adopt sophisticated model architectures without carefully considering the contextual information of the long-term video.…”
Section: Related Work
Mentioning confidence: 99%
“…Self-supervision methods that have emerged in deep learning have also been applied to training speaker embedding extractors [8,9,10,11,12]. Several approaches have been examined, some of which employ an audiovisual setting [13,14].…”
Section: Introduction
Mentioning confidence: 99%