ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414704
CAM: Context-Aware Masking for Robust Speaker Verification

Abstract: Performance degradation caused by noise has been a long-standing challenge for speaker verification. Previous methods usually involve applying a denoising transformation to speaker embeddings or enhancing input features. Nevertheless, these methods are lossy and inefficient for speaker embedding. In this paper, we propose context-aware masking (CAM), a novel method to extract robust speaker embedding. CAM enables the speaker embedding network to "focus" on the speaker of interest and "blur" unrelated noise. The…
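The abstract's idea — predicting a mask from pooled context and applying it to frame-level features so the network emphasizes the target speaker — can be sketched as follows. The pooling choice, bottleneck size, and sigmoid mask are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

def context_aware_mask(x, w1, b1, w2, b2):
    """Sketch of context-aware masking: pool a context vector over time,
    predict a per-channel sigmoid mask from it, and reweight the features.
    x: (channels, time) frame-level feature map; weight shapes are
    illustrative assumptions."""
    context = x.mean(axis=1)                           # (C,) global context
    hidden = np.maximum(w1 @ context + b1, 0.0)        # bottleneck + ReLU
    mask = 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))   # mask values in (0, 1)
    return x * mask[:, None]                           # "focus" / "blur" channels

rng = np.random.default_rng(0)
C, T, H = 8, 50, 4
x = rng.standard_normal((C, T))
out = context_aware_mask(x, rng.standard_normal((H, C)), np.zeros(H),
                         rng.standard_normal((C, H)), np.zeros(C))
print(out.shape)  # (8, 50)
```

Because the mask lies in (0, 1), the operation can only attenuate channels, never amplify them — matching the "blur unrelated noise" intuition.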

Cited by 10 publications (5 citation statements)
References 26 publications
“…Masked filtering allows us to extract relatively clean embeddings from overlapping speech. This is well illustrated in our previous work [19]. Second, if there is only one speaker in the entire segment, it is in our best interest to include as much information as possible.…”
Section: "Winner Takes All" Masked Filtering
confidence: 61%
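The masked filtering this statement describes can be illustrated with a minimal sketch: frame embeddings are pooled with weights given by a per-frame target-speaker mask, so overlapped or noisy frames contribute little. The mask here is hand-set for illustration; in the cited work it comes from a mask-prediction network:

```python
import numpy as np

def masked_mean_pool(frames, mask, eps=1e-8):
    """Pool frame-level embeddings weighted by a per-frame mask.
    frames: (time, dim); mask: (time,) with higher values for frames
    dominated by the target speaker (illustrative assumption)."""
    w = mask / (mask.sum() + eps)   # normalize mask to pooling weights
    return frames.T @ w             # (dim,) masked mean embedding

frames = np.vstack([np.tile([1.0, 0.0], (5, 1)),   # target-speaker frames
                    np.tile([0.0, 1.0], (5, 1))])  # interfering frames
mask = np.array([1.0] * 5 + [0.0] * 5)
emb = masked_mean_pool(frames, mask)
print(emb)  # ≈ [1. 0.] — only target frames contribute
```

With a uniform (all-ones) mask this reduces to plain mean pooling, which is why the mask is what buys robustness on overlapping speech.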
“…The mask prediction network is trained on the same setup as described in [19]. The only difference is that, instead of taking a rough mean pooling as the target embedding, we conduct a 2-class clustering and only run mean pooling on the dominating class.…”
Section: Methods
confidence: 99%
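The target-embedding step described here — 2-class clustering over frame embeddings, then mean pooling only the dominating class instead of a rough mean over all frames — can be sketched with a simple k-means. The initialization scheme and Euclidean distance are illustrative assumptions:

```python
import numpy as np

def dominant_class_target(embs, iters=10):
    """2-class k-means over frame embeddings, then mean pooling over the
    larger (dominating) cluster. Initialization via extreme points along
    the first dimension is an illustrative simplification."""
    centers = embs[[embs[:, 0].argmin(), embs[:, 0].argmax()]].copy()
    for _ in range(iters):
        d = np.linalg.norm(embs[:, None] - centers[None], axis=2)  # (N, 2)
        labels = d.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = embs[labels == k].mean(axis=0)
    dominant = np.bincount(labels, minlength=2).argmax()
    return embs[labels == dominant].mean(axis=0)

# 8 frames near +5 (dominant speaker), 3 frames near -5 (interference)
embs = np.vstack([np.random.default_rng(1).normal(0, 0.1, (8, 3)) + 5,
                  np.random.default_rng(2).normal(0, 0.1, (3, 3)) - 5])
target = dominant_class_target(embs)
print(target)  # mean of the dominating (+5) cluster
```

A plain mean over all frames would land between the two clusters; pooling only the dominating class keeps the target embedding clean.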
“…1) External adjustment: In ASV, models with a "BN-ReLU-TDNN" structure are often used to achieve better performance [32], [36]. However, BN and TDNN are separated by a non-linear activation function, which means the two linear operators cannot be combined into one sequential layer [32].…”
Section: B. Re-parameterization for TMS
confidence: 99%
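The obstacle this statement describes can be made concrete: when BatchNorm feeds a linear (1×1 TDNN) layer directly, the two fold into a single affine map, and it is exactly the ReLU sitting between them in "BN-ReLU-TDNN" that blocks the merge. A numeric check, with illustrative shapes:

```python
import numpy as np

def fold_bn_into_linear(W, b, gamma, beta, mu, var, eps=1e-5):
    """Fold a per-channel BatchNorm (gamma, beta, running mu/var) into a
    following linear layer (W, b): W @ BN(x) + b == W_folded @ x + b_folded.
    Shapes and ordering are illustrative assumptions."""
    scale = gamma / np.sqrt(var + eps)        # BN's per-channel scale
    W_folded = W * scale[None, :]             # absorb scale into the weights
    b_folded = W @ (beta - mu * scale) + b    # absorb shift into the bias
    return W_folded, b_folded

rng = np.random.default_rng(0)
W, b = rng.standard_normal((4, 6)), rng.standard_normal(4)
gamma, beta = rng.standard_normal(6), rng.standard_normal(6)
mu, var = rng.standard_normal(6), rng.random(6) + 0.5
x = rng.standard_normal(6)

bn_out = gamma * (x - mu) / np.sqrt(var + 1e-5) + beta
Wf, bf = fold_bn_into_linear(W, b, gamma, beta, mu, var)
print(np.allclose(W @ bn_out + b, Wf @ x + bf))  # True
```

Insert a ReLU between `bn_out` and the matrix multiply and no such `(Wf, bf)` pair exists, since the composition is no longer affine — hence the "external adjustment" workaround the citing paper discusses.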
“…The task of speaker verification aims to determine whether a speaker and a registered identity are the same person [1]. Currently, the mainstream approach to speaker verification relies on deep learning techniques [2][3][4][5][6]. This method initially employs deep learning to extract speaker embeddings, and then estimates the similarity score between the speaker and the registered identity using cosine similarity, thereby determining whether they are the same speaker [7].…”
Section: Introduction
confidence: 99%
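The scoring step described here — cosine similarity between the test speaker's embedding and the enrolled identity's embedding — in a minimal sketch. The 0.5 decision threshold is an illustrative assumption; real systems tune it on development data:

```python
import numpy as np

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings, as used to
    decide whether test and enrolled utterances share a speaker."""
    a, b = np.asarray(emb_a), np.asarray(emb_b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

enrolled = np.array([0.6, 0.8, 0.0])
test_same = np.array([0.6, 0.79, 0.05])   # close in direction: accept
test_diff = np.array([-0.8, 0.1, 0.6])    # different direction: reject
print(cosine_score(enrolled, test_same) > 0.5)  # True
print(cosine_score(enrolled, test_diff) > 0.5)  # False
```

Because the score depends only on direction, embedding magnitude (which varies with utterance length and energy) does not affect the decision.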