Analysis of BUT Submission in Far-Field Scenarios of VOiCES 2019 Challenge

Matějka, Pavel; Plchot, Oldřich; Zeinali, Hossein; Mošner, Ladislav; Silnova, Anna; Burget, Lukáš; Novotny, Ondrej; Glembek, Ondřej

doi:10.21437/interspeech.2019-2471

Cited by 17 publications

(10 citation statements)

References 19 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…For this purpose, using a public recipe in Kaldi is a reasonable choice. Readers can refer to [23], [24] for SOTA performance on the same task.…”

Section: Methodsmentioning

confidence: 99%

An MAP Estimation for Between-Class Variance

Han¹,

Cai²,

Li³

et al. 2021

Preprint

View full text Add to dashboard Cite

Probabilistic linear discriminant analysis (PLDA) has been widely used in open-set verification tasks, such as speaker verification. A potential issue of this model is that the training set often contains limited number of classes, which makes the estimation for the between-class variance unreliable. This unreliable estimation often leads to degraded generalization. In this paper, we present an MAP estimation for the between-class variance, by employing an Inverse-Wishart prior. A key problem is that with hierarchical models such as PLDA, the prior is placed on the variance of class means while the likelihood is based on class members, which makes the posterior inference intractable. We derive a simple MAP estimation for such a model, and test it in both PLDA scoring and length normalization. In both cases, the MAP-based estimation delivers interesting performance improvement.

show abstract

“…For this purpose, using a public recipe in Kaldi is a reasonable choice. Readers can refer to [23], [24] for SOTA performance on the same task.…”

Section: Methodsmentioning

confidence: 99%

An MAP Estimation for Between-Class Variance

Han¹,

Cai²,

Li³

et al. 2021

Preprint

View full text Add to dashboard Cite

show abstract

“…This suggests that there should be little difference between attentive For multi-head attentive pooling, the feature sequence ({hc,t} T −1 t=0 in ( 5)) in the first row corresponds to an utterance randomly selected from the VoxCeleb1 development set. For attentive STSP, the feature sequence is a random row vector in G of (12). Note that the unit in the horizontal axis is the frame index t in (5) and (9).…”

Section: A Performance On Various Evaluationsmentioning

confidence: 99%

Aggregating Frame-Level Information in the Spectral Domain With Self-Attention for Speaker Embedding

Tu¹,

Mak²

2021

Preprint

View full text Add to dashboard Cite

<pre><pre>Most pooling methods in state-of-the-art speaker embedding networks are implemented in the temporal domain. However, due to the high non-stationarity in the feature maps produced from the last frame-level layer, it is not advantageous to use the global statistics (e.g., means and standard deviations) of the temporal feature maps as aggregated embeddings. This motivates us to explore stationary spectral representations and perform aggregation in the spectral domain. In this paper, we propose attentive short-time spectral pooling (attentive STSP) from a Fourier perspective to exploit the local stationarity of the feature maps. In attentive STSP, for each utterance, we compute the spectral representations through a weighted average of the windowed segments within each spectrogram by attention weights and aggregate their lowest spectral components to form the speaker embedding. Because most energy of the feature maps is concentrated in the low-frequency region in the spectral domain, attentive STSP facilitates the information aggregation by retaining the low spectral components only. Moreover, due to the segment-level attention mechanism, attentive STSP can produce smoother attention weights (weights with less variations) than attentive pooling and generalize better to unseen data, making it more robust against the adverse effect of the non-stationarity in the feature maps. Attentive STSP is shown to consistently outperform attentive pooling on VoxCeleb1, VOiCES19-eval, SRE16-eval, and SRE18-CMN2-eval. This observation suggests that applying segment-level attention and leveraging low spectral components can produce discriminative speaker embeddings.</pre></pre>

show abstract

“…Almost all systems proposed during the challenge exploited different architectures of neural networks to obtain deep speaker representations. To reduce the effects of room reverberation and various kinds of distortions, some researches use more accurate task-oriented data augmentation [15,16,17,18] and speech enhancement methods [16] based on single-channel weighted prediction error (WPE) [19].…”

Section: Speaker Embeddings For Distant Speaker Recognitionmentioning

confidence: 99%

Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances

Gusev¹,

Volokhov²,

Andzhukaev³

et al. 2020

The Speaker and Language Recognition Workshop (Odyssey 2020)

View full text Add to dashboard Cite

Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions according to the results obtained for early NIST SRE (Speaker Recognition Evaluation) datasets. From the practical point of view, taking into account the increased interest in virtual assistants (such as Amazon Alexa, Google Home, Apple Siri, etc.), speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks. This paper presents approaches aimed to achieve two goals: a) improve the quality of far-field speaker verification systems in the presence of environmental noise, reverberation and b) reduce the system quality degradation for short utterances. For these purposes, we considered deep neural network architectures based on TDNN (Time Delay Neural Network) and ResNet (Residual Neural Network) blocks. We experimented with state-of-the-art embedding extractors and their training procedures. Obtained results confirm that ResNet architectures outperform the standard x-vector approach in terms of speaker verification quality for both longduration and short-duration utterances. We also investigate the impact of speech activity detector, different scoring models, adaptation and score normalization techniques. The experimental results are presented for publicly available data and verification protocols for the VoxCeleb1, VoxCeleb2, and VOiCES datasets.

show abstract

Analysis of BUT Submission in Far-Field Scenarios of VOiCES 2019 Challenge

Cited by 17 publications

References 19 publications

An MAP Estimation for Between-Class Variance

An MAP Estimation for Between-Class Variance

Aggregating Frame-Level Information in the Spectral Domain With Self-Attention for Speaker Embedding

Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances

Contact Info

Product

Resources

About