Attention-Based Models for Text-Dependent Speaker Verification

Chowdhury, Farhan Asif; Wang, Quan; Moreno, Ignacio López; Wan, Li

doi:10.48550/arxiv.1710.10470

Cited by 20 publications

(27 citation statements)

References 5 publications

“…Attention mechanism for speaker verification has been investigated in recent papers. In [26], several methods were proposed for using attention in an LSTM-based text-dependent speaker verification. A slightly different strategy for adding attention to the x-vector topology was proposed in [27] while single and multi-head attentions were investigated for TI-SV.…”

Section: Using Two Types Of Attention (mentioning)

confidence: 99%

“…Here, we only consider single-head attention in two modes. The first one is the same as [27] while for the second one we doubled the size of last hidden layer before pooling and equally split its dimension into two parts like [26] and use the first part for calculating attention weights (i.e. keys) and the second part for calculating mean and standard deviation statistics (i.e.…”

Section: Using Two Types Of Attention (mentioning)

confidence: 99%
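The second mode quoted above (split the pre-pooling layer in two, use one half for attention weights and the other for mean and standard deviation) can be sketched roughly as follows. This is a minimal illustration, not the cited authors' implementation: the function name `split_attentive_stats_pooling`, the name `values` for the second half, and the dot-product scoring with a learned vector `key_weights` are all assumptions made here for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def split_attentive_stats_pooling(frames, key_weights):
    """Pool T frame vectors of dimension 2*D into one vector of dimension 2*D.

    frames:      list of T lists, each of even length 2*D.
    key_weights: hypothetical learned vector of length D used to score the
                 key half of each frame (the scoring form is an assumption).
    Returns (attention weights, weighted mean concatenated with weighted std).
    """
    half = len(frames[0]) // 2
    keys = [f[:half] for f in frames]      # first half: attention keys
    values = [f[half:] for f in frames]    # second half: pooled statistics

    # One scalar score per frame, normalized over time.
    scores = [sum(w * k for w, k in zip(key_weights, key)) for key in keys]
    alphas = softmax(scores)

    # Attention-weighted mean and standard deviation of the value half.
    mean = [sum(a * v[d] for a, v in zip(alphas, values)) for d in range(half)]
    var = [
        sum(a * (v[d] - mean[d]) ** 2 for a, v in zip(alphas, values))
        for d in range(half)
    ]
    std = [math.sqrt(max(x, 0.0)) for x in var]
    return alphas, mean + std


# Two frames with identical keys receive equal weights.
frames = [[0.0, 0.0, 1.0, 2.0], [0.0, 0.0, 3.0, 6.0]]
alphas, pooled = split_attentive_stats_pooling(frames, [1.0, 1.0])
print(alphas, pooled)
```

With equal weights the pooled vector reduces to the ordinary mean and standard deviation of the value half, which is a quick sanity check on the weighting.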

How to Improve Your Speaker Embeddings Extractor in Generic Toolkits

Zeinali, Burget, Rohdin et al. 2018

Preprint

Recently, speaker embeddings extracted with deep neural networks became the state-of-the-art method for speaker verification. In this paper we aim to facilitate its implementation on a more generic toolkit than Kaldi, which we anticipate to enable further improvements on the method. We examine several tricks in training, such as the effects of normalizing input features and pooled statistics, different methods for preventing overfitting as well as alternative nonlinearities that can be used instead of Rectifier Linear Units. In addition, we investigate the difference in performance between TDNN and CNN, and between two types of attention mechanism. Experimental results on Speaker in the Wild, SRE 2016 and SRE 2018 datasets demonstrate the effectiveness of the proposed implementation.

“…Inspired by the application of attention mechanism in speech recognition [29], speaker verification [30] and single channel keyword spotting [31], following [17] we incorporate a soft self-attention for projecting K + 1 channels' fbank feature vectors to one channel, so that KWS still takes one channel input vector similarly as the baseline single channel model. For each time-step, we compute a K + 1 dimensional attention weight vector α for input fbank feature vectors z = [z1, z2, .…”

Section: Joint Training With KWS Model (mentioning)

confidence: 99%

End-to-End Multi-Look Keyword Spotting

Xuan et al. 2020

Preprint

The performance of keyword spotting (KWS), measured in false alarms and false rejects, degrades significantly under far-field and noisy conditions. In this paper, we propose a multi-look neural network model for speech enhancement which simultaneously steers to listen to multiple sampled look directions. The multi-look enhancement is then jointly trained with KWS to form an end-to-end KWS model which integrates the enhanced signals from multiple look directions and leverages an attention mechanism to dynamically tune the model's attention to the reliable sources. We demonstrate, on our large noisy and far-field evaluation sets, that the proposed approach significantly improves the KWS performance against the baseline KWS system and a recent beamformer-based multi-beam KWS system.

“…Only LSTM [11] and GRU [12] are used in our experiments for pair comparison. An attention mechanism [13] is applied to come up with an attention weight vector a = {a_1, a_2, a_3, ..., a_T}. Then C is the feature representation for the whole sequential input, which is computed as the weighted sum of h = {h_1, h_2, ..., h_T}.…”

Section: The Baseline Model (mentioning)

confidence: 99%
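The weighted sum described in the statement above, where C combines the encoder outputs h_1..h_T using attention weights a_1..a_T, can be illustrated with a short sketch. The function name `attention_context` and the use of a softmax over precomputed per-step scores are assumptions for this example; the excerpt does not say how the citing paper computes its scores.

```python
import math


def attention_context(hidden_states, scores):
    """Return (weights a, context C) where C = sum_t a_t * h_t.

    hidden_states: list of T vectors h_t (for example LSTM or GRU outputs).
    scores:        list of T unnormalized scalars, one per time step; how
                   they are produced is left open here (an assumption).
    """
    # Softmax turns the scores into weights that are positive and sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum over time, computed independently for each dimension.
    dim = len(hidden_states[0])
    context = [
        sum(a * h[d] for a, h in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return weights, context


# Equal scores give a plain average of the hidden states.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
a, C = attention_context(h, [0.0, 0.0, 0.0])
print(a, C)
```

When one score dominates, the context vector approaches the corresponding hidden state, which is the behavior that lets the model focus on the most informative frames.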

Sequence-to-sequence Models for Small-Footprint Keyword Spotting

Zhang, Wang 2018

Preprint

In this paper, we propose a sequence-to-sequence model for keyword spotting (KWS). Compared with other end-to-end architectures for KWS, our model simplifies the pipelines of production-quality KWS systems and satisfies the requirements of high accuracy, low latency, and small footprint. We also evaluate the performance of different encoder architectures, which include LSTM and GRU. Experiments on real-world wake-up data show that our approach outperforms the recently proposed attention-based end-to-end model. Specifically, with ∼73K parameters, our sequence-to-sequence model achieves ∼3.05% false rejection rate (FRR) at 0.1 false alarm (FA) per hour.
