ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9413421

An Effective Deep Embedding Learning Method Based on Dense-Residual Networks for Speaker Verification

Abstract: In this paper, we present an effective end-to-end deep embedding learning method based on Dense-Residual networks, which combine the advantages of a densely connected convolutional network (DenseNet) and a residual network (ResNet), for speaker verification (SV). Unlike a model ensemble strategy which merges the results of multiple systems, the proposed Dense-Residual networks perform feature fusion on every basic DenseR building block. Specifically, two types of DenseR blocks are designed. A sequential-DenseR…
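The abstract describes DenseR blocks that fuse DenseNet-style feature concatenation with a ResNet-style shortcut. The snippet below is a minimal, hypothetical PyTorch sketch of a sequential Dense-Residual block under that reading; the `SequentialDenseRBlock` name, layer widths, and exact fusion order are illustrative assumptions, not the authors' implementation (the parallel variant mentioned in the abstract is not shown here).

```python
# Hypothetical sketch of a sequential Dense-Residual (DenseR) block:
# dense concatenation of intermediate features, followed by a residual
# shortcut around the whole block. Layer widths are illustrative only.
import torch
import torch.nn as nn


class SequentialDenseRBlock(nn.Module):
    def __init__(self, channels: int, growth: int = 32):
        super().__init__()
        # Two dense layers; each sees the concatenation of all earlier outputs.
        self.conv1 = nn.Conv2d(channels, growth, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels + growth, growth, kernel_size=3, padding=1)
        # 1x1 transition maps the concatenated features back to `channels`
        # so the residual addition is shape-compatible.
        self.transition = nn.Conv2d(channels + 2 * growth, channels, kernel_size=1)
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f1 = self.relu(self.conv1(x))
        f2 = self.relu(self.conv2(torch.cat([x, f1], dim=1)))
        fused = self.transition(torch.cat([x, f1, f2], dim=1))
        # ResNet-style shortcut: add the block input to the fused features.
        return self.relu(self.bn(fused) + x)


if __name__ == "__main__":
    block = SequentialDenseRBlock(channels=64)
    spec = torch.randn(8, 64, 80, 200)  # (batch, channels, freq, time)
    print(block(spec).shape)  # torch.Size([8, 64, 80, 200])
```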

Cited by 12 publications (3 citation statements) · References 15 publications
“…These features mainly included the X-vector learned by a Time-Delay Neural Network (TDNN) [19]-[23] or an Emphasized Channel Attention, Propagation and Aggregation in TDNN (ECAPA-TDNN) [24]; the R-vector learned by a Residual Network with 34 layers (ResNet34) [25]; the S-vector learned by a Transformer [26]. In addition, other kinds of neural networks were adopted to learn deep embeddings [27]-[35], such as temporal dynamic convolutional neural network [31], Attentive Multi-scale Convolutional Recurrent Network (AMCRN) [33], Siamese neural network [34], and long short-term memory network [35].…”
Section: Related Work (mentioning)
confidence: 99%
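The related-work statement above groups speaker embeddings by the frame-level encoder that produces them (TDNN/ECAPA-TDNN, ResNet34, Transformer). Common to these x-/r-/s-vector pipelines is pooling variable-length frame features into one fixed-length utterance embedding; the snippet below is a minimal, hypothetical sketch of that statistics-pooling step, not any specific system from the cited papers, and the dimensions are assumptions.

```python
# Hypothetical sketch of statistics pooling: variable-length frame-level
# features (e.g., TDNN outputs) become a fixed-length speaker embedding.
import torch
import torch.nn as nn


class StatsPoolingEmbedder(nn.Module):
    def __init__(self, feat_dim: int = 512, embed_dim: int = 192):
        super().__init__()
        # Mean and standard deviation over time are concatenated, then projected.
        self.embedding = nn.Linear(2 * feat_dim, embed_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, feat_dim) frame-level features from any encoder.
        mean = frames.mean(dim=1)
        std = frames.std(dim=1)
        return self.embedding(torch.cat([mean, std], dim=1))


if __name__ == "__main__":
    pooled = StatsPoolingEmbedder()(torch.randn(4, 300, 512))
    print(pooled.shape)  # torch.Size([4, 192])
```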
“…For example, Gupta et al. [14] proposed a residual connection between the input and the output of the inception block to maintain continuity between the previously observed pose and the next predicted pose. Combining the advantages of DenseNet and ResNet through dense connections, Liu et al. [15] designed sequential Dense-Residual blocks and parallel Dense-Residual blocks for feature fusion. A multimodal emotion recognition model based on feature fusion and residual connections was proposed by Du et al. [16]; it uses Bi-LSTM and multi-head attention to extract key features from voice, text, and video, and incorporates a residual connection mechanism to prevent the gradient from vanishing.…”
Section: Residual Connection (mentioning)
confidence: 99%
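The statement above contrasts several ways of wiring residual connections for feature fusion. To complement the sequential block sketched earlier, the snippet below gives a minimal, hypothetical rendering of a parallel Dense-Residual block, in which a dense-concatenation path and a plain residual path run side by side and are then merged; the structure and names are assumptions for illustration, not the layout from [15].

```python
# Hypothetical sketch of a parallel Dense-Residual block: a dense path
# (concatenation of intermediate features) and a residual path run in
# parallel, and their outputs are fused by element-wise addition.
import torch
import torch.nn as nn


class ParallelDenseRBlock(nn.Module):
    def __init__(self, channels: int, growth: int = 32):
        super().__init__()
        # Dense path: each layer consumes the concatenation of earlier outputs.
        self.dense1 = nn.Conv2d(channels, growth, kernel_size=3, padding=1)
        self.dense2 = nn.Conv2d(channels + growth, channels, kernel_size=3, padding=1)
        # Residual path: a convolutional branch with an identity shortcut.
        self.res = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        d1 = self.relu(self.dense1(x))
        dense_out = self.relu(self.dense2(torch.cat([x, d1], dim=1)))
        res_out = self.relu(self.res(x)) + x
        # Fuse the two parallel paths.
        return dense_out + res_out


if __name__ == "__main__":
    y = ParallelDenseRBlock(channels=64)(torch.randn(2, 64, 80, 200))
    print(y.shape)  # torch.Size([2, 64, 80, 200])
```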
“…The main idea of the attention mechanism is to introduce a dynamic weighting mechanism into the network that allows the model to apply different weights to different parts of the input, so that the network focuses on important information and ignores irrelevant information [2,10,11]. On the other hand, a traditional deep neural network may suffer from information loss and model degradation as its depth increases, so residual learning is introduced, where multi-scale and multi-level information is integrated through residual connections [9], [14]-[16].…”
mentioning
confidence: 99%
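The statement above summarizes attention as a dynamic weighting over parts of the input. The snippet below is a minimal, hypothetical sketch of that idea as attentive pooling over time frames: a learned scorer weights each frame before aggregation. The softmax scoring layer and dimensions are illustrative assumptions, not a mechanism taken from the cited works.

```python
# Hypothetical sketch of attention as dynamic weighting: a learned scorer
# assigns a weight to each time frame, and the weighted sum emphasizes
# informative frames while down-weighting irrelevant ones.
import torch
import torch.nn as nn


class AttentivePooling(nn.Module):
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, 1)  # one scalar score per frame

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, feat_dim)
        weights = torch.softmax(self.scorer(frames), dim=1)  # (batch, time, 1)
        return (weights * frames).sum(dim=1)  # (batch, feat_dim)


if __name__ == "__main__":
    pooled = AttentivePooling()(torch.randn(4, 300, 512))
    print(pooled.shape)  # torch.Size([4, 512])
```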