The Speaker and Language Recognition Workshop (Odyssey 2018), 2018
DOI: 10.21437/odyssey.2018-11

Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System

Abstract: In this paper, we explore the encoding/pooling layer and loss function in the end-to-end speaker and language recognition system. First, a unified and interpretable end-to-end system for both speaker and language recognition is developed. It accepts variable-length input and produces an utterance-level result. In the end-to-end system, the encoding layer plays a role in aggregating the variable-length input sequence into an utterance-level representation. Besides the basic temporal average pooling, we introduc…
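
As a rough illustration of what the abstract describes, the following is a minimal sketch (not the authors' code) of temporal average pooling: frame-level features of any length are averaged over time into a fixed-size utterance-level vector. The shapes, names, and use of PyTorch are illustrative assumptions.

import torch

def temporal_average_pooling(frame_features: torch.Tensor) -> torch.Tensor:
    # frame_features: [num_frames, feat_dim] frame-level features from the
    # front-end network; averaging over the time axis yields a fixed-size
    # utterance-level representation regardless of the input length.
    return frame_features.mean(dim=0)

# Utterances of different lengths map to vectors of the same size.
short_utt = torch.randn(237, 512)    # 237 frames, 512-dim features
long_utt = torch.randn(1042, 512)    # 1042 frames, 512-dim features
print(temporal_average_pooling(short_utt).shape)  # torch.Size([512])
print(temporal_average_pooling(long_utt).shape)   # torch.Size([512])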

Cited by 284 publications (266 citation statements). References: 35 publications.

Citation statements (ordered by relevance):
“…An encoding layer is then applied to the top of it to get the utterance level representation. The most common encoding method is the average pooling layer, which aggregates the statistics (i.e., mean, or mean and standard deviation) [1,2].…”
Section: Revisit: Deep Speaker Embedding
confidence: 99%
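
To make the statement above concrete, here is a hedged sketch of the mean-plus-standard-deviation statistics pooling it mentions, assuming frame-level features of shape [num_frames, feat_dim]; the function name and PyTorch usage are illustrative, not taken from [1,2].

import torch

def statistics_pooling(frame_features: torch.Tensor) -> torch.Tensor:
    # frame_features: [num_frames, feat_dim]. Concatenating the per-dimension
    # mean and standard deviation over time gives an utterance-level vector
    # of size 2 * feat_dim.
    mean = frame_features.mean(dim=0)
    std = frame_features.std(dim=0)
    return torch.cat([mean, std], dim=0)

utt = torch.randn(500, 512)            # 500 frames of 512-dim features
print(statistics_pooling(utt).shape)   # torch.Size([1024])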
“…The superiority of deep speaker embedding systems has been shown in text-independent speaker recognition for close-talking [21,22] and far-field scenarios [24,25]. In this paper, we adopt the deep speaker embedding system, which is initially designed for text-independent speaker verification, as the baseline.…”
Section: Model Architecture
confidence: 99%
“…The single-channel network structure is the same as in [22]. There are three main components in this framework.…”
Section: Model Architecture
confidence: 99%
“…One aspect of our study is therefore an attempt to find out how effective these recent developments in speaker verification are for speaker adaptation in TTS. More specifically, we investigate the capability of neural speaker embeddings [16,17,19] to capture and model characteristics of speakers that were unseen during TTS model training. For this purpose, we extend an improved Tacotron system in [28] to a multi-speaker TTS system and conduct systematic analysis to answer the above question.…”
Section: Introduction
confidence: 99%