Interspeech 2020
DOI: 10.21437/interspeech.2020-2512
Attention-Based Speaker Embeddings for One-Shot Voice Conversion

Cited by 9 publications (12 citation statements)
References 12 publications
“…To obtain phoneme-level style information, the key idea is an attention mechanism [21] relating style to content [22,23]. Our approach similarly assumes that the style information is related to the content, so instead of using only a fixed-length vector to represent the style of the whole utterance, the style information should depend on the content and change over time.…”
Section: Style To Phoneme Attention
confidence: 99%
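The statement above describes obtaining time-varying style by letting content attend to a style reference. A minimal sketch of that idea, assuming plain scaled dot-product attention with content frames as queries and reference-style frames as keys/values (function name, feature dimensions, and frame counts are illustrative, not taken from the cited papers):

```python
# Hedged sketch: content frames query style frames, so the extracted style
# vector varies over time with the content instead of being a single
# fixed-length utterance-level embedding.
import torch
import torch.nn.functional as F

def content_to_style_attention(content, style, dim=128):
    """content: (T_c, dim) source content features
       style:   (T_s, dim) reference (target-style) features
       returns: (T_c, dim) time-varying style embedding aligned to content."""
    scores = content @ style.T / dim ** 0.5   # (T_c, T_s) content-style similarity
    weights = F.softmax(scores, dim=-1)       # attention over reference frames
    return weights @ style                    # per-frame mixture of style frames

# toy usage with random features
content = torch.randn(200, 128)   # e.g. 200 content frames
style = torch.randn(150, 128)     # e.g. 150 reference frames
phoneme_level_style = content_to_style_attention(content, style)
print(phoneme_level_style.shape)  # torch.Size([200, 128])
```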
“…If the performance of VC is not precise enough, voice augmentation for VC is possible. Different augmentation techniques for VC have been proposed, such as attention-based speaker embeddings for one-shot VC and data augmentation-based non-parallel VC [27]-[28].…”
Section: Introduction
confidence: 99%
“…Based on the U-net [29] structure, Li et al [30], Wu et al [24], and Li et al [28] extract utterance-level speaker representations from multiple stacking layers and feed them to the corresponding decoder layers. To access time-varying speaker information, some studies [25]-[27] extract a variable-length speaker representation and fuse it into the converted speech according to the content-based alignment between source speech and target speaker speech.…”
Section: Introduction
confidence: 99%
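The variable-length speaker representation mentioned above can be illustrated with a hedged sketch: source content frames are aligned to the target speaker's frames by attention, and the aligned speaker features are fused back into the content before decoding. The module, fusion-by-concatenation choice, and dimensions are hypothetical, not taken from any specific cited model:

```python
# Hedged sketch of time-varying speaker information fused via content-based
# alignment between source speech and target-speaker speech.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeVaryingSpeakerFusion(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)  # fuse content with aligned speaker info

    def forward(self, src_content, tgt_content, tgt_speaker):
        # src_content: (T_src, dim); tgt_content, tgt_speaker: (T_tgt, dim)
        scale = src_content.size(-1) ** 0.5
        align = F.softmax(src_content @ tgt_content.T / scale, dim=-1)  # (T_src, T_tgt)
        spk_per_frame = align @ tgt_speaker        # variable-length speaker representation
        fused = torch.cat([src_content, spk_per_frame], dim=-1)
        return self.proj(fused)                    # decoder input carrying per-frame speaker info

# toy usage with random features
fusion = TimeVaryingSpeakerFusion()
out = fusion(torch.randn(180, 128), torch.randn(160, 128), torch.randn(160, 128))
print(out.shape)  # torch.Size([180, 128])
```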
“…Moreover, speech production research [31] shows that the frequency distributions of speech from different speakers lead to varying speaker timbre information across frequency channels [32]-[34]. Speech content, such as vowels, consonants, and para-linguistic features, carries distinct speaker timbre information reflected in the temporal and frequency-channel dimensions, while silent speech segments apparently convey no speaker timbre information [27]. On the other hand, the human speech production mechanism is hierarchical [31], [35] in nature, from long-term airflow generation to fine-grained phoneme-related articulator movements and vocal filtering.…”
Section: Introduction
confidence: 99%