Interspeech 2019
DOI: 10.21437/interspeech.2019-1103

Variational Attention Using Articulatory Priors for Generating Code Mixed Speech Using Monolingual Corpora

Abstract: Code mixing, the phenomenon where lexical items from one language are embedded in an utterance of another, is relatively frequent in multilingual communities, and speech systems should therefore be able to process such content. However, building a voice capable of synthesizing such content typically requires bilingual recordings from the speaker, which might not always be easy to obtain. In this work, we present an approach for building mixed lingual systems using only monolingual corpora. Specifically we present a w…
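The abstract is truncated above, so the paper's implementation details are not available here. As a rough, hypothetical illustration of the "variational attention" idea named in the title, the sketch below shows a generic attention layer whose context vector parameterizes a Gaussian latent sampled with the reparameterization trick against a standard-normal prior; an articulatory prior would replace that standard-normal term. All module names, dimensions, and the prior choice are assumptions, not the paper's architecture.

```python
# Hypothetical sketch of a variational attention layer (not the paper's code).
# The attention context is treated as a latent variable: the network predicts
# a mean and log-variance, samples via the reparameterization trick, and
# returns the sample plus a KL term for the training objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalAttention(nn.Module):
    def __init__(self, enc_dim=256, dec_dim=256, latent_dim=64):
        super().__init__()
        self.query_proj = nn.Linear(dec_dim, enc_dim)
        self.mu_proj = nn.Linear(enc_dim, latent_dim)       # posterior mean
        self.logvar_proj = nn.Linear(enc_dim, latent_dim)   # posterior log-variance

    def forward(self, decoder_state, encoder_outputs):
        # decoder_state: (batch, dec_dim); encoder_outputs: (batch, time, enc_dim)
        query = self.query_proj(decoder_state).unsqueeze(2)              # (B, E, 1)
        scores = torch.bmm(encoder_outputs, query).squeeze(2)            # (B, T)
        weights = F.softmax(scores, dim=1)                               # attention weights
        context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)  # (B, E)

        mu = self.mu_proj(context)
        logvar = self.logvar_proj(context)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)          # reparameterization

        # KL divergence to a standard-normal prior; an articulatory prior
        # would substitute its own mean/variance statistics here.
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()
        return z, weights, kl

# Example usage with random tensors:
# attn = VariationalAttention()
# z, w, kl = attn(torch.randn(4, 256), torch.randn(4, 50, 256))
```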

Cited by 5 publications (2 citation statements). References: 16 publications.

“…The Gaussian posterior (GP) and i-vector approaches are the implicit approaches, whereas the phoneme posterior sequence (PS) extracted from the n-gram model is an explicit approach. In [16] and [17], the work uses bottleneck features (BNF) extracted from the trained ASR as the language representation and latent features with variational Bayes encoder to perform CSD, CSUD, and LD tasks. In [10], [18], [19] and [13], the works use deep learning architectures like the transformer, deepspeech2, and x-vector with deep clustering to implicitly model the language information for performing SLID and CSUD tasks.…”
Section: Introduction (citation type: mentioning; confidence: 99%)
“…Unsupervised speaker adaptation methods, such as speaker-adaptive TTS model conditioning on neural speaker embedding [17], have also shown promising results for the cross-lingual scenario. Cross-lingual TTS is also the foundation for more interesting applications such as code-mixing speech synthesis [18,19].…”
Section: Introduction (citation type: mentioning; confidence: 99%)