Interspeech 2016
DOI: 10.21437/interspeech.2016-363
Phoneme Embedding and its Application to Speech Driven Talking Avatar Synthesis

Abstract: Word embedding has made great achievements in many natural language processing tasks. However, attempts to apply word embedding in the field of speech have yielded few breakthroughs, because word vectors mainly carry semantic and syntactic information. Such high-level features are difficult to incorporate directly into speech-related tasks, compared to acoustic or phoneme-related features. In this paper, we investigate a method of phoneme embedding to generate phoneme vectors carrying acoustic infor…

Cited by 7 publications (6 citation statements)
References 11 publications
“…The network details are as follows: pinyin-token embedding size 6, filter size 3×6, and number of filters 100. Unlike [41], which uses one-hot vectors, we choose distributed representation vectors to represent subword units. The experimental results with and without the proposed sub-word unit approach are in Table 4.…”
Section: Results
confidence: 99%
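The quoted setup (token embeddings of size 6 convolved with 3×6 filters, 100 filters, then max-over-time pooling, a common choice for such CNNs though pooling is not stated in the quote) can be sketched in plain numpy. All array values here are random placeholders, not the cited paper's data:

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, emb_dim = 10, 6      # pinyin-token embedding size 6 (from the quote)
n_filters, filt_w = 100, 3    # 100 filters of size 3x6 (from the quote)

tokens = rng.standard_normal((seq_len, emb_dim))          # embedded token sequence
filters = rng.standard_normal((n_filters, filt_w, emb_dim))

# Each 3x6 filter spans the full embedding dimension, so it slides over time only.
feature_maps = np.stack([
    np.array([(tokens[t:t + filt_w] * f).sum()
              for t in range(seq_len - filt_w + 1)])
    for f in filters
])                                 # shape: (100, seq_len - 2)
pooled = feature_maps.max(axis=1)  # max-over-time pooling -> one 100-dim feature
print(pooled.shape)
```

Because each filter covers the whole embedding width, the convolution is effectively 1-D over the token axis, yielding one feature map per filter.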
“…The network details are listed as follow: pingyin-token embedding size 6, filter size 3x6 and numbers of filters 100. Different from [41] using one-hot vector, we choose distributed representation vectors to represent subword units. The experimental results with and without the proposed sub-word unit approach are in Table 4.…”
Section: Resultsmentioning
confidence: 99%
“…In that work, a neural encoder-decoder model (Cho et al., 2014; Sutskever et al., 2014) was trained to transform words from their citation forms to a 'target' inflected form; after training, the vowels in the embedding layer of the long short-term memory (LSTM) model trained for Finnish were found to group clearly according to known harmony patterns in the language. Li et al. (2016) take advantage of phoneme transcriptions in a neural speech synthesis application, showing improvements on this task and indicating that similar phonemes in a bidirectional LSTM (Bi-LSTM) embedding layer map close to each other. More closely related to the current work, Dunbar et al. (2015) investigate how well phonetic feature representations in English align with vector representations learned from local contexts of sound occurrence, using both a neural language model and a matrix factorization model.…”
Section: Related Work
confidence: 94%
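The claim that "similar phonemes map close to each other" is usually checked with cosine similarity between learned embedding vectors. A minimal sketch, using hand-made 4-dimensional toy vectors (hypothetical values for illustration, not learned embeddings):

```python
import numpy as np

# Toy "phoneme embeddings" (hypothetical, for illustration only):
phonemes = {
    "p": np.array([1.0, 0.9, 0.1, 0.0]),
    "b": np.array([0.9, 1.0, 0.2, 0.1]),  # near /p/: both are bilabial stops
    "a": np.array([0.0, 0.1, 1.0, 0.9]),  # vowels occupy a separate region
    "i": np.array([0.1, 0.0, 0.9, 1.0]),
}

def cos(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_pb = cos(phonemes["p"], phonemes["b"])  # high: similar phonemes lie close
sim_pa = cos(phonemes["p"], phonemes["a"])  # low: dissimilar phonemes lie apart
print(sim_pb, sim_pa)
```

With real embeddings, the same pairwise-similarity matrix is what reveals the clustering (e.g. vowel-harmony groups) the quoted work reports.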
“…Few existing works have studied phoneme embeddings. Li et al. [13] explored the application of phoneme embeddings to speech-driven talking avatar synthesis, creating more realistic and expressive visual gestures. Silfverberg et al. [30] proposed an approach to learn phoneme embeddings that can be used to perform phonological analogies.…”
Section: Related Work
confidence: 99%
“…We use word embeddings (300 dimensions) trained on Wikipedia from GloVe [24]. The character embeddings (20-dimensional) are randomly initialized and trainable, whereas the pre-trained phoneme embeddings are not trainable 13 , except the randomly initialized ones (rnd model).…”
Section: Experimental Settings
confidence: 99%