2018 IEEE Spoken Language Technology Workshop (SLT)
DOI: 10.1109/slt.2018.8639553
Phonetic-and-Semantic Embedding of Spoken Words with Applications in Spoken Content Retrieval

Abstract: Word embedding or Word2Vec has been successful in offering semantics for text words learned from the context of words. Audio Word2Vec was shown to offer phonetic structures for spoken words (signal segments for words) learned from signals within spoken words. This paper proposes a two-stage framework to perform phonetic-and-semantic embedding on spoken words considering the context of the spoken words. Stage 1 performs phonetic embedding with speaker characteristics disentangled. Stage 2 then performs semantic…

Cited by 31 publications (26 citation statements)
References 48 publications
“…Recent work has explored a number of acoustic and acoustically grounded word embedding approaches. Several approaches have been developed for learning acoustic word embedding models (functions mapping arbitrary-duration spoken word signals to fixed-dimensional vectors) so as to encode either phonetic [10, 11, 20-24] or semantic [25] information, or both [26].…”
Section: Acoustically Grounded Word Embeddings
confidence: 99%
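The statement above describes an acoustic word embedding as a function from a variable-length spoken word signal to a fixed-dimensional vector. As a minimal illustration of that interface (not the paper's trained model), the sketch below uses a fixed random linear projection plus mean pooling over time in place of a learned encoder; the frame dimension (39, as for MFCCs with deltas) and embedding size (64) are arbitrary choices for the example.

```python
import numpy as np

def embed_segment(frames: np.ndarray, dim: int = 64, seed: int = 0) -> np.ndarray:
    """Map a variable-length sequence of acoustic frames (T x F) to a
    fixed-dimensional vector. Here: a fixed random linear projection
    followed by mean pooling over time -- a toy stand-in for a trained
    encoder, kept only to show the input/output shapes involved."""
    rng = np.random.default_rng(seed)
    n_feats = frames.shape[1]
    proj = rng.standard_normal((n_feats, dim)) / np.sqrt(n_feats)
    return (frames @ proj).mean(axis=0)  # pool over time -> (dim,)

# Two "spoken words" of different durations map to same-size vectors.
a = embed_segment(np.random.default_rng(1).standard_normal((120, 39)))
b = embed_segment(np.random.default_rng(2).standard_normal((80, 39)))
assert a.shape == b.shape == (64,)
```

The key property, reflected in the assertion, is that segments of any duration land in the same fixed-dimensional space, so they can be compared directly.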
“…Recent studies have therefore explored an alignment-free methodology where an arbitrary-length speech segment is embedded in a fixed-dimensional space such that segments of the same word type have similar embeddings [17-25]. Segments can then be compared by simply calculating a distance in this acoustic word embedding space.…”
Section: Introduction
confidence: 99%
“…Several supervised and unsupervised acoustic embedding methods have been proposed. Supervised methods include convolutional [11-13] and recurrent neural network (RNN) models [14-17], trained with discriminative classification and contrastive losses. Unsupervised methods include using distances to a fixed reference set [10] and unsupervised autoencoding RNNs [18-20].…”
Section: Introduction
confidence: 99%
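The contrastive losses mentioned above generally pull embeddings of same-word pairs together and push different-word pairs apart by some margin. The exact form varies across the cited papers; the sketch below shows one common margin-based variant with invented example vectors, purely to make the objective concrete.

```python
import numpy as np

def contrastive_loss(anchor: np.ndarray,
                     same_word: np.ndarray,
                     diff_word: np.ndarray,
                     margin: float = 0.5) -> float:
    """Illustrative margin-based contrastive objective: zero when the
    same-word embedding is closer to the anchor than the different-word
    embedding by at least `margin`, positive otherwise."""
    d_pos = np.linalg.norm(anchor - same_word)   # want this small
    d_neg = np.linalg.norm(anchor - diff_word)   # want this large
    return max(0.0, margin + d_pos - d_neg)

a = np.array([0.0, 1.0])   # anchor embedding
p = np.array([0.1, 0.9])   # same word type, close
n = np.array([2.0, -1.0])  # different word type, far
assert contrastive_loss(a, p, n) == 0.0  # already well separated
```

Training a supervised acoustic embedding model amounts to minimizing such a loss over many (anchor, same-word, different-word) triples so that word identity is reflected by distance in the embedding space.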