2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
DOI: 10.1109/asru.2017.8269013
Deep learning methods for unsupervised acoustic modeling — Leap submission to ZeroSpeech challenge 2017

Cited by 13 publications (13 citation statements)
References 8 publications
“…[59] first clustered speech frames, then trained a neural network to predict the cluster IDs and used its hidden representation as features. [60] extended this scheme with features discovered by an autoencoder trained on MFCCs.…”
Section: Related Work
confidence: 99%
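The cluster-then-predict scheme quoted above can be sketched in a few lines. This is a minimal illustration using scikit-learn with synthetic data standing in for MFCC frames; the frame count, cluster count, and hidden-layer width are illustrative choices, not values from the cited papers.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# Stand-in for MFCC frames: 500 frames x 13 coefficients (synthetic data).
frames = rng.normal(size=(500, 13))

# Step 1: unsupervised clustering of frames; cluster IDs act as pseudo-labels.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(frames)
pseudo_labels = kmeans.labels_

# Step 2: train a neural network to predict the cluster IDs.
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                    random_state=0).fit(frames, pseudo_labels)

# Step 3: use the hidden-layer activations as the learned features.
def hidden_features(x):
    # Forward pass through the first (hidden) layer only: ReLU(x W + b),
    # read off from the trained network's weights.
    return np.maximum(0, x @ mlp.coefs_[0] + mlp.intercepts_[0])

feats = hidden_features(frames)
print(feats.shape)  # (500, 32)
```

On real data the frames would be MFCCs extracted from untranscribed speech, and the hidden representation would replace the raw acoustic features downstream.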
“…A DNN model is typically trained using available speech data. The learned features are obtained either from a designated low-dimensional hidden layer of the DNN, known as bottleneck features (BNFs) [12], or from the softmax output layer, known as posterior features or a posteriorgram [13]. To facilitate supervised training of the DNN, target labels for the training speech are needed.…”
confidence: 99%
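The two feature types named in this excerpt can be illustrated with a toy forward pass. This is a sketch with random weights, not a trained model: layer sizes, the 5-dim bottleneck, and the 8-class output are all assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative layer sizes: 13-dim input, wide hidden layers, a 5-dim
# bottleneck, and a softmax output over a hypothetical 8 target labels.
dims = [13, 64, 5, 64, 8]          # index 2 of dims is the bottleneck layer
weights = [rng.normal(scale=0.1, size=(a, b)) for a, b in zip(dims, dims[1:])]

def forward(x):
    # Return the activations of every layer so features can be read
    # from any of them.
    acts = []
    h = x
    for i, w in enumerate(weights):
        h = h @ w
        if i < len(weights) - 1:
            h = np.maximum(0, h)   # ReLU on hidden layers
        acts.append(h)
    return acts

frame = rng.normal(size=(1, 13))
acts = forward(frame)

bnf = acts[1]                      # bottleneck features: the low-dim layer
logits = acts[-1]
posteriorgram = np.exp(logits) / np.exp(logits).sum()  # softmax posteriors
print(bnf.shape, posteriorgram.shape)  # (1, 5) (1, 8)
```

The bottleneck features come from the narrow hidden layer, while the posteriorgram is the softmax distribution over the training targets, exactly the two extraction points the excerpt distinguishes.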
“…One of the possible approaches is based on unsupervised clustering of training speech. The frame-level cluster indices can be used as target labels for DNN training [11]-[13]. Another approach seeks to use pre-trained out-of-domain ASR systems to tokenize untranscribed in-domain speech, so that each frame is assigned an ASR senone label [5], [14].…”
confidence: 99%
“…Thus, the F-scores presented are not directly related to the precision and recall scores. Ansari et al. [25] combine two sets of features, also trained on all five languages. The first set of features is always a high-dimensional hidden layer from an autoencoder trained on the MFCC frames.…”
Section: Track
confidence: 99%
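The autoencoder-feature idea in this excerpt can be sketched with scikit-learn's `MLPRegressor` trained to reconstruct its own input; the hidden layer then serves as one feature set, concatenated with a second. Synthetic data stands in for MFCC frames, and the 64-dim hidden width is an assumption for illustration, not the cited system's configuration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
frames = rng.normal(size=(300, 13))   # stand-in MFCC frames (synthetic)

# Autoencoder sketch: a regressor trained to reconstruct its own input;
# the single hidden layer is wider than the input ("high-dimensional").
ae = MLPRegressor(hidden_layer_sizes=(64,), max_iter=300,
                  random_state=0).fit(frames, frames)

def encode(x):
    # Hidden-layer activations ReLU(x W + b) with the learned encoder weights.
    return np.maximum(0, x @ ae.coefs_[0] + ae.intercepts_[0])

ae_feats = encode(frames)                 # first feature set (64-dim)
combined = np.hstack([ae_feats, frames])  # combine with a second feature set
print(combined.shape)  # (300, 77)
```

Concatenation is one simple way to combine two feature sets; the cited work's actual combination scheme may differ.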