2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2013.6639241
Weak top-down constraints for unsupervised acoustic model training

Abstract: Typical supervised acoustic model training relies on strong top-down constraints provided by dynamic programming alignment of the input observations to phonetic sequences derived from orthographic word transcripts and pronunciation dictionaries. This paper investigates a much weaker form of top-down supervision for use in place of transcripts and dictionaries in the zero resource setting. Our proposed constraints, which can be produced using recent spoken term discovery systems, come in the form of pairs of is…

Cited by 49 publications (83 citation statements)
References 21 publications
“…Likewise, stripped of any guidance from word transcripts and a pronunciation dictionary, the normal expectation-maximization training procedures for Gaussian mixture-based acoustic models are no longer capable of identifying speaker-independent phonetic categories in a purely bottom-up fashion [24]. With these considerations in mind, an explicit goal of the workshop was to evaluate a variety of acoustic front-ends and unsupervised acoustic modeling strategies, both in isolation and in combination, for suitability in downstream zero resource technologies.…”
Section: Speaker Independence of Acoustic Features and Unsupervised Modeling
confidence: 99%
“…This is a much weaker form of supervision, but it comes at little cost. The approach considered in the workshop, described in detail in [24], consists of four steps: (1) training a 1024-component Gaussian mixture model (GMM) on a large sample of in-domain audio, which serves as a sort of universal background model (UBM) for all speech sounds; (2) running a spoken term discovery system across the speech collection to produce a collection of word or phrase segment pairs and computing UBM posteriorgrams for each segment; (3) performing a DTW alignment of the acoustic frames of each word segment pair and using the frame-level correspondences to construct a similarity matrix over UBM components; and (4) partitioning the UBM Gaussian components with spectral clustering [26] and using each subset to define a subword unit GMM.…”
Section: Spectral Smoothing and Top-Down Lexical Constraints
confidence: 99%
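The four steps quoted above lend themselves to a compact illustration. Below is a minimal sketch in Python, assuming NumPy and scikit-learn, with synthetic stand-in data: the component count, feature dimensionality, segment pair, and cluster count are illustrative assumptions, not the workshop configuration (which used a 1024-component UBM and segment pairs supplied by a real spoken term discovery system).

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.cluster import SpectralClustering

def dtw_path(A, B):
    """Plain DTW over two posteriorgram sequences (frames x components),
    using negative inner product as the local cost. Returns the aligned
    (i, j) frame index pairs on the optimal path."""
    n, m = len(A), len(B)
    cost = -A @ B.T  # more similar posteriors -> lower local cost
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    path, i, j = [], n, m
    while i > 0 and j > 0:  # backtrace from the end of both segments
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Step 1: UBM-style GMM over all in-domain frames (toy sizes here).
n_components = 64
frames = np.random.randn(5000, 13)   # stand-in for acoustic feature frames
ubm = GaussianMixture(n_components=n_components).fit(frames)

# Step 2: posteriorgrams for a discovered segment pair. The pair is
# simulated here; a spoken term discovery system would supply it.
seg_a, seg_b = frames[:80], frames[40:130]
post_a, post_b = ubm.predict_proba(seg_a), ubm.predict_proba(seg_b)

# Step 3: DTW-align the pair and accumulate frame-level co-occurrence
# counts into a similarity matrix over UBM components.
S = np.zeros((n_components, n_components))
for i, j in dtw_path(post_a, post_b):
    S += np.outer(post_a[i], post_b[j])
S = (S + S.T) / 2.0  # symmetrize for use as a graph affinity

# Step 4: partition the UBM components by spectral clustering; each
# cluster of Gaussians defines one subword unit GMM.
labels = SpectralClustering(n_clusters=8,
                            affinity='precomputed').fit_predict(S + 1e-6)
print({u: int(np.sum(labels == u)) for u in range(8)})
```

In the full pipeline, step 3 would accumulate these co-occurrence statistics over the entire collection of discovered pairs before the single spectral partition in step 4, rather than over the one pair shown here.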
“…In the first case, acoustic models are inferred directly from the acoustic features [14]–[21]. The second approach is to first segment the speech into syllable- or word-like units, and afterwards break these units into smaller subword units [7], [13], [19], [22]–[29].…”
Section: Introduction
confidence: 99%
“…The states from each HMM are then clustered based on the similarity of their distributions, to form subword unit candidates. A related approach is taken in [22], where instead of HMM states, components from a GMM trained on speech frames are clustered based on co-occurrence in pairs of fragments obtained from UTD. A neural network referred to as the ABnet, based on siamese networks [30], is introduced in [25].…”
Section: Introduction
confidence: 99%
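For concreteness, here is a minimal sketch of the siamese idea behind an ABnet-style model, assuming PyTorch. The encoder architecture, the simplified contrastive cosine loss, and the synthetic frame pairs are all illustrative assumptions rather than the actual ABnet design, which trains on DTW-aligned frames drawn from discovered word pairs.

```python
import torch
import torch.nn as nn

class SiameseEmbedder(nn.Module):
    """One shared encoder applied to both frames of a pair, so that
    same-word frames map to nearby embeddings."""
    def __init__(self, in_dim=39, emb_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, emb_dim))

    def forward(self, x):
        return self.net(x)

def pair_loss(e1, e2, same):
    """Simplified contrastive cosine loss: pull 'same' pairs together,
    push 'different' pairs apart. A stand-in for the ABnet objective,
    not the exact loss from the paper."""
    cos = nn.functional.cosine_similarity(e1, e2)
    return torch.where(same, 1.0 - cos, torch.clamp(cos, min=0.0)).mean()

# One toy training step on synthetic frame pairs; real input would be
# DTW-aligned frames from UTD-discovered word pairs.
model = SiameseEmbedder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x1, x2 = torch.randn(32, 39), torch.randn(32, 39)
same = torch.rand(32) < 0.5  # pair labels: same word vs. different word
loss = pair_loss(model(x1), model(x2), same)
opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))
```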