A new text-independent method for phoneme segmentation

Aversano, Guido; Esposito, Anna; Marinaro, M.

doi:10.1109/mwscas.2001.986241

Cited by 63 publications

(59 citation statements)

References 4 publications

“…Using a linguistically constrained hidden Markov model (HMM) based method, they yield over 85% boundary detection rate in noise-free environments at 20 msec boundary misalignment (tolerance). Aversano et al (2001) introduce a novel approach for text-independent speech segmentation where the preprocessing is based on critical-band perceptual analysis. It results in 74% segmentation accuracy while limiting over-segmentation to a minimum.…”

Section: Phonemic Segmentation (mentioning)

confidence: 99%

Phonemic segmentation using the generalised Gamma distribution and small sample Bayesian information criterion

Almpanidis

Kotropoulos

2008

Speech Communication
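
The citation statement above reports boundary detection rates at a 20 msec tolerance and mentions over-segmentation without defining either measure. As a rough illustration only, the sketch below assumes a greedy one-to-one matching of detected to reference boundaries and one common definition of over-segmentation (detected count divided by reference count, minus one). The function names and both definitions are assumptions made for this example, not taken from the cited papers.

```python
def count_hits(reference, detected, tolerance=0.020):
    """Count reference boundaries matched by a detection within `tolerance` seconds.

    Assumption: greedy one-to-one matching in time order; each detected
    boundary can be credited to at most one reference boundary.
    """
    reference = sorted(reference)
    detected = sorted(detected)
    hits = 0
    j = 0
    for ref in reference:
        # Skip detections that lie too early to match this reference boundary.
        while j < len(detected) and detected[j] < ref - tolerance:
            j += 1
        if j < len(detected) and abs(detected[j] - ref) <= tolerance:
            hits += 1
            j += 1  # consume the detection so it is not counted twice
    return hits


def segmentation_scores(reference, detected, tolerance=0.020):
    """Return (detection_rate, over_segmentation) for boundary times in seconds.

    Assumed definitions (hypothetical, for illustration):
      detection_rate    = hits / number of reference boundaries
      over_segmentation = number of detected / number of reference - 1
    """
    if not reference:
        return 0.0, 0.0
    hits = count_hits(reference, detected, tolerance)
    detection_rate = hits / len(reference)
    over_segmentation = len(detected) / len(reference) - 1.0
    return detection_rate, over_segmentation


if __name__ == "__main__":
    ref = [0.10, 0.25, 0.40, 0.60]        # hand-labelled boundaries (made-up values)
    det = [0.11, 0.26, 0.33, 0.45, 0.61]  # automatically detected boundaries (made-up values)
    rate, over = segmentation_scores(ref, det)
    print(f"detection rate: {rate:.2f}, over-segmentation: {over:.2f}")
```

With these made-up boundary times, three of the four reference boundaries have a detection within 20 msec, and one more boundary is detected than exists in the reference. Reporting both numbers together matters because a detector can raise its hit rate simply by emitting more boundaries.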

“…Aversano et al 2001; Qiao et al 2008; Goldwater et al 2009; Scharenborg et al 2010). Such traditional segmentation measures cannot be used for evaluating phone acquisition because they either do not assign segments to phones (each segment being equally unrelated to all other segments); or because they classify segments in a trivial way: two segments are assumed to be in the same class if their sequences are identical and vice versa.…”

Section: Evaluation Measures For Segmentation (mentioning)

confidence: 99%

A Computational Model of Unsupervised Speech Segmentation for Correspondence Learning

Duran

Schütze

Möbius

et al. 2010

Res on Lang and Comput

In this paper, we develop a new conceptual framework for an important problem in language acquisition, the correspondence problem: the fact that a given utterance has different manifestations in the speech and articulation of different speakers and that the correspondence of these manifestations is difficult to learn. We put forward the Correspondence-by-Segmentation Hypothesis, which states that correspondence is primarily learned by first segmenting speech in an unsupervised manner and then mapping the acoustics of different speakers onto each other. We show that a rudimentary segmentation of speech can be learned in an unsupervised fashion. We then demonstrate that, using the previously learned segmentation, different instances of a word can be mapped onto each other with high accuracy when trained on utterance-label pairs for a small set of words.

“…The center frequencies of the filter bank are uniformly distributed along the Bark scale, whereas the corresponding bandwidths are defined by (2). From these, a vector of Perceptual Critical Band Features (PCBF) [40] is computed as the log-energy of the acoustic signal:…”

Section: Bark (mentioning)

confidence: 99%

Speech-driven facial animation with realistic dynamics

Gutierrez‐Osuna

Kakumanu

Esposito

et al. 2005

IEEE Trans. Multimedia

Self Cite

Abstract: This paper presents an integral system capable of generating animations with realistic dynamics, including the individualized nuances, of three-dimensional (3-D) human faces driven by speech acoustics. The system is capable of capturing short phenomena in the orofacial dynamics of a given speaker by tracking the 3-D location of various MPEG-4 facial points through stereovision. A perceptual transformation of the speech spectral envelope and prosodic cues are combined into an acoustic feature vector to predict 3-D orofacial dynamics by means of a nearest-neighbor algorithm. The Karhunen-Loève transformation is used to identify the principal components of orofacial motion, decoupling perceptually natural components from experimental noise. We also present a highly optimized MPEG-4 compliant player capable of generating audio-synchronized animations at 60 frames/s. The player is based on a pseudo-muscle model augmented with a nonpenetrable ellipsoidal structure to approximate the skull and the jaw. This structure adds a sense of volume that provides more realistic dynamics than existing simplified pseudo-muscle-based approaches, yet it is simple enough to work at the desired frame rate. Experimental results on an audiovisual database of compact TIMIT sentences are presented to illustrate the performance of the complete system. Index Terms: face image analysis and synthesis, lip synchronization, 3-D audio/video processing.
