Combining statistical and linguistic models for synthesis of prosodic contours

Ostendorf, Mari; Price, Patti; Shattuck‐Hufnagel, Stefanie; Veilleux, Nanette; Wightman, Colin W.; Garcia, Rudy

doi:10.1121/1.2027539

Cited by 9 publications

(11 citation statements)

References 0 publications


“…Participants listened to two stories (one male, one female speaker) from the Boston University Radio Speech Corpus (for full stimulus transcripts, see Extended Data Table 1-1) (Ostendorf et al, 1995), each once at regular speech rate and once slowed to one-third speech rate. Overall, the stimuli contained 26 paragraphs (each containing 1-4 sentences) of 10-60 s duration, with silent periods of 500-1100 ms inserted between paragraphs to allow measuring onset responses in the MEG without distortion from preceding speech.…”

Section: Speech Stimulus

mentioning

confidence: 99%

Phase Alignment of Low-Frequency Neural Activity to the Amplitude Envelope of Speech Reflects Evoked Responses to Acoustic Edges, Not Oscillatory Entrainment

et al. 2023


The amplitude envelope of speech is crucial for accurate comprehension. Considered a key stage in speech processing, the phase of neural activity in the theta-delta bands (1 - 10 Hz) tracks the phase of the speech amplitude envelope during listening. However, the mechanisms underlying this envelope representation have been heavily debated. A dominant model posits that envelope tracking reflects entrainment of endogenous low-frequency oscillations to the speech envelope. Alternatively, envelope tracking reflects a series of evoked responses to acoustic landmarks within the envelope. It has proven challenging to distinguish these two mechanisms. To address this, we recorded magnetoencephalography while participants (n=12, 6 female) listened to natural speech, and compared the neural phase patterns to the predictions of two computational models: An oscillatory entrainment model and a model of evoked responses to peaks in the rate of envelope change. Critically, we also presented speech at slowed rates, where the spectro-temporal predictions of the two models diverge. Our analyses revealed transient theta phase-locking in regular speech, as predicted by both models. However, for slow speech we found transient theta and delta phase-locking, a pattern that was fully compatible with the evoked response model but could not be explained by the oscillatory entrainment model. Furthermore, encoding of acoustic edge magnitudes was invariant to contextual speech rate, demonstrating speech rate normalization of acoustic edge representations. Taken together, our results suggest that neural phase locking to the speech envelope is more likely to reflect discrete representation of transient information rather than oscillatory entrainment.

Significance statement: Oganian and colleagues probe a highly debated topic in speech perception – the neural mechanisms underlying the cortical representation of the temporal envelope of speech. It is well established that the slow intensity profile of the speech signal, its envelope, elicits a robust brain response that “tracks” these envelope fluctuations. The oscillatory entrainment model posits that envelope tracking reflects phase alignment of endogenous neural oscillations. Here the authors provide evidence for a distinct mechanism. They show that neural speech envelope tracking arises from transient evoked neural responses to rapid increases in the speech envelope. Explicit computational modeling provides direct and compelling evidence that evoked responses are the primary mechanism underlying cortical speech envelope representations, with no evidence for oscillatory entrainment.


“…(Arvaniti & Baltazani 2005), in contrast to the simple high phrase tone in Figure 2. In general, the complexity of the F0 movement (indicating the existence of two targets), the scaling of the high tone, and the perceived boundary strength (Ostendorf et al 1995; Wightman et al 1992; Nespor & Vogel 2007; Pierrehumbert 1980; Beckman & Pierrehumbert 1986) were the main criteria for annotating phrase versus boundary tones. Interestingly, the results further revealed that phrasing was realized differently across the contrast levels in topic constituents, where the presence and type of edge tones varied as shown in Table 2.…”

Section: Results

mentioning

confidence: 99%

“…Note the complex fall rise movement of the contour before the boundary corresponding to an L-H% boundary tone (Arvaniti & Baltazani 2005), in contrast to the simple high phrase tone in Figure 2. In general, the complexity of the F0 movement (indicating the existence of two targets), the scaling of the high tone, and the perceived boundary strength (Ostendorf et al 1995; Wightman et al 1992; Nespor & Vogel 2007; Pierrehumbert 1980; Beckman & Pierrehumbert 1986) were the main criteria for annotating phrase versus boundary tones.…”

mentioning

confidence: 99%

The prosody of correction and contrast

Stavropoulou, Baltazani 2021

Journal of Pragmatics


“…This recognizer lacks a model that imposes the high-level linguistic constraints and assumes that prosody can be determined completely from their syllabic-timed acoustic observations and pre-compiled lexical stress information. Nevertheless, it is successful on labeling pitch accents on the Radio News Corpus [3] with 84% accuracy on accent presence/absence prediction, about 30% higher than the estimated chance level. However, it does not perform well on intonational phrase boundary (IPB) detection: IPB recognition accuracy is only 71%, 12% below the estimated chance level.…”

Section: Introduction

mentioning

confidence: 93%

A maximum likelihood prosody recognizer

Chen, Hasegawa-Johnson, Cohen et al. 2004

Speech Prosody 2004


Automatic prosody recognition (APR) is of fundamental importance for automatic speech understanding. In this paper, we propose a maximum likelihood prosody recognizer consisting of a GMM-based acoustic model that models the distribution of the phone-level acoustic-prosodic observations (pitch, duration and energy) and an ANN-based language model that models the word-level stochastic dependence between prosody and syntax. Our experiments on the Radio News Corpus show that our recognizer is able to achieve 84% pitch accent recognition accuracy and 93% intonational phrase boundary (IPB) recognition accuracy in a leave-one-speaker-out task, which exceeds previously reported results on the same corpus. The same recognizer is tested on a subset of the Switchboard corpus. The accuracies are degraded but still significantly better than the chance levels.
