Using prosody to improve automatic speech recognition

Vicsi, Klára; Szaszák, György

doi:10.1016/j.specom.2010.01.003

Cited by 36 publications

(17 citation statements)

References 9 publications


“…Since prosody provides essential discourse information that is available only from spoken language, there has been a significant amount of research towards its use for Automatic Speech Recognition (ASR) [5,6,7,8,9] and Spoken Language Understanding (SLU) [10,11,12,13] as well as its impact on ASR errors [14,15]. On a similar motivational basis as our work, Shriberg and Stolcke [13] use prosodic modelling to improve ASR and several subtasks of SLU.…”

Section: Introduction (mentioning)

confidence: 80%

Exploring the Correlation of Pitch Accents and Semantic Slots for Spoken Language Understanding

Stehwien

2016

Interspeech 2016


We investigate the correlation between pitch accents and semantic slots in human-machine speech. Using an automatic pitch accent detector on the ATIS corpus, we find that most words labelled with semantic slots also carry a pitch accent. Most of the pitch accented words that are not associated with a semantic label are still meaningful, pointing towards the speaker's intention. Our findings show that prosody constitutes a relevant and useful resource for spoken language understanding, especially considering the fact that our pitch accent detector does not require any kind of manual transcriptions during testing time.


“…In [14], a HMM approach was proposed, further enhanced by [11], to automatically recover the PP structure of speech utterances. The algorithm involves a modelling step carried out by machine learning for the 7 different PP models in Hungarian for declarative modality (as presented in Table 1 [11]).…”

Section: Phonological Phrasing (mentioning)

confidence: 99%

“…Just like in an ASR system, backtracking is possible at intermittent points if a longer continuous speech stream is processed. Details of the approach, including acoustic feature extraction, training data, parameter settings and exhaustive evaluation for automatic phrasing, stress detection and word-boundary detection were presented in [11], hence the reader is referred to [11] and [14] for more information. Here we briefly mention that precision and recall of phrase boundaries was 0.89 for Hungarian on a read speech corpus (for the operation point characterized by equal precision and recall).…”

Section: Phonological Phrasing (mentioning)

confidence: 99%

A Phonological Phrase Sequence Modelling Approach for Resource Efficient and Robust Real-Time Punctuation Recovery

Moro

Szaszák

2017

Interspeech 2017

Self Cite


For the automatic punctuation of Automatic Speech Recognition (ASR) output, both prosodic and text based features are used, often in combination. Pure prosody based approaches usually have low computation needs, introduce little latency (delay) and they are also more robust to ASR errors. Text based approaches usually yield better performance, they are however resource demanding (both regarding their training and computational needs), often introduce high time latency and are more sensitive to ASR errors. The present paper proposes a lightweight prosody based punctuation approach following a new paradigm: we argue in favour of an all-inclusive modelling of speech prosody instead of just relying on distinct acoustic markers: first, the entire phonological phrase structure is reconstructed, then its close correlation with punctuations is exploited in a sequence modelling approach with recurrent neural networks. With this tiny and easy to implement model we reach performance in Hungarian punctuation comparable to large, text based models for other languages by keeping resource requirements minimal and suitable for real-time operation with low latency.


“…Although, in the literature, we can find many articles regarding the automatic speech segmentation [1][2][3][4][5][6][7], the problems are not completely solved yet. The most common segmentation methods use LPC (Linear Predictive Coding), HMM (Hidden Markov Models), SVM (Support Vector Machine), the cepstrum based methods, and statistical methods.…”

Section: Introduction (mentioning)

confidence: 99%

The automatic segmentation of the vocal signal using predictive neural network

Zbancioc

Feraru

2013

International Symposium on Signals, Circuits and Systems ISSCS2013


Automatic segmentation of the vocal signal precedes the feature extraction stage and, in turn, emotion recognition/classification. Prosodic parameters such as fundamental frequency (F0) and formants (F1-F4), as well as the LPCC and MFCC cepstral coefficients, are extracted only from the vowel areas. The analysis tools of the SROL corpus use a hybrid hierarchical system with four segmentation methods based on the autocorrelation function, the AMDF method, cepstral analysis and the HPS method. Since the performance of this instrument has not yet been satisfactory, we analyzed other segmentation possibilities in order to obtain the best possible segmentation accuracy. The predictive neural network used in this paper is in fact a simple perceptron which can approximate quasi-periodic signals such as vowels with high accuracy. Consonants have noisy properties and are complicated transition processes. The prediction error is higher for consonants than for vowels when a simple neural network architecture is used.
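The contrast described in this abstract, low prediction error on quasi-periodic (vowel-like) segments and higher error on noisy (consonant-like) segments, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the single linear neuron, the normalized LMS update, the predictor order, the step size, and the synthetic test signals (three harmonics of a hypothetical 120 Hz pitch versus Gaussian noise).

```python
import math
import random


def nlms_prediction_error(signal, order=8, step=0.5, epochs=20):
    """Train one linear neuron to predict each sample from the previous
    `order` samples, and return the prediction error energy of the last
    training pass divided by the energy of the predicted samples."""
    weights = [0.0] * order
    squared_error = 0.0
    for _ in range(epochs):
        squared_error = 0.0  # keep only the error of the current pass
        for n in range(order, len(signal)):
            past = signal[n - order:n]
            predicted = sum(w * x for w, x in zip(weights, past))
            error = signal[n] - predicted
            # Normalized LMS update (an illustrative training rule).
            power = sum(x * x for x in past) + 1e-8
            weights = [w + step * error * x / power
                       for w, x in zip(weights, past)]
            squared_error += error * error
    energy = sum(s * s for s in signal[order:])
    return squared_error / energy


SAMPLE_RATE = 8000   # Hz, assumed
PITCH_HZ = 120.0     # hypothetical fundamental frequency
NUM_SAMPLES = 400

# Vowel-like signal: a few harmonics of the pitch, hence quasi-periodic.
vowel_like = [
    sum(math.sin(2 * math.pi * PITCH_HZ * h * n / SAMPLE_RATE) / h
        for h in (1, 2, 3))
    for n in range(NUM_SAMPLES)
]

# Consonant-like signal: white Gaussian noise from a seeded generator.
rng = random.Random(0)
noise_like = [rng.gauss(0.0, 1.0) for _ in range(NUM_SAMPLES)]

vowel_err = nlms_prediction_error(vowel_like)
noise_err = nlms_prediction_error(noise_like)
print(f"vowel-like error: {vowel_err:.4f}, noise-like error: {noise_err:.4f}")
```

Under these assumptions, thresholding the relative prediction error is one plausible way to separate vowel-like from consonant-like frames; the actual decision rule used in the paper is not stated in the abstract.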

