Assessing and improving the performance of speech recognition for incremental systems

Baumann, Timo; Atterer, Michaela; Schlangen, David

doi:10.3115/1620754.1620810

Cited by 27 publications

(29 citation statements)

References 6 publications

“…Our corpus hence totals 1687 utterances, with an average of 5.43 words per utterance (sd 2.36), and a vocabulary of 237 distinct words. We performed the experiments reported below both with manual transcriptions of the utterances as well as with asr transcriptions (for which we used the version of Sphinx4 described in Baumann et al, 2009, with models fine-tuned to this domain, achieving a word error rate of 0.24).…”

Section: Data and Task (mentioning)

confidence: 99%

Situated incremental natural language understanding using Markov Logic Networks

Kennington, Schlangen

2014

Computer Speech & Language

Self Cite

“…In [6], the points at which partial hypotheses are computed are carefully selected to be at times when the ASR either has high confidence in the current word or the language model end of utterance symbol has been reached. In [8], additional right context is included before a partial hypothesis is returned, which introduces a short lag but improves stability.…”

Section: Incremental Dialogue (mentioning)

confidence: 99%

Continuous ASR for flexible incremental dialogue

Breslin, Gašić, Henderson, et al.

2013

2013 IEEE International Conference on Acoustics, Speech and Signal Processing

Spoken dialogue systems provide a convenient way for users to interact with a machine using only speech. However, they often rely on a rigid turn taking regime in which a voice activity detection module is used to determine when the user is speaking and decide when is an appropriate time for the system to respond. This paper investigates replacing the VAD and discrete utterance recogniser of a conventional turn-taking system with a continuously operating recogniser that is always listening, and using the recogniser 1-best path to guide turn taking. In this way, a flexible framework for incremental dialogue management is possible. Experimental results show that it is possible to remove the VAD component and successfully use the recogniser best path to identify user speech, with more robustness to noise, potentially smaller latency times, and a reduction in overall recognition error rate compared to using the conventional approach.

“…Fink et al [7] found that providing more right context (i.e., more acoustic information) could improve accuracy. Likewise, Baumann et al [4] showed that increasing the language model weight of words in the lattice could improve accuracy. Selfridge et al [25] took both of these ideas further and proposed an algorithm that looked for paths in the lattice that either terminated in an end-of-sentence (as deemed by the language model), or converged to a single node.…”

Section: Related Work (mentioning)

confidence: 99%

Voice typing

Kumar, Paek, Lee

2012

Proceedings of the SIGCHI Conference on Human Factors in Computing Systems

Dictation using speech recognition could potentially serve as an efficient input method for touchscreen devices. However, dictation systems today follow a mentally disruptive speech interaction model: users must first formulate utterances and then produce them, as they would with a voice recorder. Because utterances do not get transcribed until users have finished speaking, the entire output appears and users must break their train of thought to verify and correct it. In this paper, we introduce Voice Typing, a new speech interaction model where users' utterances are transcribed as they produce them to enable real-time error identification. For fast correction, users leverage a marking menu using touch gestures. Voice Typing aspires to create an experience akin to having a secretary type for you, while you monitor and correct the text. In a user study where participants composed emails using both Voice Typing and traditional dictation, they not only reported lower cognitive demand for Voice Typing but also exhibited 29% relative reduction of user corrections.
