Impact of Accuracy and Latency on Mean Opinion Scores for Speech Recognition Solutions

Scovell, James; Beltman, Marco; Doherty, Rina; Elnaggar, Rania; Sreerama, Chaitanya

doi:10.1016/j.promfg.2015.07.434

Cited by 4 publications

(3 citation statements)

References 3 publications


“…T_bat indicates the length of the audio samples to be decoded at a time, and is defined during the initialization stage of the ASR server. In this paper, we set T_bat as 200 frames, each of which is 2-s long [14,26,36].…”

Section: Decoder Thread of the Online ASR Server (mentioning)

confidence: 99%


Online Speech Recognition Using Multichannel Parallel Acoustic Score Computation and Deep Neural Network (DNN)-Based Voice-Activity Detector

Park

2020

Applied Sciences


This paper aims to design an online, low-latency, and high-performance speech recognition system using a bidirectional long short-term memory (BLSTM) acoustic model. To achieve this, we adopt a server-client model and a context-sensitive-chunk-based approach. The speech recognition server manages a main thread and a decoder thread for each client and one worker thread. The main thread communicates with the connected client, extracts speech features, and buffers the features. The decoder thread performs speech recognition, including the proposed multichannel parallel acoustic score computation of a BLSTM acoustic model, the proposed deep neural network-based voice activity detector, and Viterbi decoding. The proposed acoustic score computation method estimates the acoustic scores of a context-sensitive-chunk BLSTM acoustic model for the batched speech features from concurrent clients, using the worker thread. The proposed deep neural network-based voice activity detector detects short pauses in the utterance to reduce response latency, while the user utters long sentences. From the experiments of Korean speech recognition, the number of concurrent clients is increased from 22 to 44 using the proposed acoustic score computation. When combined with the frame skipping method, the number is further increased up to 59 clients with a small accuracy degradation. Moreover, the average user-perceived latency is reduced from 11.71 s to 3.09–5.41 s by using the proposed deep neural network-based voice activity detector.



“…During decoding, the minibatch size is set to 2 s. Although a larger minibatch size increases the decoding speed owing to the bulk computation of GPU, the latency also increases. We settle into 2 s of minibatch size as a compromise between decoding speed and latency [14,26,36].…”

Section: Corpus and Baseline Korean ASR (mentioning)

confidence: 99%

Online Speech Recognition Using Multichannel Parallel Acoustic Score Computation and Deep Neural Network (DNN)-Based Voice-Activity Detector

Park

2020

Applied Sciences


“…5) and they cause little extra overhead. According to a study conducted on 47 participants [43], the acceptable latency is 4 s and the acceptable accuracy is 0.70.…”

Section: Implementation and Overhead (mentioning)

confidence: 99%

Hidebehind

Qian

Hou

et al. 2018

Proceedings of the 16th ACM Conference on Embedded Networked Sensor Systems


We are speeding toward a not-too-distant future when we can perform human-computer interaction using solely our voice. Speech recognition is the key technology that powers voice input, and it is usually outsourced to the cloud for the best performance. However, user privacy is at risk because voiceprints are directly exposed to the cloud, which gives rise to security issues such as spoof attacks on speaker authentication systems. Additionally, it may cause privacy issues as well, for instance, the speech content could be abused for user profiling. To address this unexplored problem, we propose to add an intermediary between users and the cloud, named VoiceMask, to anonymize speech data before sending it to the cloud for speech recognition. It aims to mitigate the security and privacy risks by concealing voiceprints from the cloud. VoiceMask is built upon voice conversion but is much more than that; it is resistant to two de-anonymization attacks and satisfies differential privacy. It performs anonymization in resource-limited mobile devices while still maintaining the usability of the cloud-based voice input service. We implement VoiceMask on Android and present extensive experimental results. The evaluation substantiates the efficacy of VoiceMask, e.g., it is able to reduce the chance of a user's voice being identified from 50 people by a mean of 84%, while reducing voice input accuracy no more than 14.2%.

CCS CONCEPTS • Security and privacy → Pseudonymity, anonymity and untraceability; Data anonymization and sanitization; • Human-centered computing → Ubiquitous and mobile computing;
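The citation statement quoted for this paper gives two acceptability thresholds attributed to the cited study: a latency of 4 s and an accuracy of 0.70. As a minimal sketch of how such thresholds might be applied, the Python below checks one measurement against both. The `Measurement` class and `is_acceptable` function are hypothetical names introduced only for illustration, and treating both bounds as inclusive is an assumption that the quoted statement does not settle.

```python
from dataclasses import dataclass

# Thresholds as quoted in the citation statement (4 s latency, 0.70 accuracy).
ACCEPTABLE_LATENCY_S = 4.0
ACCEPTABLE_ACCURACY = 0.70


@dataclass(frozen=True)
class Measurement:
    """Hypothetical record of one speech recognition run."""
    latency_s: float  # response latency in seconds
    accuracy: float   # recognition accuracy as a fraction in [0, 1]


def is_acceptable(m: Measurement) -> bool:
    """Return True when both thresholds are met.

    Assumption: both bounds are inclusive (latency <= 4 s, accuracy >= 0.70).
    """
    return (m.latency_s <= ACCEPTABLE_LATENCY_S
            and m.accuracy >= ACCEPTABLE_ACCURACY)
```

A run is accepted only when it satisfies both conditions, so a fast but inaccurate result is rejected just as a slow but accurate one is.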


Keeping Users in the Flow: Mapping System Responsiveness with User Experience

Doherty

Sorenson

2015

Procedia Manufacturing
