GPU-Accelerated Viterbi Exact Lattice Decoder for Batched Online and Offline Speech Recognition

Luitjens, Justin; Leary, R. Bret; Kaldewey, Tim; Povey, Daniel

doi:10.1109/icassp40776.2020.9054099

Cited by 8 publications

(6 citation statements)

References 20 publications

“…To increase the throughput of RNN-based speech recognizers, Amodei et al, Braun et al, and Oh et al [33][34][35] used several batch processing approaches. In particular, Braun et al and Oh et al [34,35] aimed to accelerate GPU parallelization. Seki et al [36,37] proposed a multiple-utterance multiple-hypothesis vectorized beam search in CTC-attention-based end-to-end speech recognition using a VGG-RNN-based encoder-decoder and showed the decoding throughput increased using a GPU.…”

Section: Literature Review (mentioning)

confidence: 99%

Fast offline transformer‐based end‐to‐end automatic speech recognition for real‐world applications

Park

2021

ETRI Journal

With the recent advances in technology, automatic speech recognition (ASR) has been widely used in real-world applications. The efficiency of converting large amounts of speech into text accurately with limited resources has become more vital than ever. In this study, we propose a method to rapidly recognize a large speech database via a transformer-based end-to-end model. Transformers have improved the state-of-the-art performance in many fields. However, they are not easy to use for long sequences. In this study, various techniques to accelerate the recognition of real-world speeches are proposed and tested, including decoding via multiple-utterance-batched beam search, detecting end of speech based on a connectionist temporal classification (CTC), restricting the CTC-prefix score, and splitting long speeches into short segments. Experiments are conducted with the Librispeech dataset and the real-world Korean ASR tasks to verify the proposed methods. From the experiments, the proposed system can convert 8 h of speeches spoken at real-world meetings into text in less than 3 min with a 10.73% character error rate, which is 27.1% relatively lower than that of conventional systems.

“…• Transcribing the audio file with a pre-trained deep neural network acoustic model and n-gram language model via Kaldi's GPU-based decoder [16]. The output "hypothesis" transcript contains word-level timestamps.…”

Section: Forced Alignment (mentioning)

confidence: 99%

“…Prior work has shown that once the acoustic model is accelerated on a GPU, roughly 90% of the run time will be spent in external language model decoding on the CPU [16]. Therefore, we were concerned that simply accelerating the acoustic model on a GPU would not give us meaningful overall speed up.…”

Section: System Implementation (mentioning)

confidence: 99%
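
The concern in this citation statement is an instance of Amdahl's law: if about 90% of the remaining run time is CPU language model decoding, speeding up everything else yields little. The sketch below works through that arithmetic. Only the 90% share comes from the quoted statement; the function name and the 10x decoder speedup are hypothetical values chosen for illustration.

```python
import math


def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when `accelerated_fraction` of run time gets `factor` times faster."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be in [0, 1]")
    if factor <= 0:
        raise ValueError("factor must be positive")
    untouched = 1.0 - accelerated_fraction
    return 1.0 / (untouched + accelerated_fraction / factor)


# Share of run time spent in CPU language model decoding (from the quoted statement).
DECODER_SHARE = 0.9

# Upper bound if only the non-decoder 10% were made infinitely fast.
bound_without_decoder = amdahl_speedup(1.0 - DECODER_SHARE, math.inf)

# Hypothetical: the decoder itself becomes 10x faster (illustrative factor only).
with_decoder_10x = amdahl_speedup(DECODER_SHARE, 10.0)

print(f"non-decoder work infinitely fast: {bound_without_decoder:.2f}x")
print(f"decoder 10x faster:               {with_decoder_10x:.2f}x")
```

Under these assumptions, the first case is capped at roughly 1.11x, while accelerating the decoder itself moves the overall figure far more, which is the motivation the statement gives for a GPU-based decoder.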

“…These challenges motivated us to use GPU-based external language model decoder in Kaldi [16], for which the GPU runs both the acoustic model inference and language model decoding, without having to save acoustic model logits to disk to be loaded later by a CPU-based language model decoder. We used only 4 NVIDIA T4 GPUs to align all of our data, with each running at 250x real-time-factor (i.e., 250 hours of audio could be transcribed in 1 hour of wall-clock time by one GPU).…”

Section: System Implementationmentioning

confidence: 99%
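
The throughput figure in this statement can be turned into a wall-clock estimate with simple arithmetic. The sketch below assumes throughput scales linearly across GPUs, which the statement does not claim. The 30,000-hour workload is borrowed from the dataset size in the citing paper's abstract purely as an illustrative input; the statement does not say how much audio was actually processed. The function name is hypothetical. Note also that the real-time factor is conventionally defined as processing time divided by audio duration, so "250x" here corresponds to an RTF of 1/250 = 0.004.

```python
def wall_clock_hours(audio_hours: float, speedup_per_gpu: float, num_gpus: int) -> float:
    """Estimated wall-clock hours, assuming perfect linear scaling across GPUs."""
    if speedup_per_gpu <= 0 or num_gpus <= 0:
        raise ValueError("speedup_per_gpu and num_gpus must be positive")
    return audio_hours / (speedup_per_gpu * num_gpus)


SPEEDUP_PER_GPU = 250.0  # from the statement: 250 hours of audio per wall-clock hour per GPU
NUM_GPUS = 4             # from the statement: 4 NVIDIA T4 GPUs

# Conventional real-time factor: processing time / audio duration.
rtf = 1.0 / SPEEDUP_PER_GPU

# Illustrative workload only (assumption): 30,000 hours of audio.
estimate = wall_clock_hours(30_000, SPEEDUP_PER_GPU, NUM_GPUS)

print(f"RTF per GPU: {rtf}")
print(f"Estimated wall-clock time: {estimate} hours")
```

With these inputs the aggregate rate is 1,000 audio hours per wall-clock hour, so the illustrative workload would take about 30 hours.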

The People's Speech: A Large-Scale Diverse English Speech Recognition Dataset for Commercial Usage

Gálvez

Diamos

Ciro

et al. 2021

Preprint

The People's Speech is a free-to-download 30,000-hour and growing supervised conversational English speech recognition dataset licensed for academic and commercial usage under CC-BY-SA (with a CC-BY subset). The data is collected via searching the Internet for appropriately licensed audio data with existing transcriptions. We describe our data collection methodology and release our data collection system under the Apache 2.0 license. We show that a model trained on this dataset achieves a 9.98% word error rate on Librispeech's test-clean test set. Finally, we discuss the legal and ethical issues surrounding the creation of a sizable machine learning corpora and plans for continued maintenance of the project under MLCommons's sponsorship.

A deep learning approach for automatic speech recognition of The Holy Qur’ān recitations

Tantawi

Abushariah

Hammo

2021

Int J Speech Technol
