Interspeech 2018
DOI: 10.21437/interspeech.2018-1339

A GPU-based WFST Decoder with Exact Lattice Generation

Abstract: We describe initial work on an extension of the Kaldi toolkit that supports weighted finite-state transducer (WFST) decoding on Graphics Processing Units (GPUs). We implement token recombination as an atomic GPU operation in order to fully parallelize the Viterbi beam search, and propose a dynamic load balancing strategy for more efficient token passing scheduling among GPU threads. We also redesign the exact lattice generation and lattice pruning algorithms for better utilization of the GPUs. Experiments on t…
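The atomic token recombination mentioned in the abstract can be illustrated with a short device kernel. The following is a minimal sketch, not the authors' implementation: it assumes costs are non-negative (negative log-probabilities), packs the cost into the upper 32 bits of a 64-bit word so that integer ordering matches cost ordering, and performs one atomicMin per arc so that only the cheapest token entering each state survives. All array and kernel names are hypothetical.

```cuda
// Illustrative sketch of atomic token recombination for a GPU Viterbi
// beam search. Not the paper's code; names and layout are hypothetical.
#include <cstdint>

// Pack (cost, arc index) so that 64-bit integer comparison orders by cost
// first. Valid because the IEEE-754 bit pattern of a non-negative float is
// monotonic in its value. The arc index doubles as the back-pointer of the
// surviving token.
__device__ __forceinline__ unsigned long long pack_token(float cost,
                                                          uint32_t arc_idx) {
  return (static_cast<unsigned long long>(__float_as_uint(cost)) << 32) | arc_idx;
}

// Recover the cost of a packed token (used e.g. when computing the beam
// cutoff for the next frame).
__device__ __forceinline__ float unpack_cost(unsigned long long packed) {
  return __uint_as_float(static_cast<uint32_t>(packed >> 32));
}

// One thread per outgoing arc. All arcs entering the same destination state
// are recombined through a single atomicMin, so only the best token survives
// even when many threads reach that state in the same frame.
__global__ void expand_arcs(const uint32_t* arc_dest,   // destination state of each arc
                            const float* arc_cost,      // graph + acoustic cost of each arc
                            const float* src_cost,      // cost of the token feeding each arc
                            unsigned long long* state_tokens,  // per-state best packed token
                            int num_arcs, float beam_cutoff) {
  int arc = blockIdx.x * blockDim.x + threadIdx.x;
  if (arc >= num_arcs) return;

  float new_cost = src_cost[arc] + arc_cost[arc];
  if (new_cost >= beam_cutoff) return;  // beam pruning

  // Atomic recombination: keep the minimum-cost token for this state.
  atomicMin(&state_tokens[arc_dest[arc]],
            pack_token(new_cost, static_cast<uint32_t>(arc)));
}
```

In this sketch, state_tokens would be reset to 0xFFFFFFFFFFFFFFFF before each frame; deciding which arcs each thread processes is the separate scheduling problem that the abstract addresses with dynamic load balancing.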

Cited by 13 publications (11 citation statements) | References 31 publications

“…In practice, rapid execution of ASR decoding is essential for better user experience. Reduction of sequence length [6,5,7] and parallel computing [8,9,10] are mainly investigated for rapid computation of likelihoods and efficient traversal of search space.…”
Section: Introduction (mentioning)
confidence: 99%
“…al., [9] and Chen, et. al., [10] further extended the search algorithm by executing graph traversal on GPU. These studies focused on efficient computation of WFST (Weighted Finite-State Transducer) based decoding.…”
Section: Introduction (mentioning)
confidence: 99%
“…The proposed work is most closely related to and improves upon the first fully GPU-accelerated lattice decoder [20], which maps token passing constructs [13] to GPU. Starting from the single-threaded CPU decoder, we tailored the algorithm to the strengths of the hardware, including avoiding unnecessary synchronization and atomics, and using flat, compact memory structures.…”
Section: Related Work (mentioning)
confidence: 99%
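The "flat, compact memory structures" mentioned in the statement above could take the form of a structure-of-arrays token store. The sketch below is purely illustrative, taken from neither decoder's source: each frame's surviving tokens occupy one contiguous slice of flat device buffers, so back-pointers are plain indices and lattice traceback needs no per-token heap allocation. All names are hypothetical.

```cuda
// Hypothetical flat structure-of-arrays token store for lattice traceback.
#include <cstdint>

struct FlatTokenStore {
  uint32_t* state;        // device array: WFST state reached by each token
  float*    cost;         // device array: accumulated Viterbi cost
  uint32_t* prev_token;   // device array: index of the predecessor token
  uint32_t* frame_begin;  // frame f owns tokens [frame_begin[f], frame_begin[f+1])
  int       num_tokens;   // tokens stored so far across all frames
  int       num_frames;   // frames processed so far
};
```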
“…Across the tested configurations, the GPU decoder outperforms the multithreaded CPU implementation within Kaldi, with a relative speedup ranging between 14x and 18x when compared to a full 20-core Xeon processor. When compared with the current state-of-the art parallel decoder [20], the proposed algorithm decodes between 11x and 41x faster. Table 2.…”
Section: Speed Improvements (mentioning)
confidence: 99%
“…In the inference stage of the E2E speech recognition, prior work such as [20,21,22,23,10,24] uses n-gram LM or NNLM to bias search 2 We also force hypotheses to end in the end of WFST. Fig.…”
Section: Relation To Prior Work (mentioning)
confidence: 99%