Efficient parallel implementation of three‐point viterbi decoding algorithm on CPU, GPU, and FPGA

Li, Rongchun; Dou, Yong; Zou, Dan

doi:10.1002/cpe.3093

Cited by 18 publications

(12 citation statements)

References 19 publications


“…However, our results confirm that the common comparison between serial CPU implementations and GPU implementations is quite misleading (e.g., ): In such scenarios, very large speedups in favor of GPUs are achieved; however, the GPU advantage either disappears completely as soon as the full CPU capacities are utilized or at least becomes considerably smaller (speedups in the single-digit range). This is what we observe for the min‐warping algorithm as well with speedups between 2 and 8.4 in favor of GPUs compared with multi‐core‐SIMD.…”

Section: Discussion

supporting

confidence: 68%

“…There exists a considerable amount of studies in which the performance of CPU and GPU implementations is compared for specific tasks (for example, [3, 7–15]). Even closer to our work are studies which include FPGAs or directly compare FPGAs with GPUs [7, 16–32]. The benchmarked applications are from many different fields like machine learning [9, 10, 15], neural modeling [11], optimization [13, 21], numerical algorithms [7, 26], image and video processing [16, 22, 24–27, 30], computer tomography [20, 31], molecular sequencing [3], financial simulations [19], encryption and decoding [23, 32], or analog circuit simulation [17].…”

Section: Introduction

mentioning

confidence: 99%

“…However, the full capabilities of modern CPUs are often not exploited, which leads to unfair comparisons. Exceptions are studies in which at least multi‐threading is applied to occupy all CPU cores, or in which both multi‐threading and vectorized code are used. In such a setting, Alachiotis et al.…”

Section: Introduction

mentioning

confidence: 99%

“…Third, the number of processing elements implemented in an FPGA design is less than the number of available processing elements in current GPUs. There exists a considerable amount of studies in which the performance of CPU and GPU implementations is compared for specific tasks (for example, [3, 7–15]). Even closer to our work are studies which include FPGAs or directly compare FPGAs with GPUs [7, 16–32]. The benchmarked applications are from many different fields like machine learning [9, 10, 15], neural modeling [11], optimization [13, 21], numerical algorithms [7, 26], image and video processing [16, 22, 24–27, 30], computer tomography [20, 31], molecular sequencing [3], financial simulations [19], encryption and decoding [23, 32], or analog circuit simulation [17]. The contest between CPU and GPU implementations is normally won by the GPU side.…”

mentioning

confidence: 99%


Comparing parallel hardware architectures for visually guided robot navigation

Schenck

Horst

Tiedemann

et al. 2016

Concurrency and Computation


Summary: Local visual homing methods are a family of algorithms for visually guided navigation on mobile robots. Within this family, the so‐called min‐warping algorithm yields very precise results but is rather compute‐intensive. For this reason, we developed several implementations of this algorithm for different parallel hardware architectures (multi‐core CPUs with SIMD extensions, graphics processing units (GPUs), field‐programmable gate array) to arrive at a fast and energy‐efficient solution which is suited for real‐time performance on mobile platforms with limited battery capacity. Because the min‐warping algorithm is also well suited as a general benchmark, we carried out a comprehensive comparison study which includes both speed and real‐power measurements and covers both low‐power processors and high‐end devices. Our findings suggest that field‐programmable gate arrays offer the most energy‐efficient platform for min‐warping in the area of low‐power processors, while GPUs take the lead in the area of high‐end devices. However, as soon as the full capabilities of modern CPUs (like vector execution units and multiple hardware threads) are used, the speedup advantage of GPUs goes down to the single digit range. Copyright © 2016 John Wiley & Sons, Ltd.


“…Although the use of graphics processing units (GPUs) is now de rigueur in applications of neural networks and made easy through toolkits like Theano (Theano Development Team, 2016), there has been little previous work, to our knowledge, on acceleration of weighted finite-state computations on GPUs (Narasiman et al., 2011; Li et al., 2014; Peng et al., 2016; Chong et al., 2009). In this paper, we consider the operations that are most likely to have high speed requirements: decoding using the Viterbi algorithm, and training using the forward-backward algorithm.…”

Section: Introduction

mentioning

confidence: 99%

Decoding with Finite-State Transducers on GPUs

Argueta

Chiang

2017

Proceedings of the 15th Conference of the European Chapter of The Association for Computational Linguistics: Volume 1


Weighted finite automata and transducers (including hidden Markov models and conditional random fields) are widely used in natural language processing (NLP) to perform tasks such as morphological analysis, part-of-speech tagging, chunking, named entity recognition, speech recognition, and others. Parallelizing finite state algorithms on graphics processing units (GPUs) would benefit many areas of NLP. Although researchers have implemented GPU versions of basic graph algorithms, limited previous work, to our knowledge, has been done on GPU algorithms for weighted finite automata. We introduce a GPU implementation of the Viterbi and forward-backward algorithm, achieving decoding speedups of up to 5.2x over our serial implementation running on different computer architectures and 6093x over OpenFST.


Parallel sphere detector algorithm providing optimal MIMO detection on massively parallel architectures

Jozsa

Vidal

Martínez-Zaldívar

et al. 2015

Concurrency and Computation


Multiple-input multiple-output (MIMO) systems have attracted considerable attention in wireless communications because they offer a significant increase in data throughput and link coverage without additional bandwidth requirement or increased transmit power. The price that has to be paid is the increased complexity of hardware components and algorithms. The sphere detector (SD) algorithm solves the problem of maximum likelihood (ML) detection for MIMO channels by significantly reducing the search space of possible solutions. The main drawback of the SD algorithm is in its sequential nature, consequently, running it on massively parallel architectures (MPAs) is very inefficient. In order to overcome the drawbacks of the SD algorithm, a new parallel sphere detector (PSD) algorithm is proposed. It implements a novel hybrid tree search method, where the algorithm parallelism is assured by the efficient combination of depth-first search and breadth-first search algorithms. A path metric-based parallel sorting is employed at each intermediate stage. The PSD algorithm is able to adjust its memory requirements and extent of parallelism to fit a wide range of parallel architectures. Mapping details for MPAs are proposed by giving the details of thread dependent, highly parallel building blocks of the algorithm. Based on the building blocks proposed, a mapping to general-purpose graphics processing unit is provided, and its performance is evaluated. In order to achieve high-throughput, several levels of parallelism are introduced, and different scheduling strategies are considered.

In the first approach the robustness of MIMO is maximized, that is, the probability of error is minimized with the use of space-time codes (STCs). STCs rely on transmitting different representations of the same data stream on different parallel transmit branches, that is, it introduces controlled redundancy in both space and time.

Spatial Multiplexing (SM), the second approach, focuses on maximizing the capacity of a radio link by transmitting independent data streams on different transmit branches simultaneously and within the same frequency band. The price that has to be paid is the increased complexity of detection hardware components and algorithms. The complexity of detection algorithms depends on many factors, such as antenna configuration, modulation order, channel, and coding.

With regard to the bit error rate (BER) performance, the maximum likelihood (ML) detector offers the best BER performance; however, its exponential complexity is not suitable for real-time applications. The SD algorithm has been proposed in the literature to significantly reduce the search space of possible solutions while still providing the ML solution. For a few good examples, refer to [2-4] and [5].

In non-optimal detectors, the complexity of the sphere detector (SD) algorithm is reduced by introducing some approximations such as (i) early termination of the search, (ii) introducing constraints on the maximum number of nodes that the detector algorithm is allowed ...

