JParEnt: Parallel entropy decoding for JPEG decompression on heterogeneous multicore architectures

Sodsong, Wasuwee; Jung, Minyoung; Park, Jin-Woo; Burgstaller, Bernd

doi:10.1002/cpe.4111

Cited by 4 publications

(2 citation statements)

References 24 publications


“…Yan et al [3] accelerated JPEG decoding on GPUs by parallelizing the IDCT step using CUDA. Sodsong et al [4] solved the problem of bitstream decoding by letting the host system determine the positions of individual syntax elements (bit sequences). Using this information, the actual decoding can be performed in parallel on the GPU.…”

Section: Related Work (mentioning)

confidence: 99%

Accelerating JPEG Decompression on GPUs

Weißenberger, Schmidt

2021

Preprint


The JPEG compression format has been the standard for lossy image compression for multiple decades, offering high compression rates at minor perceptual loss in image quality. For GPU-accelerated computer vision and deep learning tasks, such as the training of image classification models, efficient JPEG decoding is essential due to limitations in memory bandwidth. As many decoder implementations are CPU-based, decoded image data has to be transferred to accelerators like GPUs via interconnects such as PCI-E, implying decreased throughput rates. JPEG decoding therefore represents a considerable bottleneck in these pipelines. In contrast, efficiency could be vastly increased by utilizing a GPU-accelerated decoder. In this case, only compressed data needs to be transferred, as decoding will be handled by the accelerators. In order to design such a GPU-based decoder, the respective algorithms must be parallelized on a fine-grained level. However, parallel decoding of individual JPEG files represents a complex task. In this paper, we present an efficient method for JPEG image decompression on GPUs, which implements an important subset of the JPEG standard. The proposed algorithm evaluates codeword locations at arbitrary positions in the bitstream, thereby enabling parallel decompression of independent chunks. Our performance evaluation shows that on an A100 (V100) GPU our implementation can outperform the state-of-the-art implementations libjpeg-turbo (CPU) and nvJPEG (GPU) by a factor of up to 51 (34) and 8.0 (5.7), respectively. Furthermore, it achieves a speedup of up to 3.4 over nvJPEG accelerated with the dedicated hardware JPEG decoder on an A100.
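The idea of locating codewords from arbitrary bitstream positions can be illustrated with a toy prefix code. The sketch below is not the algorithm from the abstract above and does not use real JPEG Huffman tables; the code table `CODE` and the helpers `encode`, `codeword_starts`, and `sync_point` are hypothetical names chosen for illustration. It shows only the general observation that a decoder started at a possibly misaligned offset stays aligned once it reaches a true codeword boundary, which is what lets chunks be processed independently.

```python
# Toy prefix code (an assumption for illustration, not a JPEG Huffman table).
CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}
DECODE = {bits: sym for sym, bits in CODE.items()}
MAX_LEN = max(len(bits) for bits in DECODE)


def encode(symbols):
    """Concatenate the codewords of all symbols into one bit string."""
    return "".join(CODE[s] for s in symbols)


def codeword_starts(bits, start):
    """Decode greedily from bit offset `start`; return each codeword's start offset."""
    starts = []
    pos = start
    while pos < len(bits):
        for length in range(1, MAX_LEN + 1):
            if bits[pos:pos + length] in DECODE:
                starts.append(pos)
                pos += length
                break
        else:
            break  # leftover bits do not form a complete codeword
    return starts


def sync_point(bits, start):
    """First offset at which a decoder started at `start` hits a true codeword boundary."""
    true_starts = set(codeword_starts(bits, 0))
    for pos in codeword_starts(bits, start):
        if pos in true_starts:
            return pos
    return None  # never synchronized within this bit string


bits = encode("abcdabcd")
print(sync_point(bits, 4))  # offset where a decoder started at bit 4 realigns
```

In this sketch the true boundaries are obtained by a sequential pass, which a parallel decoder would avoid; it serves only to check where the speculative decoder realigns.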


“…For example, the paper successfully adapts the BLAS algorithms for heterogeneous multi‐GPU, multicore, and multi‐MIC architectures. Also, the paper proposes the JParEnt algorithm, which vastly improves the scalability of existing JPEG decompression algorithms on heterogeneous multicore and manycore architectures…”

Section: Themes Of This Special Issue (mentioning)

confidence: 99%

Foreword to the Special Issue of the workshop on the seventh international workshop on programming models and applications for multicores and manycores (PMAM 2016)

Balaji, Leung

2017

Concurrency and Computation


INTRODUCTION

Rapid advancements in multicore and chip-level multithreading technologies open new challenges and make multicore and manycore systems a part of the computing landscape. From high-end servers to mobile phones, multicores and manycores are steadily entering every aspect of information technology.

However, most programmers are trained in sequential programming, yet most existing parallel programming models are prone to errors such as data races and deadlocks. Therefore, to fully use multicore and manycore hardware, parallel programming models that allow easy transition of sequential programs to parallel programs with good performance and enable development of error-free codes are urgently needed.

THEMES OF THIS SPECIAL ISSUE

This special issue contains research papers addressing the state-of-the-art technologies related to multicore and manycore systems. The set of accepted papers can be organized under the following key themes: Programming Models, Performance Improvements, and Applications.

Programming models

There are several developments in programming models that allow automated parallelization of code, and eliminate, or at least detect, programming errors such as data races. The paper [1] proposes a model where a function calls other functions by using communication channels. This completely eliminates passing states with the callee functions, making the results deterministic. As a result, the underlying hardware can automate parallelization of the code by spawning these callee functions as tasks running concurrently with the parent function as hardware cores become available. In this way, this model allows automatic, data race-free parallelization of existing applications that can scale well on manycore hardware.

As multicore and manycore systems proliferate in the market, it is common to parallelize existing applications with shared memory models, where access to shared variables between threads is managed by synchronization primitives and/or lock-free data mechanisms. However, it is challenging to use these interfaces appropriately. As a result, data races can often occur, and they are difficult to detect and reproduce. Race detectors such as Intel Cilkscreen can be used to detect data races, but they often introduce performance penalties, and give false positives if they are unaware of the underlying lock-free structure semantics. To mitigate this issue, the paper [2] extends the race detector ThreadSanitizer with the semantics of lock-free data structures: the Single-Producer/Single-Consumer (SPSC) and the Multiple-Producer/Multiple-Consumer (MPMC) queues. Experimental results demonstrate that these improvements eliminated 60% of the false-positive warnings and can accurately detect the wrong use of these data race-free structures.

To improve programmability over manycore architectures, several high-level programming models are proposed, such as Kokkos, RAJA, OpenACC, and OpenMP 4.0. The paper [3] benchmarks these programming models against mature low-level programming models CUDA and Ope...


An efficient parallel entropy coding method for JPEG compression based on GPU

Zhu, Yan

2021

J Supercomput

