Summary
The JPEG format employs Huffman codes to compress the entropy data of an image. Because Huffman codewords are of variable length, the start position of a codeword in the bitstream is known only once the previous codeword has been decoded, which makes parallel entropy decoding a difficult problem. We present JParEnt, a new approach to parallel entropy decoding for JPEG decompression on heterogeneous multicores. JParEnt conducts JPEG decompression in two steps: (1) an efficient sequential scan of the entropy data on the CPU to determine the start positions (boundaries) of coefficient blocks in the bitstream, followed by (2) a parallel entropy decoding step on the graphics processing unit (GPU). The block boundary scan constitutes a reinterpretation of the Huffman‐coded entropy data to determine codeword boundaries in the bitstream. We introduce a dynamic workload partitioning scheme to account for GPUs of low compute power relative to the CPU, a configuration that has become common with the advent of SoCs with integrated graphics processors (IGPs). We leverage additional parallelism through pipelined execution across CPU and GPU. For systems providing a unified address space between CPU and GPU, we employ zero‐copy to completely eliminate the data transfer overhead.
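The two-step idea can be illustrated with a minimal sketch. It is a toy model, not JParEnt's implementation: the prefix code `CODE`, the bit-string representation, and the function names are assumptions made for illustration, and a thread pool merely stands in for GPU threads. As in JPEG, each codeword announces how many raw value bits follow it, so the sequential scan can skip those bits without interpreting them and only records where each block starts.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy prefix code (NOT a real JPEG Huffman table): codeword -> number of raw
# value bits that follow it. A size of 0 acts as the end-of-block marker.
CODE = {"0": 0, "10": 1, "110": 2, "111": 3}


def read_size(bits, pos):
    """Match one codeword starting at pos; return (size, position after it)."""
    for length in (1, 2, 3):
        size = CODE.get(bits[pos:pos + length])
        if size is not None:
            return size, pos + length
    raise ValueError(f"no codeword at bit {pos}")


def scan_block_starts(bits):
    """Step 1 (sequential): find where each block begins, skipping value bits."""
    starts, pos = [], 0
    while pos < len(bits):
        starts.append(pos)
        while True:
            size, pos = read_size(bits, pos)
            if size == 0:   # end-of-block marker reached
                break
            pos += size     # skip the raw value bits without decoding them
    return starts


def decode_block(bits, pos):
    """Step 2 (independent per block): decode one block from its start."""
    values = []
    while True:
        size, pos = read_size(bits, pos)
        if size == 0:
            return values
        values.append(int(bits[pos:pos + size], 2))
        pos += size


def decode_parallel(bits):
    """Run the boundary scan, then decode all blocks concurrently."""
    starts = scan_block_starts(bits)
    with ThreadPoolExecutor() as pool:  # stand-in for GPU threads
        return list(pool.map(lambda s: decode_block(bits, s), starts))
```

For the bit string `"10111010011110100"`, `scan_block_starts` returns `[0, 9, 16]` and `decode_parallel` returns `[[1, 2], [5], []]`. Once the start positions are known, no block depends on another, which is what makes the second step parallelizable.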
Our experimental evaluation of JParEnt was conducted on six heterogeneous multicore systems: one server and two desktops with dedicated GPUs, one desktop with an IGP, and two embedded systems. For a selection of more than 1000 JPEG images, JParEnt outperforms the SIMD implementation of the libjpeg‐turbo library by a factor of up to 4.3, and the previously fastest JPEG decompression method for heterogeneous multicores by a factor of up to 2.2. On average, JParEnt's entropy data scan takes 45% of the time that libjpeg‐turbo spends on entropy decoding. Given this new ratio for the sequential part of JPEG decompression, JParEnt achieves up to 97% of the maximum attainable speedup (95% on average).
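One plausible way to read "maximum attainable speedup" is as an Amdahl-style bound in which the sequential scan is the only work that cannot be parallelized or overlapped. The symbols below are assumptions introduced for illustration; in particular, the fraction $f$ of decompression time that libjpeg‐turbo spends on entropy decoding is not stated in this summary, and the exact definition used in the evaluation may differ.

```latex
% T_total : libjpeg-turbo decompression time for an image
% f       : fraction of T_total spent on entropy decoding (not given here)
% r       : scan time relative to libjpeg-turbo's entropy decoding time
%           (r = 0.45 on average, as reported above)
\begin{align*}
  T_{\text{scan}} &= r \, f \, T_{\text{total}}, \\
  S_{\max} &= \frac{T_{\text{total}}}{T_{\text{scan}}} = \frac{1}{r \, f}, \\
  \frac{S_{\text{JParEnt}}}{S_{\max}} &\le 0.97 \quad (\text{0.95 on average}).
\end{align*}
```

Under this reading, a cheaper scan (smaller $r$) directly raises the ceiling on the achievable speedup.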
On the IGP‐based desktop platform, JParEnt achieves energy savings of up to 45% compared to libjpeg‐turbo's SIMD implementation.