With the emergence of social networks and improvements in computational photography, billions of JPEG images are shared and viewed on a daily basis. Desktops, tablets and smartphones constitute the vast majority of hardware platforms used for displaying JPEG images. Although these platforms are heterogeneous multicores, no existing approach combines a system's CPU and GPU for JPEG decoding. In this paper we introduce a novel JPEG decoding scheme for heterogeneous architectures consisting of a CPU and an OpenCL-programmable GPU. We employ an offline profiling step to determine the performance of a system's CPU and GPU with respect to JPEG decoding. For a given JPEG image, our performance model uses (1) the CPU and GPU performance characteristics, (2) the image entropy and (3) the width and height of the image to balance the JPEG decoding workload on the underlying hardware. Our runtime partitioning and scheduling scheme exploits task, data and pipeline parallelism by scheduling the non-parallelizable entropy decoding task on the CPU, whereas inverse discrete cosine transforms (IDCTs), color conversions and upsampling are conducted on both the CPU and the GPU. Our kernels have been optimized for GPU memory hierarchies. We have implemented the proposed method in the context of the libjpeg-turbo library, which is an industrial-strength JPEG encoding and decoding engine. Libjpeg-turbo's hand-optimized SIMD routines for ARM and x86 constitute a competitive yardstick for the comparison to the proposed approach. Retrofitting libjpeg-turbo with our method provides insights into the software-engineering aspects of re-engineering legacy code for heterogeneous multicores. We have evaluated our approach for a total of 7194 JPEG images across three high- and middle-end CPU-GPU combinations. We achieve speedups of up to 4.2x over the SIMD version of libjpeg-turbo, and speedups of up to 8.5x over its sequential code. Taking into account the non-parallelizable JPEG entropy decoding part, our approach achieves up to 95% of the theoretically attainable maximal speedup, with an average of 88%.
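The notion of a theoretically attainable maximal speedup can be illustrated with an Amdahl-style bound. The following is a minimal sketch, assuming the bound is the reciprocal of the share of baseline runtime spent in the non-parallelizable entropy decoding part; the abstract does not spell out its exact definition, and the function names and the numbers in the example are hypothetical rather than taken from the paper.

```python
def max_attainable_speedup(sequential_fraction: float) -> float:
    """Upper bound on speedup if `sequential_fraction` of the baseline
    runtime stays sequential and all remaining work is hidden or free.

    Assumption: an Amdahl-style bound, not necessarily the paper's formula.
    """
    if not 0.0 < sequential_fraction <= 1.0:
        raise ValueError("sequential_fraction must be in (0, 1]")
    return 1.0 / sequential_fraction


def fraction_of_bound(measured_speedup: float, sequential_fraction: float) -> float:
    """Share of the theoretical bound that a measured speedup reaches."""
    return measured_speedup / max_attainable_speedup(sequential_fraction)


# Hypothetical example: entropy decoding takes 25% of the baseline decode
# time and the measured speedup is 3.8x.
bound = max_attainable_speedup(0.25)
share = fraction_of_bound(3.8, 0.25)
print(f"bound = {bound:.1f}x, attained = {share:.0%}")
```

Under this reading, a smaller sequential share raises the bound, which is why reducing the cost of the sequential entropy decoding step matters as much as accelerating the parallel stages.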
Summary: The JPEG format employs Huffman codes to compress the entropy data of an image. Huffman codewords are of variable length, which makes parallel entropy decoding a difficult problem: to determine the start position of a codeword in the bitstream, the previous codeword must be decoded first. We present JParEnt, a new approach to parallel entropy decoding for JPEG decompression on heterogeneous multicores. JParEnt conducts JPEG decompression in two steps: (1) an efficient sequential scan of the entropy data on the CPU to determine the start positions (boundaries) of coefficient blocks in the bitstream, followed by (2) a parallel entropy decoding step on the graphics processing unit (GPU). The block boundary scan constitutes a reinterpretation of the Huffman-coded entropy data to determine codeword boundaries in the bitstream. We introduce a dynamic workload partitioning scheme to account for GPUs of low compute power relative to the CPU. This configuration has become common with the advent of SoCs with integrated graphics processors (IGPs). We leverage additional parallelism through pipelined execution across CPU and GPU. For systems providing a unified address space between CPU and GPU, we employ zero-copy to completely eliminate the data transfer overhead. Our experimental evaluation of JParEnt was conducted on six heterogeneous multicore systems: one server and two desktops with dedicated GPUs, one desktop with an IGP, and two embedded systems. For a selection of more than 1000 JPEG images, JParEnt outperforms the SIMD implementation of the libjpeg-turbo library by a factor of up to 4.3×, and the previously fastest JPEG decompression method for heterogeneous multicores by a factor of up to 2.2×. JParEnt's entropy data scan consumes 45% of the entropy decoding time of libjpeg-turbo on average. Given this new ratio for the sequential part of JPEG decompression, JParEnt achieves up to 97% of the maximum attainable speedup (95% on average). On the IGP-based desktop platform, JParEnt achieves energy savings of up to 45% compared to libjpeg-turbo's SIMD implementation.
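The two-step idea of a sequential boundary scan followed by independent block decoding can be sketched on a toy prefix code. This is only an illustration under loud assumptions: the code table `CODE`, the fixed `BLOCK_SYMBOLS` count and all function names are hypothetical, real JPEG blocks contain a variable number of run-length codewords plus differential DC coding, and a thread pool stands in for the GPU. It is not JParEnt's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy prefix code, not a JPEG Huffman table.
CODE = {"0": "a", "10": "b", "110": "c", "111": "d"}
ENCODE = {symbol: bits for bits, symbol in CODE.items()}
MAX_LEN = max(len(bits) for bits in CODE)
BLOCK_SYMBOLS = 4  # toy stand-in for one coefficient block


def encode(text: str) -> str:
    """Concatenate the variable-length codewords of `text`."""
    return "".join(ENCODE[symbol] for symbol in text)


def codeword_length(bits: str, pos: int) -> int:
    """Length of the codeword starting at bit offset `pos`."""
    for length in range(1, MAX_LEN + 1):
        if bits[pos:pos + length] in CODE:
            return length
    raise ValueError(f"invalid codeword at bit {pos}")


def scan_boundaries(bits: str) -> list[int]:
    """Step 1 (sequential): skip over codewords without emitting symbols
    and record the bit offset at which each block starts."""
    starts, pos, symbols = [], 0, 0
    while pos < len(bits):
        if symbols % BLOCK_SYMBOLS == 0:
            starts.append(pos)
        pos += codeword_length(bits, pos)
        symbols += 1
    return starts


def decode_block(bits: str, start: int, end: int) -> str:
    """Decode one block; needs only its own bit range."""
    out, pos = [], start
    while pos < end:
        n = codeword_length(bits, pos)
        out.append(CODE[bits[pos:pos + n]])
        pos += n
    return "".join(out)


def parallel_decode(bits: str) -> list[str]:
    """Step 2 (parallel): decode all blocks independently."""
    starts = scan_boundaries(bits)
    ends = starts[1:] + [len(bits)]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda r: decode_block(bits, *r), zip(starts, ends)))


bits = encode("abcdabba")
print(scan_boundaries(bits))   # block start offsets in bits
print(parallel_decode(bits))   # one decoded string per block
```

The point of the sketch is the dependency structure: only `scan_boundaries` has to walk the bitstream in order, while each `decode_block` call depends solely on its own start and end offsets and can therefore run concurrently.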