HEVC in-loop filters GPU parallelization in embedded systems

Souza, Diego F. de; Ilić, Aleksandar; Roma, Nuno; Sousa, Leonel

doi:10.1109/samos.2015.7363667

Cited by 17 publications

(22 citation statements)

References 9 publications


“…Regarding software-based GPU acceleration for video decoding, most of previous work targets only single HEVC decoding modules, such as Inverse Transform (IT) in [14,19], Motion Compensation (MC) in [9], Intra Prediction (IP) in [11], Deblocking Filter (DBF) in [16,25], and in-loop filters in [10]. In particular, Souza et al [13] presented a set of optimized GPU kernels, where they optimized and integrated individual HEVC modules.…”

Section: Related Work (mentioning)

confidence: 99%

Highly parallel HEVC decoding for heterogeneous systems with CPU and GPU

Wang, Souza, Alvarez-Mesa, et al. 2018

Signal Processing: Image Communication

Self Cite


The High Efficiency Video Coding (HEVC) standard provides higher compression efficiency than other video coding standards but at the cost of an increased computational load, which makes it hard to achieve real-time encoding/decoding for ultra high-resolution and high-quality video sequences. Graphics Processing Units (GPUs) are known to provide massive processing capability for highly parallel and regular computing kernels, but not all HEVC decoding procedures are suited for GPU execution. Furthermore, if HEVC decoding is accelerated by GPUs, energy efficiency is another concern for heterogeneous CPU+GPU decoding. In this paper, a highly parallel HEVC decoder for heterogeneous CPU+GPU systems is proposed. It exploits available parallelism in HEVC decoding on the CPU, on the GPU, and between the CPU and GPU devices simultaneously. On top of that, different workload balancing schemes can be selected according to the devoted CPU and GPU computing resources. Furthermore, an energy-optimized solution is proposed by tuning GPU clock rates. Results show that the proposed decoder achieves better performance than the state-of-the-art CPU decoder, and that the best-performing workload balancing scheme depends on the available CPU and GPU computing resources. In particular, with an NVIDIA Titan X Maxwell GPU and an Intel Xeon E5-2699v3 CPU, the proposed decoder delivers 167 frames per second (fps) for Ultra HD 4K videos when four CPU cores are used. Compared to the state-of-the-art CPU decoder using four CPU cores, the proposed decoder achieves a speedup factor of 2.2×. When decoding performance is bounded by the CPU, a system-wise energy reduction of up to 36% is achieved by using fixed (and lower) GPU clocks, compared to the default dynamic clock settings on the GPU.



“…For each GPU kernel, their thread block mapping is shown at the bottom. These decoding modules will be briefly introduced since their algorithm has been elaborated individually in [8]- [12]. For all target kernels, one common optimization is concerned with their ability to support video sequences with 10-bit depth, while previous approaches could only decode bitstreams with 8-bit depth.…”

Section: B. Optimization of the Decoding Procedures for GPU Execution (mentioning)

confidence: 99%

“…For each sub-filter, two edges in the same direction can be processed at the same time. The thread mapping of the DBF has been optimized over [11] and [12], where an area of 256×8 samples is cooperatively processed by two warps within a thread block. When the horizontal filter starts, each warp maps to a set of 256×4 samples, where each thread maps to one horizontal edge of 8×4 samples.…”

Section: Global Memory Host Memory (mentioning)

confidence: 99%
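The deblocking-filter mapping quoted above can be checked with a little index arithmetic. The following is only a sketch: the tile, warp, and per-thread sizes come from the quoted statement, while the function names and the exact layout (warps stacked vertically, lanes placed side by side) are assumptions made for illustration, not the authors' kernel.

```python
WARP_SIZE = 32             # threads per warp on NVIDIA GPUs
TILE_W, TILE_H = 256, 8    # samples per thread block (from the quoted text)
WARPS_PER_BLOCK = 2        # from the quoted text
SEG_W, SEG_H = 8, 4        # samples per thread (from the quoted text)


def dbf_thread_region(block_x, block_y, warp, lane):
    """Return (x0, y0, x1, y1) for one thread, upper bounds exclusive.

    Assumed layout: each warp owns a 256x4 strip of the tile and each
    lane owns one 8x4 segment inside that strip.
    """
    x0 = block_x * TILE_W + lane * SEG_W
    y0 = block_y * TILE_H + warp * SEG_H
    return (x0, y0, x0 + SEG_W, y0 + SEG_H)


def covers_tile_exactly():
    """Check that the 64 threads cover the 256x8 tile with no overlap."""
    seen = set()
    for warp in range(WARPS_PER_BLOCK):
        for lane in range(WARP_SIZE):
            x0, y0, x1, y1 = dbf_thread_region(0, 0, warp, lane)
            for y in range(y0, y1):
                for x in range(x0, x1):
                    if (x, y) in seen:
                        return False
                    seen.add((x, y))
    return len(seen) == TILE_W * TILE_H
```

Under this assumed layout, 32 lanes of 8 samples each span the 256-sample width, and two warps of 4 rows each span the 8-sample height, which is why the numbers in the statement are mutually consistent.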

“…Finally, the SAO kernel is followed to complete the entire decoding procedure. Compared to [12], vector processing operation is enabled by adopting a new thread mapping (see Fig. 3), where two warps are assigned for each thread block and each of them is responsible for 64×32 samples.…”

Section: Global Memory Host Memory (mentioning)

confidence: 99%
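The SAO mapping in the statement above implies a fixed amount of work per thread and per thread block. The sketch below works through that arithmetic; the warp count and the 64×32 region come from the quoted text, whereas the placement of the two warp regions side by side (a 128×32 tile per block) and the function names are assumptions, since the statement defers the actual layout to its Fig. 3.

```python
WARP_SIZE = 32               # threads per warp on NVIDIA GPUs
WARPS_PER_BLOCK = 2          # from the quoted text
WARP_W, WARP_H = 64, 32      # samples per warp (from the quoted text)


def sao_samples_per_thread():
    """Samples each thread must handle if a warp shares its region evenly."""
    return (WARP_W * WARP_H) // WARP_SIZE


def sao_grid(frame_w, frame_h):
    """Thread blocks needed per dimension, rounding up at frame borders.

    Assumption: the two warp regions sit side by side, so one block
    covers 128x32 samples. A 64x64 arrangement would be equally
    compatible with the quoted text.
    """
    tile_w = WARP_W * WARPS_PER_BLOCK
    tile_h = WARP_H
    blocks_x = -(-frame_w // tile_w)   # ceiling division
    blocks_y = -(-frame_h // tile_h)
    return (blocks_x, blocks_y)
```

With 64 samples per thread, each thread has enough contiguous data to use wide loads and stores, which fits the statement's remark that the new mapping enables vector processing.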

“…In this paper, an efficient parallelization of the HEVC decoder for heterogeneous CPU+GPU platforms is presented. To attain such objective, most of the HEVC procedures had to be re-designed so that sequential entropy decoder is executed on the CPU and the remaining decoding kernels are migrated and further optimized to be executed on the GPU [8]- [12]. Furthermore, a pipeline decoding scheme has been implemented between the CPU and the GPU, where both devices execute their tasks in parallel.…”

Section: Introduction (mentioning)

confidence: 99%
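The benefit of the CPU/GPU pipeline described in the statement above can be illustrated with a simple timing model. This is a sketch under stated assumptions: the per-frame times are hypothetical placeholders rather than measurements from the paper, the function names are invented for illustration, and the model ignores transfer overheads and assumes unlimited buffering between the two stages.

```python
def pipelined_finish_times(cpu_ms, gpu_ms, n_frames):
    """Finish time of each frame when CPU and GPU work on different frames.

    The CPU entropy-decodes frames back to back. The GPU starts frame i
    once the CPU has finished frame i and the GPU has finished frame i-1.
    """
    finish = []
    cpu_done = 0
    gpu_done = 0
    for _ in range(n_frames):
        cpu_done += cpu_ms                   # CPU stage for this frame
        gpu_start = max(cpu_done, gpu_done)  # wait for input and a free GPU
        gpu_done = gpu_start + gpu_ms        # GPU stage for this frame
        finish.append(gpu_done)
    return finish


def serial_total(cpu_ms, gpu_ms, n_frames):
    """Total time when each frame runs CPU then GPU with no overlap."""
    return n_frames * (cpu_ms + gpu_ms)
```

In this model the steady-state throughput is set by the slower of the two stages rather than by their sum, which is why overlapping the CPU entropy decoder with the GPU kernels pays off.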


Efficient HEVC decoder for heterogeneous CPU with GPU systems

Wang, Alvarez-Mesa, Ching, et al. 2016

2016 IEEE 18th International Workshop on Multimedia Signal Processing (MMSP)

Self Cite


The High Efficiency Video Coding (HEVC) standard provides higher compression efficiency than other video coding standards but at the cost of increased computational load, which makes it hard to achieve real-time encoding/decoding of high-resolution, high-quality video sequences. In this paper, we investigate how Graphics Processing Units (GPUs) can be employed to accelerate HEVC decoding. GPUs are known to provide massive processing capability for throughput computing kernels, but the HEVC entropy decoding kernel cannot be executed efficiently on GPUs. We therefore propose a complete HEVC decoding solution for heterogeneous CPU+GPU systems, in which the entropy decoder is executed on the CPU and the remaining kernels on the GPU. Furthermore, the decoder is pipelined such that the CPU and the GPU can decode different frames in parallel. The proposed CPU+GPU decoder achieves an average frame rate of 150 frames per second for Ultra HD 4K video sequences when four CPU cores are used with an NVIDIA GeForce Titan X GPU.
