2018
DOI: 10.1049/iet-cdt.2017.0149

CUDA memory optimisation strategies for motion estimation

Abstract: As video processing technologies continue to grow in complexity and image resolution faster than central processing unit (CPU) performance, data-parallel computing methods will become even more important. In fact, the high-performance, data-parallel architecture of modern graphics processing units (GPUs) can reduce execution times by orders of magnitude or more. However, creating an optimal GPU implementation requires not only converting sequential implementations of algorithms into parallel ones but, more important…

Cited by 11 publications (5 citation statements) · References 14 publications
“…Finally, the kernel (GPU) returns to the caller (CPU) a set of indexed arrays, covering the whole search area [40], with the minimum distortion ratios as well as the corresponding motion vectors. To minimize the data transfer costs, memory optimization strategies are used between the kernel and the host [41,42].…”
Section: Methods (mentioning)
confidence: 99%
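The transfer-confinement pattern this statement describes can be pictured with a short host-side sketch. This is a minimal illustration, not the cited implementation: all names (motionSearchKernel, estimateMotion, MotionVector) are hypothetical, the kernel body is a placeholder, and error checking is omitted. The point is that the frames are uploaded once and only the compact per-block cost/MV arrays return to the host.

```cuda
// Minimal sketch of confining host<->device traffic to compact result
// arrays. Names are hypothetical; the kernel body is a placeholder.
#include <cuda_runtime.h>
#include <cstdint>

struct MotionVector { int16_t dx, dy; };

// Placeholder kernel: the real block-matching search logic is omitted.
__global__ void motionSearchKernel(const uint8_t* cur, const uint8_t* ref,
                                   int numBlocks,
                                   uint32_t* minCost, MotionVector* bestMV)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numBlocks) {
        minCost[i] = 0;                    // real search would fill these in
        bestMV[i]  = MotionVector{0, 0};
    }
}

void estimateMotion(const uint8_t* hCur, const uint8_t* hRef,
                    int width, int height, int numBlocks,
                    uint32_t* hMinCost, MotionVector* hBestMV)
{
    uint8_t *dCur, *dRef;
    uint32_t* dMinCost;
    MotionVector* dBestMV;
    size_t frameBytes = (size_t)width * height;

    cudaMalloc(&dCur, frameBytes);
    cudaMalloc(&dRef, frameBytes);
    cudaMalloc(&dMinCost, numBlocks * sizeof(uint32_t));
    cudaMalloc(&dBestMV, numBlocks * sizeof(MotionVector));

    // One bulk upload per frame pair; the search area itself never
    // travels back to the host.
    cudaMemcpy(dCur, hCur, frameBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dRef, hRef, frameBytes, cudaMemcpyHostToDevice);

    motionSearchKernel<<<(numBlocks + 255) / 256, 256>>>(dCur, dRef,
                                                         numBlocks,
                                                         dMinCost, dBestMV);

    // Only the compact per-block results cross the bus back to the CPU.
    cudaMemcpy(hMinCost, dMinCost, numBlocks * sizeof(uint32_t),
               cudaMemcpyDeviceToHost);
    cudaMemcpy(hBestMV, dBestMV, numBlocks * sizeof(MotionVector),
               cudaMemcpyDeviceToHost);

    cudaFree(dCur); cudaFree(dRef);
    cudaFree(dMinCost); cudaFree(dBestMV);
}
```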
“…The minimum distortion ratios from the previous processing step, which concern the whole search area, determine the motion vectors selected for the next step in the host program (CPU). Memory optimization strategies are used to confine the data transfers between the host and the kernel [36]. The minimum estimated RD-ratio is defined by Equation (1), which is also computed inside the kernel (a GPU program written in a modified C programming language).…”
Section: Methods (mentioning)
confidence: 99%
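The statement refers to Equation (1) of the citing paper, which is not reproduced on this page. As a hedged stand-in, the device helper below uses the standard Lagrangian rate-distortion cost J = D + λ·R; the actual Equation (1) may differ, and the function name, the lambda parameter, and the rate term are assumptions of this sketch.

```cuda
// Hedged stand-in for the unreproduced Equation (1): the standard
// Lagrangian rate-distortion cost J = D + lambda * R.
#include <cstdint>

__device__ inline float rdCost(uint32_t sad, uint32_t mvBits, float lambda)
{
    // Distortion (SAD) plus the rate of coding the motion vector,
    // weighted by the Lagrange multiplier.
    return (float)sad + lambda * (float)mvBits;
}
```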
“…For that reason, to minimize the data transfer costs, the kernel (GPU program) is implemented as a single program that executes both the A and B functions described above at once, without having to return partial results to the host. Only when the 4-phase search cycle is completed does the host receive the final minimum RD cost array with the respective MVs [36]. From this point on, the CPU thread continues with the next step.…”
Section: Methods (mentioning)
confidence: 99%
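A minimal sketch of the fused-kernel idea described above: the whole four-phase search runs inside one kernel, so nothing crosses the bus between phases and only the final minimum-cost array and motion vectors are written out. The one-thread-per-macroblock layout, step sizes, and block addressing are assumptions of this sketch, not the cited implementation, and boundary handling is omitted.

```cuda
// Illustrative fused kernel: all four search phases execute on the device;
// only the final results are written out for the host to copy back.
#include <cstdint>

struct MotionVector { int16_t dx, dy; };

__device__ uint32_t sad16x16(const uint8_t* cur, const uint8_t* ref,
                             int stride, int dx, int dy)
{
    uint32_t s = 0;
    for (int y = 0; y < 16; ++y)
        for (int x = 0; x < 16; ++x) {
            int d = (int)cur[y * stride + x]
                  - (int)ref[(y + dy) * stride + (x + dx)];
            s += (uint32_t)(d < 0 ? -d : d);
        }
    return s;
}

__global__ void fusedFourPhaseSearch(const uint8_t* cur, const uint8_t* ref,
                                     int stride, int numBlocks,
                                     uint32_t* minCost, MotionVector* bestMV)
{
    int blk = blockIdx.x * blockDim.x + threadIdx.x;
    if (blk >= numBlocks) return;

    // Illustrative addressing: macroblocks laid out along one row.
    const uint8_t* curBlk = cur + blk * 16;
    const uint8_t* refBlk = ref + blk * 16;

    MotionVector mv = {0, 0};
    uint32_t best = sad16x16(curBlk, refBlk, stride, 0, 0);

    // Both search functions and their refinements stay on the GPU; nothing
    // returns to the host until the 4-phase cycle completes.
    for (int phase = 0; phase < 4; ++phase) {
        int step = 8 >> phase;            // 8, 4, 2, 1
        int bx = mv.dx, by = mv.dy;       // best candidate this phase
        for (int dy = -step; dy <= step; dy += step)
            for (int dx = -step; dx <= step; dx += step) {
                uint32_t c = sad16x16(curBlk, refBlk, stride,
                                      mv.dx + dx, mv.dy + dy);
                if (c < best) { best = c; bx = mv.dx + dx; by = mv.dy + dy; }
            }
        mv.dx = (int16_t)bx;              // recentre for the next phase
        mv.dy = (int16_t)by;
    }
    minCost[blk] = best;                  // final results only
    bestMV[blk]  = mv;
}
```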
“…To better combine the previous CPU parallelization techniques, [9] proposed two joint algorithms based on WPP and on a traditional GOP-based division pattern. On the other hand, the effective parallel implementation of crucial parts of ME is very important [10], [11]. Sayadi et al. [10] propose memory optimization strategies to make full use of GPU resources and accelerate ME.…”
Section: Introduction (mentioning)
confidence: 99%
“…On the other hand, the effective parallel implementation of crucial parts of ME is very important [10], [11]. Sayadi et al. [10] propose memory optimization strategies to make full use of GPU resources and accelerate ME. Because the calculation of SAD or SSD is a time-consuming part of ME, [11] proposes a fast parallel implementation of SAD or SSD using a parallel reduction technique.…”
Section: Introduction (mentioning)
confidence: 99%
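The parallel-reduction SAD that this statement attributes to [11] typically looks like the classic shared-memory tree reduction below. This sketch assumes one 256-thread block per 16x16 candidate and a hypothetical kernel name (sadReduce16x16); the per-candidate pointer offsets a full search would need are omitted.

```cuda
// Sketch of a parallel-reduction SAD: 256 threads each compute one absolute
// difference of a 16x16 block, then a shared-memory tree reduction sums them.
#include <cstdint>

__global__ void sadReduce16x16(const uint8_t* cur, const uint8_t* ref,
                               int stride, uint32_t* sadOut)
{
    __shared__ uint32_t partial[256];
    int t = threadIdx.x;                  // 256 threads = 16x16 pixels
    int x = t & 15, y = t >> 4;

    int d = (int)cur[y * stride + x] - (int)ref[y * stride + x];
    partial[t] = (uint32_t)(d < 0 ? -d : d);
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int s = 128; s > 0; s >>= 1) {
        if (t < s) partial[t] += partial[t + s];
        __syncthreads();
    }
    if (t == 0) sadOut[blockIdx.x] = partial[0];   // one SAD per candidate
}
```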