On Combining Wavefront and Tile Parallelism with a Novel GPU-Friendly Fast Search

Papaioannou, Georgios; Koziri, Maria; Loukopoulos, Thanasis; Anagnostopoulos, Ioannis

doi:10.3390/electronics12102223

Cited by 2 publications

(9 citation statements)

References 28 publications


“…This modified encoder uses tiles as a primary parallelism scheme but internally uses WPP for each tile separately with multiple CPU cores working in a parallel pattern. Extending the original WTP encoder, the GPU's fast search algorithm has been demonstrated (4PhaseFS) [10] to speed up the motion estimation calculations, which has been proven to be one of the most intensive tasks in a video encoder. As in our previous works, to achieve the best performance from our experiments, the application thread pool was fed with a pre-selected number of threads that matched exactly with the physical CPU cores we had available for our setup.…”

Section: Methods (mentioning)

confidence: 99%

“…In contrast, optimized software algorithms have been introduced to reduce the computational load [6][7][8] in variants of existing software video encoders without any additional costs. The proposed work in this paper is also a software algorithm that extends our proven hybrid encoder [9,10] to increase the speedup times by minimizing the calculations of the fraction motion estimation (FME) part by skipping them when specific criteria are met inside a fraction execution resolver (FER) function. Our encoder already has a GPU-friendly fast integer motion estimation algorithm [10], so this extension focuses only on the fraction optimization part.…”

Section: Introduction (mentioning)

confidence: 99%

“…The proposed work in this paper is also a software algorithm that extends our proven hybrid encoder [9,10] to increase the speedup times by minimizing the calculations of the fraction motion estimation (FME) part by skipping them when specific criteria are met inside a fraction execution resolver (FER) function. Our encoder already has a GPU-friendly fast integer motion estimation algorithm [10], so this extension focuses only on the fraction optimization part. As we target mostly the end users, we only exploit and capitalize common hardware setups such as video cards.…”

Section: Introduction (mentioning)

confidence: 99%

“…We extended our previous hybrid encoding model [9,10] to support the proposed fraction execution resolver (FER) algorithm, which has been proven to be fast, easy to implement, and cost effective, minimizing in this way the computational load and thus the encoding times even further. This encoding model effectively utilizes both CPUs and GPUs in a flexible tiling selection scheme to balance speed and quality.…”

Section: Introduction (mentioning)

confidence: 99%

“…So, each tile uses the standard WPP encoding pattern. GPU power is engaged [10] to improve the coding efficiency for each tile separately only for the integer motion estimation (IME) part and falls back to the CPU for the fraction refinement as a subsequent and mandatory job. The fraction part always follows the IME part, but it is very difficult to parallelize the default algorithm because of the dependencies it has.…”

Section: Introduction (mentioning)

confidence: 99%


Fraction Execution Resolver Using a Hybrid Multi-CPU/GPU Encoding Scheme

Papaioannou, Koziri, Loukopoulos et al. 2023

Electronics

Self Cite


Modern video coding standards make use of sub-pixel motion estimation to improve the video quality and reduce the bitrate. It is known that the fraction motion estimation (FME) part follows the integer motion estimation (IME) and adds an extra computational overhead due to the interpolation and the additional motion searches. In this paper, we propose a fraction execution resolver (FER) algorithm that lets the encoder skip the fraction part when specific criteria are met by introducing a preliminary fast test decision point (pFTDP) function for the IME part. If the pFTDP returns zero motion vectors (MVs) and the displacement search area center is also zero, then the fraction part is skipped. The pFTDP decision maker is executed only once, when a 2N × 2N block is first met, while all subsequent blocks follow this initial decision either by receiving the necessary MVs and RD from the pFTDP function or by using the precalculated IME values from the GPU kernel. For our experiments, we use a multithreaded CPU environment that also makes use of GPUs only for the integer part. Our evaluations provide a greater than 1600% encoding time saving at its peak in comparison with the default HEVC sequential mode and ideally a saving of greater than 2286% for still video frame sequences. The total average speedup for both Class A and Class B video sequences is ×13.45. The gain of the FER itself is more than ×3.9 compared with the same multithreaded setup environment. The PSNR and bitrate overhead observed are proportional to the tiling scheme used and are more related to the way CABAC works internally. The FER’s negative effects on coding efficiency are proven to be negligible. A balance between speed and quality achieved by using a lower tiling pattern is shown to minimize the negative effects of the encoding scheme pattern. 
The experimental results confirm the validity of our motivation, namely, that we can benefit from a software fraction execution resolver without any extra hardware costs. The gain is further increased when video sequences have more static blocks than others.
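The skip rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `PreliminaryResult`, `should_skip_fme`, and `FractionExecutionResolver` are hypothetical, and the preliminary test is passed in as a plain callable standing in for the pFTDP function. The sketch only shows the two ideas stated in the abstract: the fraction part is skipped when both the returned motion vector and the search area center are zero, and the decision is taken once per 2N × 2N block and reused afterwards.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, Tuple

MotionVector = Tuple[int, int]
ZERO_MV: MotionVector = (0, 0)


@dataclass(frozen=True)
class PreliminaryResult:
    """Hypothetical container for what a pFTDP-like test returns."""
    mv: MotionVector             # integer motion vector found by the test
    search_center: MotionVector  # center of the displacement search area


def should_skip_fme(result: PreliminaryResult) -> bool:
    """Skip the fraction part only if both the MV and the search center are zero."""
    return result.mv == ZERO_MV and result.search_center == ZERO_MV


class FractionExecutionResolver:
    """Caches one skip decision per 2N x 2N block (assumed to be hashable ids)."""

    def __init__(self, preliminary_test: Callable[[Hashable], PreliminaryResult]):
        self._test = preliminary_test
        self._decisions: Dict[Hashable, bool] = {}
        self.test_calls = 0  # counts how often the preliminary test actually ran

    def skip_fraction(self, block_id: Hashable) -> bool:
        # Run the preliminary test only the first time this block is met;
        # later queries for the same block reuse the stored decision.
        if block_id not in self._decisions:
            self.test_calls += 1
            self._decisions[block_id] = should_skip_fme(self._test(block_id))
        return self._decisions[block_id]
```

In this sketch a static block (zero MV, zero center) is marked as skippable on first contact, and any sub-block query for the same 2N × 2N id returns the cached answer without rerunning the test.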


Section: Methods (mentioning)

confidence: 99%