The primary function of multimedia systems is to seamlessly transform and display content to users while maintaining the perception of acceptable quality. For images and videos, perceptual quality assessment algorithms play an important role in determining what is acceptable quality and what is unacceptable from a human visual perspective. As modern image quality assessment (IQA) algorithms gain widespread adoption, it is important to strike a balance between their computational efficiency and their quality prediction accuracy. One way to improve computational performance to meet real-time constraints is to use simplistic models of visual perception, but such an approach suffers from poor prediction quality and limited robustness to changing distortions and viewing conditions. In this paper, we investigate the advantages and potential bottlenecks of implementing a best-in-class IQA algorithm, Most Apparent Distortion (MAD), on graphics processing units (GPUs). Our results suggest that an understanding of the GPU and CPU architectures, combined with detailed knowledge of the IQA algorithm, can lead to non-trivial speedups without compromising prediction accuracy. A single-GPU and a multi-GPU implementation achieved 24× and 33× speedups, respectively, over the baseline CPU implementation. A bottleneck analysis revealed the kernels with the highest runtimes, and a microarchitectural analysis illustrated the underlying reasons for those high runtimes. Programs written with optimizations such as blocking, which map well to CPU memory hierarchies, do not map well to the GPU's memory hierarchy. While the compute unified device architecture (CUDA) is convenient and powerful for general-purpose GPU (GPGPU) programming, knowledge of how a program interacts with the underlying hardware is essential for understanding and resolving performance bottlenecks.
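To illustrate the kind of multi-GPU decomposition referred to above, the sketch below splits the reference and distorted images into horizontal slices and assigns one slice to each device. This is a minimal sketch under assumed conditions: the kernel body is only a placeholder standing in for one MAD stage, and the helper names (process_slice, run_multi_gpu) and the slicing scheme are illustrative, not taken from the paper's implementation.

// Hypothetical sketch of a multi-GPU decomposition: each device processes a
// horizontal slice of the reference/distorted image pair. The kernel body is
// a placeholder for one MAD stage; names and the slicing scheme are assumed.
#include <cuda_runtime.h>

__global__ void process_slice(const float *ref, const float *dst, float *out,
                              int width, int rows)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= rows) return;
    float d = ref[y * width + x] - dst[y * width + x];  // placeholder work
    out[y * width + x] = d * d;
}

void run_multi_gpu(const float *h_ref, const float *h_dst, float *h_out,
                   int width, int height)
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    int rows_per_dev = (height + ndev - 1) / ndev;
    float *d_ref[16], *d_dst[16], *d_out[16];  // assume at most 16 devices

    // Phase 1: copy each slice to its device and launch its kernel; kernel
    // launches are asynchronous, so the devices proceed in parallel.
    for (int dev = 0; dev < ndev; ++dev) {
        cudaSetDevice(dev);
        int row0 = dev * rows_per_dev;
        int rows = height - row0 < rows_per_dev ? height - row0 : rows_per_dev;
        d_out[dev] = 0;
        if (rows <= 0) continue;
        size_t bytes = (size_t)rows * width * sizeof(float);
        cudaMalloc(&d_ref[dev], bytes);
        cudaMalloc(&d_dst[dev], bytes);
        cudaMalloc(&d_out[dev], bytes);
        cudaMemcpy(d_ref[dev], h_ref + (size_t)row0 * width, bytes,
                   cudaMemcpyHostToDevice);
        cudaMemcpy(d_dst[dev], h_dst + (size_t)row0 * width, bytes,
                   cudaMemcpyHostToDevice);
        dim3 block(16, 16);
        dim3 grid((width + block.x - 1) / block.x,
                  (rows + block.y - 1) / block.y);
        process_slice<<<grid, block>>>(d_ref[dev], d_dst[dev], d_out[dev],
                                       width, rows);
    }

    // Phase 2: copy each slice of the result back and free device memory.
    for (int dev = 0; dev < ndev; ++dev) {
        if (!d_out[dev]) continue;
        cudaSetDevice(dev);
        int row0 = dev * rows_per_dev;
        int rows = height - row0 < rows_per_dev ? height - row0 : rows_per_dev;
        size_t bytes = (size_t)rows * width * sizeof(float);
        cudaMemcpy(h_out + (size_t)row0 * width, d_out[dev], bytes,
                   cudaMemcpyDeviceToHost);
        cudaFree(d_ref[dev]);
        cudaFree(d_dst[dev]);
        cudaFree(d_out[dev]);
    }
}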
Due to the massive popularity of digital images and videos over the past several decades, the need for automated quality assessment (QA) is greater than ever. Accordingly, QA research has focused primarily on improving prediction accuracy. However, for many application areas, such as consumer electronics, runtime performance and related computational considerations are just as important as accuracy. Most modern QA algorithms are computationally complex, but this complexity does not necessarily prevent them from achieving low runtimes if hardware resources are used appropriately. GPUs, which offer a large amount of parallelism and a specialized memory hierarchy, should be well suited for deploying QA algorithms. In this paper, we analyze a massively parallel GPU implementation of the Most Apparent Distortion (MAD) full-reference image QA algorithm, with optimizations guided by a microarchitectural analysis. A shared-memory-based implementation of the local-statistics computation yielded a 25% speedup over the original implementation. We describe the optimizations that produce the best results, and we justify our recommendations with descriptions of their microarchitectural underpinnings. Although our study focuses on a single algorithm, the image-processing primitives it uses are fundamentally similar to those found in most modern QA algorithms.
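For concreteness, a minimal sketch of a shared-memory formulation of a local-statistics stage is given below: each thread block stages a tile of the image plus a halo in shared memory, and each thread then computes the mean and standard deviation of its local window from that tile. The tile size, window size, kernel name, and border handling are assumptions for illustration and are not the paper's actual kernel.

// Hypothetical sketch: tiled local-statistics kernel (mean and standard
// deviation over a WIN x WIN neighborhood). TILE, WIN, and the clamped
// border handling are assumed values, not taken from the paper's code.
#include <cuda_runtime.h>

#define TILE 16               // threads per block edge (assumed)
#define WIN   8               // local window edge (assumed)
#define HALO (WIN / 2)
#define EDGE (TILE + WIN)     // shared-memory tile edge, including halo

__global__ void local_stats(const float *img, float *mean, float *stdev,
                            int width, int height)
{
    __shared__ float tile[EDGE][EDGE];

    int gx = blockIdx.x * TILE + threadIdx.x;
    int gy = blockIdx.y * TILE + threadIdx.y;

    // Cooperatively load the block's tile plus halo, clamping at the border.
    for (int dy = threadIdx.y; dy < EDGE; dy += TILE) {
        for (int dx = threadIdx.x; dx < EDGE; dx += TILE) {
            int sx = min(max((int)(blockIdx.x * TILE) + dx - HALO, 0), width - 1);
            int sy = min(max((int)(blockIdx.y * TILE) + dy - HALO, 0), height - 1);
            tile[dy][dx] = img[sy * width + sx];
        }
    }
    __syncthreads();

    if (gx >= width || gy >= height) return;

    // Each thread reads its WIN x WIN window from shared memory only.
    float sum = 0.0f, sumsq = 0.0f;
    for (int wy = 0; wy < WIN; ++wy)
        for (int wx = 0; wx < WIN; ++wx) {
            float v = tile[threadIdx.y + wy][threadIdx.x + wx];
            sum   += v;
            sumsq += v * v;
        }

    float n   = (float)(WIN * WIN);
    float mu  = sum / n;
    float var = fmaxf(sumsq / n - mu * mu, 0.0f);  // guard against round-off
    mean[gy * width + gx]  = mu;
    stdev[gy * width + gx] = sqrtf(var);
}

Launched with dim3 block(TILE, TILE) and a grid covering the image, a kernel of this shape fetches each pixel from global memory roughly once per block rather than once per window element, which is the kind of shared-memory reuse that motivates the 25% speedup discussed above.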