Sparse matrix-vector multiplication (SpMV) is an essential linear algebra operation that dominates the computing cost of many scientific applications. Because they provide massive parallelism and high memory bandwidth, GPUs are commonly used to accelerate SpMV kernels. Prior studies have mainly focused on reducing the latency of SpMV kernels on GPUs by tackling the irregular nature of sparse matrices. However, limited attempts have been made to improve the energy efficiency (MFLOPS/Watt) of SpMV kernels, which has kept GPUs out of low-power scientific applications. Furthermore, prior work has concentrated on optimizing the sparse matrix storage format and has largely ignored the impact of tuning compilation parameters (e.g., \texttt{maxrregcount} and thread block size). Lastly, little attention has been paid to building a comprehensive training dataset of SpMV kernel runs and to tuning the hyperparameters of machine learning-based storage format predictors. To address these limitations, we present a novel learning-based framework, dubbed Auto-SpMV, that enables energy-efficient and low-latency SpMV kernels on GPUs. To achieve the best runtime performance, Auto-SpMV offers two optimization modes: \textit{compile-time} and \textit{run-time}. In the \textit{compile-time} mode, Auto-SpMV tunes the compilation parameters according to the optimization objective: lower latency or lower energy consumption. In the \textit{run-time} mode, Auto-SpMV selects the best sparse format for the input matrix using an optimized machine learning model. To achieve the best classification results, 1) we collect the largest dataset to date by running diverse sparse matrices under more than 15K configurations, and 2) we boost the classification models by automatically tuning their learning hyperparameters.
Experimental results reveal that in the \textit{compile-time} mode, Auto-SpMV improves latency, energy consumption, average power, and energy efficiency by up to 51.9\%, 52\%, 33.2\%, and 53\%, respectively, over the default setting. In the \textit{run-time} mode, Auto-SpMV improves average power and energy efficiency by up to 34.6\% and 99.7\%, respectively, over the default setting. Finally, our experiments show that Auto-SpMV generalizes to unseen matrices and hardware devices.