2022
DOI: 10.1109/tpds.2022.3154315
OSM: Off-Chip Shared Memory for GPUs

Cited by 9 publications (3 citation statements). References 64 publications.
“…In AMDGPU, registers spill data into global memory, and GPU compute units must go through the second-level cache to access global memory, which reduces access efficiency and introduces a speed mismatch. Future work will address storing register spill data in a relatively fast on-chip memory [16], such as an L1 cache or LDS memory, to improve access efficiency after GPU register spilling.…”
Section: Discussion
Confidence: 99%
“…When choosing which memory a thread will access, one must consider which memory spaces are visible to that thread. Because each thread block is confined to a single SM, each block is allocated its own L1 data and instruction caches, shared among its threads [75]. All threads within a thread block share read and write access to the L1 cache, and threads in other blocks cannot view this data.…”
Section: CUDA Developer Toolkit
Confidence: 99%
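The block-private visibility described in the statement above can be sketched in a minimal CUDA kernel. This is an illustrative sketch, not code from the cited works: it uses `__shared__` memory (the software-managed on-chip scratchpad, which on many NVIDIA GPUs shares hardware with the L1 cache) to show that data staged on-chip is visible only to threads within the same block; the kernel and variable names are hypothetical.

```cuda
#include <cstdio>

// Sketch: each thread block reverses its own 256-element tile via
// __shared__ memory, which is visible only to threads in that block.
__global__ void reverseTile(const int *in, int *out, int n) {
    __shared__ int tile[256];          // block-private on-chip buffer
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) {
        tile[threadIdx.x] = in[gid];   // stage element through shared memory
        __syncthreads();               // barrier for threads in THIS block only
        out[gid] = tile[blockDim.x - 1 - threadIdx.x];
    }
}

int main() {
    const int n = 256;
    int h[n], *d_in, *d_out;
    for (int i = 0; i < n; ++i) h[i] = i;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h, n * sizeof(int), cudaMemcpyHostToDevice);
    reverseTile<<<1, 256>>>(d_in, d_out, n);   // one block of 256 threads
    cudaMemcpy(h, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d %d\n", h[0], h[n - 1]);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

A thread in a different block could not read `tile`; communicating across blocks would instead require global memory, which is exactly the visibility boundary the quoted passage describes.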
“…[21,22], the DCT-based least-squares unwrapping algorithm is highly parallel and is very suitable for execution on a GPU. Recently, researchers have introduced the concept of off-chip shared memory, which improves the efficiency of communication between processes [23]. Shared memory enables multiple processes to access the same memory space, optimizing thread communication and reducing data processing time.…”
Section: Introduction
Confidence: 99%