A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function

Chang, Chen-Ting; Chen, Yu‐Sheng; Wu, I-Wei; Shann, Jyh-Jiun

doi:10.1007/978-3-642-35473-1_62

Cited by 3 publications

(3 citation statements)

References 1 publication


“…Thus, we allocate contiguous chunks of size $N_c \times N_c$ to hold a full sub-quadtree together with a $N_b \times N_b$ blocking at the lowest level. The chunks are processed using OpenMP and the code can potentially be used without modification on the Intel Xeon Phi coprocessor and through automatic source code translation [114,102,121,57] on GPGPUs. In addition, the use of OpenMP removes Charm++ compile and runtime dependencies for single-node applications, potentially significantly simplifying the build process.…”

Section: Fig 21 (mentioning)

confidence: 99%

Solvers for $\mathcal{O} (N)$ Electronic Structure in the Strong Scaling Limit

Bock

Challacombe

Kalé

2016

SIAM J. Sci. Comput.


We present a hybrid OpenMP/Charm++ framework for solving the $\mathcal{O}(N)$ Self-Consistent-Field eigenvalue problem with parallelism in the strong scaling regime, $P \gg N$, where $P$ is the number of cores and $N$ is a measure of system size, i.e. the number of matrix rows/columns, basis functions, atoms, molecules, etc. This result is achieved with a nested approach to Spectral Projection and the Sparse Approximate Matrix Multiply [Bock and Challacombe, SIAM J. Sci. Comput. 35 C72, 2013], and involves a recursive, task-parallel algorithm, often employed by generalized N-Body solvers, to occlusion and culling of negligible products in the case of matrices with decay. Employing classic technologies associated with generalized N-Body solvers, including over-decomposition, recursive task parallelism, orderings that preserve locality, and persistence-based load balancing, we obtain scaling beyond hundreds of cores per molecule for small water clusters ($[\mathrm{H_2O}]_N$, $N \in \{30, 90, 150\}$, $P/N \approx \{819, 273, 164\}$) and find support for an increasingly strong scalability with increasing system size $N$.


“…The compiler front-end [5] involves lexical analyzing, AST parsing, syntax analyzing, address qualifier parsing, vector parsing, CGIR expansion, and WHIRL lowering optimization passes. The compile process of character stream is shown in Figure 9.…”

Section: Compiler (mentioning)

confidence: 99%

“…It is a challenging task to support OpenCL program model on multicore DSP for embedded application. We address this problem by firstly utilizing the LLVM (low level virtual machine) [5] and Clang [6] open source compiler to support kernel compilation and further optimization for the DSP platform; then we designed [7,8] scheduler that aimed to schedule work-item in a work group to decrease the task switching overhead. Finally, we proposed a kind of software managed CACHE method to efficiently administrate the distributed global memory which was combined through interconnections such as PCIE, SRIO (serial rapid IO), Hyperlink, and SGMII.…”

Section: Introduction (mentioning)

confidence: 99%

A Two-Level Task Scheduler on Multiple DSP System for OpenCL

Tian

Cai

Zhou

2014

Advances in Mechanical Engineering


This paper addresses the problem that multiple-DSP systems do not support OpenCL programming. With the proposed compiler, runtime, and kernel scheduler, an OpenCL application becomes portable not only between multiple CPUs and GPUs, but also between embedded multiple-DSP systems. Firstly, the LLVM compiler was imported for source-to-source translation, in which the translated source was supported by CCS. Secondly, two-level schedulers were proposed to support efficient OpenCL kernel execution. The DSP/BIOS is used to schedule system-level tasks such as interrupts and drivers; however, its synchronization mechanism resulted in heavy overhead during task switching. So we designed an efficient second-level scheduler specifically for OpenCL kernel work-item scheduling. The context switch process utilizes the 8 functional units and cross-path links, which makes it superior to DSP/BIOS in task switching. Finally, dynamic loading and software-managed CACHE were redesigned for OpenCL running on multiple-DSP systems. We evaluated the performance using some common OpenCL kernels from the NVIDIA, AMD, NAS, and Parboil benchmarks. Experimental results show that the DSP OpenCL can efficiently exploit the computing resources of multiple cores.


PRODA: improving parallel programs on GPUs through dependency analysis

Xiong

Peng

et al.

2017

Cluster Comput

