OpenDwarfs: Characterization of Dwarf-Based Benchmarks on Fixed and Reconfigurable Architectures

Krommydas, Konstantinos F.; Feng, Wu-chun; Antonopoulos, Christos D.; Bellas, Nikolaos

doi:10.1007/s11265-015-1051-z

Cited by 30 publications

(16 citation statements)

References 16 publications


“…However, work-items accessing data along a column of the matrix do not observe memory access coalescing. Thus we observe that while cache hit rates are typically low on this benchmark, particularly on GPUs [12], GPUs can hide the latency of global memory accesses through memory access coalescing to some extent.…”

Section: Structured Grids: Speckle Reducing Anisotropic (mentioning)

confidence: 81%
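
The coalescing point in the statement above can be illustrated with a small model. This is a minimal sketch, not taken from the cited paper: it assumes a row-major matrix of 4-byte elements, a hypothetical group of 32 work-items, and 128-byte aligned memory segments (all three values are illustrative assumptions). It counts how many segments one simultaneous access by the group touches when the work-items walk along a row versus down a column.

```python
def element_address(row, col, n_cols, elem_bytes=4):
    """Byte offset of element (row, col) in a row-major matrix."""
    return (row * n_cols + col) * elem_bytes


def segments_touched(addresses, segment_bytes=128):
    """Count the distinct aligned segments that cover the given byte addresses."""
    return len({addr // segment_bytes for addr in addresses})


N_COLS = 1024   # assumed matrix width (illustrative)
GROUP = 32      # assumed number of work-items issuing one access together

# Work-item i reads element (0, i): neighbours along a row are adjacent in memory.
row_walk = [element_address(0, i, N_COLS) for i in range(GROUP)]
# Work-item i reads element (i, 0): neighbours down a column are a full row apart.
col_walk = [element_address(i, 0, N_COLS) for i in range(GROUP)]

row_segments = segments_touched(row_walk)
col_segments = segments_touched(col_walk)

print("segments for row access:   ", row_segments)
print("segments for column access:", col_segments)
```

Under these assumptions the row walk fits in a single segment, while the column walk needs one segment per work-item, which is the pattern the statement describes as not coalescing.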

“…This indicates that almost all memory accesses made by the kernel are perfectly synchronized between OpenCL threads. Performance results [9,12] show that GEM performs significantly better on GPUs than on CPUs, as memory unit stalls are at low levels for both CPUs and GPUs due to the highly efficient memory utilization of this benchmark. As memory operations do not present a bottleneck, this benchmark is able to take advantage of the superior floating-point compute capability of GPUs [12].…”

Section: N-body Methods: GEM (mentioning)

confidence: 99%

“…This rightly indicates that memory addresses accessed at each logical timestamp are very distant. On GPUs, this translates to poor utilization of both memory access coalescing and caching [12]. This trend in the parallel spatial locality metric suggests that a possible improvement for GPU performance would be to load blocks of the input matrix into on-chip local memory to reduce the number of global memory requests - this is typically performed in GPU optimized implementations of the Needleman-Wunsch algorithm [1].…”

Section: Dynamic Programming: Needleman-Wunsch (NW) (mentioning)

confidence: 99%

“…At each point in the kernels' memory access profile, each work-item with global ID (i, j) in an OpenCL work-group accesses the (i, j)th elements of various matrices. The sequential memory access pattern is non-linear since different matrices are accessed consecutively by the kernel, prohibiting ideal caching [12]. However, memory requests made simultaneously by a work-group always fall within a rectangular block of one of the matrices.…”

Section: Structured Grids: Speckle Reducing Anisotropic (mentioning)

confidence: 99%


Characterizing Optimizations to Memory Access Patterns using Architecture-Independent Program Features

Chilukuri

Milthorpe

Johnston

2020

Proceedings of the International Workshop on OpenCL


High-performance computing developers are faced with the challenge of optimizing the performance of OpenCL workloads on diverse architectures. The Architecture-Independent Workload Characterization (AIWC) tool is a plugin for the Oclgrind OpenCL simulator that gathers metrics of OpenCL programs that can be used to understand and predict program performance on an arbitrary given hardware architecture. However, AIWC metrics are not always easily interpreted and do not reflect some important memory access patterns affecting efficiency across architectures. We propose a new metric of parallel spatial locality - the closeness of memory accesses simultaneously issued by OpenCL work-items (threads). We implement the parallel spatial locality metric in the AIWC framework, and analyse gathered results on matrix multiply and the Extended OpenDwarfs OpenCL benchmarks. The differences in the observed parallel spatial locality metric across implementations of matrix multiply reflect the optimizations performed. The new metric can be used to distinguish between the OpenDwarfs benchmarks based on the memory access patterns affecting their performance on various architectures. The improvements suggested to AIWC will help HPC developers better understand memory access patterns of complex codes and guide optimization of codes for arbitrary hardware targets.
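
The idea of "closeness of memory accesses simultaneously issued by work-items" can be made concrete with a toy proxy. The sketch below is an illustrative assumption and is not the metric as defined or implemented in AIWC: it simply takes, for each logical timestep, the byte addresses touched by all work-items and averages the spread (maximum minus minimum address). The function name, the group size of 8, and the 4-byte element size are all hypothetical.

```python
def mean_address_spread(accesses_by_step):
    """Average (max - min) address over timesteps; smaller means closer accesses.

    accesses_by_step: list of lists, one inner list of byte addresses per
    logical timestep, holding the addresses touched by all work-items at
    that step. This is an illustrative proxy, not the AIWC definition.
    """
    spreads = [max(step) - min(step) for step in accesses_by_step if step]
    return sum(spreads) / len(spreads) if spreads else 0.0


WORK_ITEMS = 8   # assumed group size (illustrative)
STEPS = 4        # assumed number of logical timesteps
ELEM = 4         # assumed element size in bytes
ROW_LEN = 1024   # assumed row length in elements

# Pattern A: at step t, work-item w reads element t * WORK_ITEMS + w (adjacent).
contiguous = [[(t * WORK_ITEMS + w) * ELEM for w in range(WORK_ITEMS)]
              for t in range(STEPS)]
# Pattern B: at step t, work-item w reads element w * ROW_LEN + t (one row apart).
strided = [[(w * ROW_LEN + t) * ELEM for w in range(WORK_ITEMS)]
           for t in range(STEPS)]

contiguous_spread = mean_address_spread(contiguous)
strided_spread = mean_address_spread(strided)

print("contiguous pattern spread:", contiguous_spread)
print("strided pattern spread:   ", strided_spread)
```

With these toy traces the contiguous pattern yields a far smaller spread than the strided one, which is the kind of distinction a parallel spatial locality measure is meant to expose.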


“…In this paper, we use the OpenDwarfs benchmark suite [3], a suite of architecture-agnostic OpenCL kernels that capture common computation and communication patterns across a wide spectrum of scientific and engineering applications, to study the performance of the OpenCL programming model on FPGAs. In OpenDwarfs, none of the dwarfs contain optimizations that favor a specific architecture over another.…”

Section: Introduction (mentioning)

confidence: 99%

Bridging the Performance-Programmability Gap for FPGAs via OpenCL: A Case Study with OpenDwarfs

Krommydas

Helal

Verma

et al. 2016

2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)

Self Cite


Abstract: For decades, the streaming architecture of FPGAs has delivered accelerated performance across many application domains, such as option pricing solvers in finance, computational fluid dynamics in oil and gas, and packet processing in network routers and firewalls. However, this performance has come at the significant expense of programmability, i.e., the performance-programmability gap. In particular, FPGA developers use hardware design languages (HDLs) to implement the application data path and to design hardware modules for computation pipelines, memory management, synchronization, and communication. This process requires extensive low-level knowledge of the target FPGA architecture and consumes significant development time and effort. To address this lack of programmability of FPGAs, OpenCL provides an easy-to-use and portable programming model for CPUs, GPUs, APUs, and now, FPGAs. However, this significantly improved programmability can come at the expense of performance; that is, there still remains a performance-programmability gap. To improve the performance of OpenCL kernels on FPGAs, and thus, bridge the performance-programmability gap, we identify general techniques to optimize OpenCL kernels for FPGAs under device-specific hardware constraints. We then apply these optimization techniques to the OpenDwarfs benchmark suite, with its diverse parallelism profiles and memory access patterns, in order to evaluate the effectiveness of the optimizations in terms of performance and resource utilization. Finally, we present the performance of the optimized OpenDwarfs, along with their potential re-factoring, to bridge the performance gap from programming in OpenCL versus programming in an HDL.


Programming Model

Liu

Wei

Zhu

et al. 2022

Software Defined Chips

