Mapping the FDTD Application to Many-Core Chip Architectures

Orozco, Daniel; Gao, Guang R.

doi:10.1109/icpp.2009.44

Cited by 43 publications

(28 citation statements)

References 20 publications


“…In addition, other transformations such as tiling of stencil computations for multicore architectures have been addressed in [43], [25], [21], [34]. Recently, memory customization for stencils has been proposed in [36].…”

Section: Related Work (mentioning)

confidence: 99%


Data Layout Transformation for Stencil Computations on Short-Vector SIMD Architectures

Henretty, Stock, Pouchet, et al. 2011. Lecture Notes in Computer Science


Abstract. Stencil computations are at the core of applications in many domains such as computational electromagnetics, image processing, and partial differential equation solvers used in a variety of scientific and engineering applications. Short-vector SIMD instruction sets such as SSE and VMX provide a promising and widely available avenue for enhancing performance on modern processors. However, a fundamental memory stream alignment issue limits achieved performance with stencil computations on modern short SIMD architectures. In this paper, we propose a novel data layout transformation that avoids the stream alignment conflict, along with a static analysis technique for determining where this transformation is applicable. Significant performance increases are demonstrated for a variety of stencil codes on several modern processors with SIMD capabilities.


“…FDTD 2D: This kernel is the core computation in the widely used Finite Difference Time Domain method in Computational Electromagnetics [34]. Rician Denoise 2D: This application performs noise removal from MRI images and involves an iterative loop that performs a sequence of stencil operations.…”

Section: Jacobi 1/2/3d (mentioning)

confidence: 99%

Data Layout Transformation for Stencil Computations on Short-Vector SIMD Architectures

Henretty, Stock, Pouchet, et al. 2011. Lecture Notes in Computer Science


“…Cyclops-64 has been described extensively in previous publications [10,16,5]. Cyclops-64 was chosen for our experiments because its large number of execution units allow excellent studies in scalability and parallelism for HPC programs.…”

Section: Many-core Architecture Used (mentioning)

confidence: 99%

TIDeFlow: The Time Iterated Dependency Flow Execution Model

Orozco, Garcia, Pavel, et al. 2011. 2011 First Workshop on Data-Flow Execution Models for Extreme Scale Computing


The many-core revolution brought forward by recent advances in computer architecture has created immense challenges in the writing of parallel programs for High Performance Computing (HPC). Development of parallel HPC programs remains an art, and a universal doctrine for synchronization, scheduling and execution in general has not been found for many-core/multi-core architectures. These issues are exacerbated by the popularity of traditional execution models derived from the serial-programming school of thought. Previous solutions for parallel programming, such as OpenMP, MPI and similar models, require significant effort from the programmer to achieve high performance. This paper provides an introduction to the Time Iterated Dependency Flow (TIDeFlow) model, a parallel execution model inspired by dataflow, and a description of its associated runtime system. TIDeFlow was designed for efficient development of high performance parallel programs for many-core architectures. The TIDeFlow execution model was designed to efficiently express (1) parallel loops, (2) dependencies (data, control or other) between parallel loops and (3) to allow composability of programs. TIDeFlow is a work in progress. This paper presents an introduction to the TIDeFlow execution model and shows examples and preliminary results to illustrate the qualities of TIDeFlow. The main contributions of this paper are: 1. A brief description of the TIDeFlow execution model, and its programming model, 2. A description of the implementation of the TIDeFlow runtime system and its capabilities and 3. Preliminary results showing the suitability of TIDeFlow to express parallel programs in many-core archi…


“…In cache aware time skewing schemes, flat parallelization strategies are applied [11,12,18]. The cache sizes are known, so it is clear when it is better to parallelize the execution of the sub-tiles, forcing them into different caches, and when to leave them in the same cache for better data locality and process them sequentially with a single thread.…”

Section: Parallelism and Locality (mentioning)

confidence: 99%

Cache oblivious parallelograms in iterative stencil computations

Strzodka, Shaheen, Pająk, et al. 2010. Proceedings of the 24th ACM International Conference on Supercomputing


We present a new cache oblivious scheme for iterative stencil computations that performs beyond system bandwidth limitations as though gigabytes of data could reside in an enormous on-chip cache. We compare execution times for 2D and 3D spatial domains with up to 128 million double precision elements for constant and variable stencils against hand-optimized naive code and the automatic polyhedral parallelizer and locality optimizer PluTo and demonstrate the clear superiority of our results. The performance benefits stem from a tiling structure that caters for data locality, parallelism and vectorization simultaneously. Rather than tiling the iteration space from inside, we take an exterior approach with a predefined hierarchy, simple regular parallelogram tiles and a locality preserving parallelization. These advantages come at the cost of an irregular work-load distribution but a tightly integrated load balancer ensures a high utilization of all resources.
