Optimized code generation for finite element local assembly using symbolic manipulation

Russell, Francis P.

doi:10.1145/2491491.2491496

Cited by 7 publications

(6 citation statements)

References 19 publications


“…Throughout our symbolic manipulation, we retain coefficients as rational values wherever possible. Previous work [11] has shown that this has two benefits: firstly, we do not incur any form of floating point rounding errors during our expression manipulation; secondly, maintaining the exact values of coefficients facilitates more effective CSE which is beneficial for reducing hardware resource requirements. Currently our TARA-2 compiler does not directly support rationals or common sub-expression elimination, but could be modified to do so in future.…”

Section: Methods (mentioning)

confidence: 99%

“…Finite difference stencil expressions are amenable to common-subexpression elimination (CSE) optimisations that target properties of polynomials and would not be applied by default in either software compilers or hardware synthesis tools. Maintaining coefficients as rationals also enables more effective CSE than is possible after conversion to floating point representations [11]. At this step, it would be possible to extract expensive operations from field expressions (e.g.…”

Section: Design Synthesis From the TARA-2 DSL (mentioning)

confidence: 99%
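The point made in the two statements above, that exact rational coefficients help common-subexpression elimination (CSE), can be illustrated with a minimal sketch. This is not the TARA-2 compiler's method or the cited paper's implementation: the coefficient values and the helper `distinct_coefficients` are hypothetical, and the "CSE pass" is reduced to counting coefficients that compare equal.

```python
from fractions import Fraction

# Two routes to the same coefficient (hypothetical values for illustration).
a_float = 0.1 + 0.2          # accumulates a floating-point rounding error
b_float = 0.3
a_exact = Fraction(1, 10) + Fraction(2, 10)  # exact rational arithmetic
b_exact = Fraction(3, 10)


def distinct_coefficients(coeffs):
    """Count the coefficients that an equality-based CSE pass would treat as different."""
    return len(set(coeffs))


# As floats the two routes disagree in the last bits, so they cannot be merged.
print(distinct_coefficients([a_float, b_float]))
# As rationals they are identical, so terms using them could share one sub-expression.
print(distinct_coefficients([a_exact, b_exact]))
```

With floats the two coefficients remain distinct, while the rational versions collapse to a single value that a CSE pass could reuse.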


From Tensor Algebra to Hardware Accelerators: Generating Streaming Architectures for Solving Partial Differential Equations

Russell

Targett

Luk

2018

2018 IEEE 29th International Conference on Application-Specific Systems, Architectures and Processors (ASAP)

Self Cite


Hardware accelerators are attractive targets for running scientific simulations due to their power efficiency. Since large software simulations can take person-years to develop, it is often impractical to use hardware acceleration, which requires significantly more development effort and expertise than software development. We present the design and implementation of a proof-of-concept compiler toolchain which enables rapid prototyping of hardware finite difference solvers for partial differential equations, generated from a high-level domain-specific language. Multiple fields, grid staggering and non-linear terms are supported. We demonstrate that our approach is practical by generating and evaluating hardware designs derived from the heat and simplified shallow water equations.


“…The recent introduction of DSLs to decouple the finite element specification from its underlying implementation has facilitated the development of novel approaches. Methods based on tensor contraction [Kirby and Logg 2006] and symbolic manipulation [Russell and Kelly 2013] have been implemented. Nevertheless, it has been demonstrated that quadrature-based integration remains the most efficient choice for a wide class of problems [Ølgaard and Wells 2010], which motivates our work in COFFEE.…”

Section: Related Work (mentioning)

confidence: 99%

“…In particular, we address the well-known problem of optimizing the local assembly phase of the finite element method [Russell and Kelly 2013; Ølgaard and Wells 2010; Knepley and Terrel 2013; Kirby et al. 2005], which can be responsible for a significant fraction of the overall computation runtime, often in the range of 30% to 60%. With respect to these studies, we propose a novel set of composable code transformations targeting, for the first time, cross-loop arithmetic intensity, with emphasis on instruction-level parallelism, redundant computation, and register locality.…”

Section: Introduction (mentioning)

confidence: 99%

Cross-Loop Optimization of Arithmetic Intensity for Finite Element Local Assembly

Luporini

Vărbănescu

Rathgeber

et al.

2015

ACM Trans. Archit. Code Optim.

Self Cite


We study and systematically evaluate a class of composable code transformations that improve arithmetic intensity in local assembly operations, which represent a significant fraction of the execution time in finite element methods. Their performance optimization is indeed a challenging issue. Even though affine loop nests are generally present, the short trip counts and the complexity of mathematical expressions, which vary among different problems, make it hard to determine an optimal sequence of successful transformations. Our investigation has resulted in the implementation of a compiler (called COFFEE) for local assembly kernels, fully integrated with a framework for developing finite element methods. The compiler manipulates abstract syntax trees generated from a domain-specific language by introducing domain-aware optimizations for instruction-level parallelism and register locality. Eventually, it produces C code including vector SIMD intrinsics. Experiments using a range of real-world finite element problems of increasing complexity show that significant performance improvement is achieved. The generality of the approach and the applicability of the proposed code transformations to other domains is also discussed.


“…We carry out this research in the context of domain specific languages which have shown excellent results in generating highly optimized implementations from high-level abstractions, thereby reducing the development effort from domain scientists [10,11,12]. In this work we report on research using the OPS [13,14,15,16] framework, an embedded domain specific language (EDSL), for implementing checkpointing.…”

Section: Introduction (mentioning)

confidence: 99%

Improving resilience of scientific software through a domain-specific approach

Reguly

Mudalige

Giles

et al.

2019

Journal of Parallel and Distributed Computing


In this paper we present research on improving the resilience of the execution of scientific software, an increasingly important concern in High Performance Computing (HPC). We build on an existing high-level abstraction framework, the Oxford Parallel library for Structured meshes (OPS), developed for the solution of multi-block structured mesh-based applications, and implement an algorithm in the library to carry out checkpointing automatically, without the intervention of the user. The target applications are a hydrodynamics benchmark application from the Mantevo Suite, CloverLeaf 3D, the sparse linear solver proxy application TeaLeaf, and the OpenSBLI compressible Navier-Stokes direct numerical simulation (DNS) solver. We present (1) the basic algorithm that OPS relies on to determine the optimal checkpoint in terms of size and location, (2) improvements that supply additional information to improve the decision, (3) techniques that reduce the cost of writing the checkpoints to non-volatile storage, and (4) a performance analysis of the developed techniques on a single workstation and on several supercomputers, including ORNL's Titan. Our results demonstrate the utility of the high-level abstractions approach in automating the checkpointing process and show that performance is comparable to, or better than, the reference in all cases. We demonstrate how these techniques make it possible to create an application-level checkpointing mechanism that is almost completely transparent to the user but also delivers near-optimal performance in terms of the impact of checkpointing on the runtime of the simulation. Specifically, we make the following contributions: 1. We present the basic concepts and algorithms behind the automated checkpointing and recovery in OPS. 2. We introduce techniques that allow further improvements and more control over the checkpointing process.
