Software prefetching

Callahan, David; Kennedy, Ken; Porterfield, Allan

doi:10.1145/106972.106979

Cited by 342 publications

(57 citation statements)

References 10 publications

“…Finally, low level optimizations at the CPU pipeline include several well-known techniques. These techniques may be categorized into loop transformations [9], data access [2] and streaming optimizations (SMP, SIMD and MIMD).…”

Section: Boosting Numerical Codes (mentioning)

confidence: 99%

Unveiling WARIS Code, a Parallel and Multi-purpose FDM Framework

Cruz

Hanzich

Folch

et al. 2014

Lecture Notes in Computational Science and Engineering

Abstract. WARIS is an in-house multi-purpose framework focused on solving scientific problems using Finite Difference Methods as numerical scheme. Its framework was designed from scratch to solve in a parallel and efficient way Earth Science and Computational Fluid Dynamic problems on a wide variety of architectures. WARIS uses structured meshes to discretize the problem domains, as these are better suited for optimization in accelerator-based architectures. To succeed in such challenge, WARIS framework was initially designed to be modular in order to ease development cycles, portability, reusability and future extensions of the framework. In order to assess its performance, a code that solves the vectorial Advection-Diffusion-Sedimentation equation has been ported to the WARIS framework. This problem appears in many geophysical applications, including atmospheric transport of passive substances. As an application example, we focus on atmospheric dispersion of volcanic ash, a case in which operational code performance is critical given the threat posed by this substance on aircraft engines. Preliminary results are very promising, performance has been improved by 8.2× with respect to the baseline code using a realistic case. This opens new perspectives for operational setups, including efficient ensemble forecast.

“…The required data for the scalar execution is loaded into a software-controlled data cache near to the scalar registers. To reduce data miss penalty, we applied software pre-fetching techniques 20) where pre-fetch or pre-load instructions are inserted automatically by the compiler or manually by the programmer to bring data ahead of its use. The pre-load instruction causes a matrix block to be brought from the main memory to the data cache.…”

Section: Architecture Model (mentioning)

confidence: 99%

“…The pre-load instruction causes a matrix block to be brought from the main memory to the data cache. This pre-load instruction looks like a load instruction except no register is specified 20). To preserve the integrity of data between the scalar unit data cache and the main memory, the altered blocks in the data cache must be written back (or post-stored) into the main memory before switching the computing to the matrix unit.…”

Section: Architecture Model (mentioning)

confidence: 99%
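
The idea in the two statements above, a load-like instruction that names no destination register and only brings data toward the cache ahead of its use, can be sketched in C. This is an illustrative sketch, not code from the cited paper or from Callahan et al.: it assumes the GCC/Clang builtin `__builtin_prefetch`, and the function name `sum_with_prefetch` and the constant `PREFETCH_DISTANCE` are hypothetical names chosen for the example.

```c
#include <stddef.h>

/* Hypothetical tuning knob: how many elements ahead to prefetch.
 * A suitable value depends on the memory latency and the loop body. */
#define PREFETCH_DISTANCE 16

/* Sum an array while hinting that a[i + PREFETCH_DISTANCE] will be
 * needed soon. The prefetch names an address but no destination
 * register, so it only affects performance, never the result. */
double sum_with_prefetch(const double *a, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + PREFETCH_DISTANCE < n) {
            /* Arguments: address, rw (0 = read), locality (0..3). */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
        }
#endif
        total += a[i];
    }
    return total;
}
```

On compilers without the builtin the hint is compiled out and the loop is an ordinary sum. Whether the hint helps at all depends on the hardware, since many processors already prefetch simple sequential streams on their own.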

Level-3 BLAS and LU Factorization on a Matrix Processor

Zekri

Sedukhin

2008

ipsjdc

As increasing clock frequency approaches its physical limits, a good approach to enhance performance is to increase parallelism by integrating more cores as coprocessors to general-purpose processors in order to handle the different workloads in scientific, engineering, and signal processing applications. In this paper, we propose a many-core matrix processor model consisting of a scalar unit augmented with b×b simple cores tightly connected in a 2D torus matrix unit to accelerate matrix-based kernels. Data load/store is overlapped with computing using a decoupled data access unit that moves b×b blocks of data between memory and the two scalar and matrix processing units. The operation of the matrix unit is mainly processing fine-grained b×b matrix multiply-add (MMA) operations. We formulate the data alignment operations including matrix transposition and skewing as MMA operations in order to overlap them with data load/store. Two fundamental linear algebra algorithms are designed and analytically evaluated on the proposed matrix processor: the Level-3 BLAS kernel, GEMM, and the LU factorization with partial pivoting, the main step in solving linear systems of equations. For the GEMM kernel, the maximum speed of computing measured in FLOPs/cycle is approached for different matrix sizes, n, and block sizes, b. The speed of the LU factorization for relatively large values of n ranges from around 50-90% of the maximum speed depending on the model parameters. Overall, the analytical results show the merits of using the matrix unit for accelerating the matrix-based applications.

“…Software prefetching is an effective technique to tolerate memory latency [4]. Software prefetching can be performed through two alternative schemes: binding and nonbinding prefetching.…”

Section: Introduction (mentioning)

confidence: 99%

“…The use of binding and nonbinding prefetching has been previously studied in [13,1] and [4,9,14,18,3], respectively, among others. However, there are very few works analyzing the interactions of these prefetching schemes with software pipelining techniques.…”

Section: Introduction (mentioning)

confidence: 99%
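
The statements above distinguish binding from nonbinding prefetching without defining either. As the terms are commonly used, a binding prefetch is an ordinary load issued early into a register, so the value is fixed at the time of the load, while a nonbinding prefetch is only a cache hint and the value is read later by a normal load. The C sketch below illustrates the binding form under that assumption. The function name `dot_binding_prefetch` is hypothetical, and an optimizing compiler is free to reschedule these loads, so this shows the idea rather than a guaranteed instruction order.

```c
#include <stddef.h>

/* Dot product with a binding prefetch: the operands for iteration
 * i + 1 are loaded into local variables before the arithmetic for
 * iteration i is performed. Because the values are bound at load
 * time, a later store to x[i + 1] or y[i + 1] would not be seen. */
double dot_binding_prefetch(const double *x, const double *y, size_t n)
{
    if (n == 0) {
        return 0.0;
    }
    double total = 0.0;
    double next_x = x[0];   /* prologue: load the first operands */
    double next_y = y[0];
    for (size_t i = 0; i < n; i++) {
        double cur_x = next_x;
        double cur_y = next_y;
        if (i + 1 < n) {
            next_x = x[i + 1];   /* early load for the next iteration */
            next_y = y[i + 1];
        }
        total += cur_x * cur_y;
    }
    return total;
}
```

The early loads occupy registers for a full iteration, which is one reason this style interacts with software pipelining and register allocation.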