Computation–communication overlap and parameter auto-tuning for scalable parallel 3-D FFT

Song, Sukhyun; Hollingsworth, Jeffrey K.

doi:10.1016/j.jocs.2015.12.001

Cited by 11 publications

(2 citation statements)

References 17 publications


“…The progression of nonblocking communications is manually forced by inserting testing points in the overlapping window. More recently, Song et al developed an algorithm for the 3D Fast Fourier Transform using nonblocking MPI collectives [14]. Different parameters, such as the tiling size and the frequency of MPI_Test calls to force the progression, are automatically determined in order to achieve performance.…”

Section: Asynchronous Communications In Scientific Applications (mentioning)

confidence: 99%

Automatic Code Motion to Extend MPI Nonblocking Overlap Window

Nguyen, Saillard, Jaeger, et al. 2020

Lecture Notes in Computer Science


HPC applications rely on a distributed-memory parallel programming model to improve the overall execution time. This leads to spawning multiple processes that need to communicate with each other to make the code progress. But these communications involve overheads caused by network latencies or synchronizations between processes. One possible approach to reduce those overheads is to overlap communications with computations. MPI allows this solution through its nonblocking communication mode: a nonblocking communication is composed of an initialization and a completion call. It is then possible to overlap the communication by inserting computations between these two calls. The use of nonblocking collective calls is however still marginal and adds a new layer of complexity. In this paper we propose an automatic static optimization that (i) transforms blocking MPI communications into their nonblocking counterparts and (ii) performs extensive code motion to increase the size of overlapping intervals between initialization and completion calls. Our method is implemented in LLVM as a compilation pass, and shows promising results on two mini applications.
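The overlap pattern described in this abstract (an initialization call, independent computation, then a completion call) can be sketched as follows. This is a minimal illustration rather than the authors' method: Python's standard-library thread pool stands in for MPI, and the names `exchange`, `compute`, `blocking_version`, and `overlapped_version` are hypothetical. In MPI the analogous pair would be a nonblocking call such as `MPI_Ialltoall` followed by `MPI_Wait`.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def exchange(data):
    """Hypothetical stand-in for a communication step; the sleep mimics network latency."""
    time.sleep(0.2)
    return list(reversed(data))


def compute(n):
    """Local work that does not depend on the data being exchanged."""
    return sum(i * i for i in range(n))


def blocking_version(data, n):
    """Communicate first, then compute: the two costs add up."""
    received = exchange(data)
    local = compute(n)
    return received, local


def overlapped_version(data, n, pool):
    """Start the exchange, compute inside the overlap window, then complete."""
    request = pool.submit(exchange, data)  # initialization call: returns immediately
    local = compute(n)                     # independent work between the two calls
    received = request.result()            # completion call: blocks until the exchange is done
    return received, local


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=1) as pool:
        for label, run in (
            ("blocking", lambda: blocking_version([1, 2, 3], 200_000)),
            ("overlapped", lambda: overlapped_version([1, 2, 3], 200_000, pool)),
        ):
            start = time.perf_counter()
            run()
            print(f"{label}: {time.perf_counter() - start:.3f} s")
```

Both functions return the same result; only the ordering of the waiting changes. The sketch omits one practical detail noted in the citation statement above: a real MPI library may not advance a nonblocking operation on its own, which is why testing points are sometimes inserted inside the overlap window.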


“…In the meantime, technological advances in hardware architectures are nearing exascale speed through co-design architectural designs, abundant General Purpose Graphical Processing Units (GPGPUs), hierarchical clustering of heterogeneous machines, and so forth. Despite the growth seen in the application sector and in the hardware architectural design sector of HPC, the performances of applications, including the […] The HPC community, therefore, oriented their mindset to mitigate the effects of known performance issues of large scale systems such as the dynamic nature of big data in applications (data sizes), heterogeneous hardware architectures [7], energy consumption issues, scalability issues [22,39], uncertainty of resources (including data resources), and so forth [18].…”

Section: S Benedict (mentioning)

confidence: 99%

SCALE-EA: A Scalability Aware Performance Tuning Framework for OpenMP Applications

Benedict 2018

SCPE


Abstract. HPC application developers, including OpenMP-based application developers, have stepped forward to endeavor the future design trends of exa-scale machines, such as, increased number of threads/cores, heterogeneous architectures, multiple levels of memories, and so forth; and, they have initiated procedures to address application level challenges, such as, data-driven scalability issues, energy consumption requirements, data availability needs, and so forth. Despite the existence of manual performance tuning solutions, users still deem it to be an intricate process. This paper proposes a scalability aware autotuning framework (SCALE-EA) that automatically identifies an efficient number of threads for OpenMP parallel regions using a Firefly Algorithm ( […]

1. Introduction. High Performance Computing (HPC) application developments are invariably cropping up among various scientific domains, such as, High Energy Physics (HEP), bioinformatics, eyewear computing, visualizations, electronic automation, graph-based machine learning, and so forth. OpenMP based programming model is indeed reaching out to become a prominent programming model among a sector of HPC application developers owing to the adequate doctrine of standards (OpenMP 4.0 and 4.5), ease of use, controlled programming support, smooth applicability to programmers belonging to various scientific disciplines, and due to the notion of having millions of cores in future exascale machines. However, the realization of efficiently utilizing HPC applications in its present form for future large scale machines requires innovative approaches to mitigate the following possible risky scenarios:

1. the performance of applications becomes more sensitive to data movement, data availability, data provenance, data management policies, and so forth - a future software-cum-hardware computing system must consider the massive storage options of machines, resiliency nature of applications, dynamic computing behavior of applications, and the dynamic nature of the data access patterns of applications (big data).
2. the current implementations of OpenMP applications might not have considered the design aspects of emerging memory models (including data persistence of modern memory architectures), infrastructural improvements, future parallel data structures, and so forth.
3. the scalability of applications might get an impoverished lead as applications are usually not ported and tested for scalable machines.
4. the energy efficiency of applications could exhibit a daunting scenario when executed on machines with varying degrees of parallelism - smaller or larger.
5. the current OpenMP application developers might not have quantified the possible uncertainties that might evolve due to the underlying future parallel software frameworks.

In short, to mitigate these challenges, programmers or developers have to diligently write scalable and energy efficient parallel algorithms by employing the apt scalability features of programming languages and by considering the underlying require...


Route to exascale: Novel mathematical methods, scalable algorithms and Computational Science skills

Alexandrov 2016

Journal of Computational Science

