Practical performance portability in the Parallel Ocean Program (POP)

Jones, Philip W.; Worley, Patrick H; Yoshida, Yasuhiro; White, James Boyd; Levesque, John

doi:10.1002/cpe.894

Cited by 73 publications

(44 citation statements)

References 12 publications

“…For such evaluation, we experimented with (1) a complex parallel I/O benchmark, Flash I/O [1], which is closely modeled after the FLASH astrophysics code, and (2) a production-scale climate simulation application, the Parallel Ocean Program (POP) [7]. More details on these workloads are given below.…”

Section: Results (mentioning)

confidence: 99%

Scalable I/O tracing and analysis

Vijayakumar, Mueller, et al. 2009

Proceedings of the 4th Annual Workshop on Petascale Data Storage


As supercomputer performance approached and then surpassed the petaflop level, I/O performance has become a major performance bottleneck for many scientific applications. Several tools exist to collect I/O traces to assist in the analysis of I/O performance problems. However, these tools either produce extremely large trace files that complicate performance analysis, or sacrifice accuracy to collect high-level statistical information. We propose a multi-level trace generator tool, ScalaIOTrace, that collects traces at several levels in the HPC I/O stack. ScalaIOTrace features aggressive trace compression that generates trace files of near constant size for regular I/O patterns and orders of magnitude smaller for less regular ones. This enables the collection of I/O and communication traces of applications running on thousands of processors. Our contributions also include automated trace analysis to collect selected statistical information of I/O calls by parsing the compressed trace on-the-fly and time-accurate replay of communication events with MPI-IO calls. We evaluated our approach with

“…The model grid (192×128×20) generated internally is an equally-spaced latitude-longitude global grid with idealized land-masses. The x1 benchmark is set up to be identical to the actual production configuration of the Community Climate System Model [41]. The model grid (320×384×40), topography, initial state, equation of state coefficients and other benchmark specifications for x1 are available at the POP website [70].…”

Section: Parallel Ocean Program (mentioning)

confidence: 99%

Tuning parallel applications in parallel

2009


Auto-tuning has recently received significant attention from the High Performance Computing community. Most auto-tuning approaches are specialized to work either on specific domains such as dense linear algebra and stencil computations, or only at certain stages of program execution such as compile time and runtime. Real scientific applications, however, demand a cohesive environment that can efficiently provide auto-tuning solutions at all stages of application development and deployment. Towards that end, we describe a unified end-to-end approach to autotuning scientific applications. Our system, Active Harmony, takes a search-based collaborative approach to auto-tuning. Application programmers, library writers and compilers collaborate to describe and export a set of performance related tunable parameters to the Active Harmony system. These parameters define a tun- […] Active Harmony supports runtime adaptive code-generation and tuning for parameters that require new code (e.g. unroll factors). Effectively, we merge traditional feedback directed optimization and just-in-time compilation. This feature also enables application developers to write applications once and have the autotuner adjust the application behavior automatically when run on new systems. We evaluated our system on multiple large-scale parallel applications and showed that our system can improve the execution time by up to 46% compared to the original version of the program. Finally, we believe that the success of any auto-tuning research depends on how effectively application developers, domain-experts and auto-tuners communicate and work together. To that end, we have developed and released a simple and extensible language that standardizes the parameter space representation. Using this language, developers and researchers can collaborate to export tunable parameters to the tuning frameworks. Relationships (e.g. ordering, dependencies, constraints, ranking) between tunable parameters and search-hints can also be expressed.


“…Jones [11] describes the addition of a more flexible data structure that allows efficient execution of POP on both cache and vector processors. Wang [24] describes code modifications to POP that improve performance on a specific machine architecture.…”

Section: Background and Related Work (mentioning)

confidence: 99%

“…The horizontal dimensions are decomposed into logically rectangular two-dimensional (2D) blocks [11]. The computational mesh is distributed across multiple processors by placing one or more 2D blocks on each processor.…”

Section: Data Structures Within POP (mentioning)

confidence: 99%

Inverse Space-Filling Curve Partitioning of a Global Ocean Model

Dennis

2007

2007 IEEE International Parallel and Distributed Processing Symposium


In this paper, we describe how inverse space-filling curve partitioning is used to increase the simulation rate of a global ocean model. Space-filling curve partitioning allows for the elimination of load imbalance in the computational grid due to land points. Improved load balance combined with code modifications within the conjugate gradient solver significantly increase the simulation rate of the Parallel Ocean Program at high resolution. The simulation rate for a high resolution model nearly doubled from 4.0 to 7.9 simulated years per day on 28,972 IBM Blue Gene/L processors. We also demonstrate that our techniques increase the simulation rate on 7545 Cray XT3 processors from 6.3 to 8.1 simulated years per day. Our results demonstrate how minor code modifications can have significant impact on resulting performance for very large processor counts.
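
The block decomposition and land-point load imbalance mentioned in the statements and abstract above can be illustrated with a minimal sketch. It is not code from POP or from the cited paper. The 40×48 block size, the rectangular land mask, the processor count, and all function names are hypothetical, and a simple round-robin assignment stands in for the space-filling curve partitioning the abstract describes. Only the 320×384 horizontal grid size comes from the x1 benchmark quoted earlier.

```python
def decompose(nx, ny, bx, by):
    """Split an nx-by-ny horizontal grid into logically rectangular blocks.

    Each block is a tuple (i0, i1, j0, j1) of half-open index ranges,
    at most bx points wide and by points tall.
    """
    blocks = []
    for j0 in range(0, ny, by):
        for i0 in range(0, nx, bx):
            blocks.append((i0, min(i0 + bx, nx), j0, min(j0 + by, ny)))
    return blocks


def drop_land_blocks(blocks, is_ocean):
    """Keep only blocks that contain at least one ocean point."""
    return [
        (i0, i1, j0, j1)
        for (i0, i1, j0, j1) in blocks
        if any(is_ocean(i, j) for i in range(i0, i1) for j in range(j0, j1))
    ]


def assign_round_robin(blocks, nprocs):
    """Place one or more blocks on each processor (round-robin stand-in)."""
    owners = [[] for _ in range(nprocs)]
    for k, block in enumerate(blocks):
        owners[k % nprocs].append(block)
    return owners


# Hypothetical land mask: one rectangular "continent" in a corner of the grid.
def is_ocean(i, j):
    return not (i < 80 and j < 96)


blocks = decompose(320, 384, 40, 48)           # 8 x 8 blocks
ocean_blocks = drop_land_blocks(blocks, is_ocean)
owners = assign_round_robin(ocean_blocks, 16)  # 16 hypothetical processors

print(len(blocks))        # 64
print(len(ocean_blocks))  # 60
```

In this toy setup, four of the 64 blocks are entirely land and are never assigned, so no processor spends time on them. Smaller blocks let more land be discarded, at the cost of more block boundaries to exchange.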
