Over the last twenty years, the open source community has provided more and more of the software on which the world's High Performance Computing (HPC) systems depend for performance and productivity. The community has invested millions of dollars and years of effort to build key components. But although the investments in these separate software elements have been tremendously valuable, a great deal of productivity has also been lost because of the lack of planning, coordination, and key integration of technologies necessary to make them work together smoothly and efficiently, both within individual petascale systems and between different systems. It seems clear that this completely uncoordinated development model will not provide the software needed to support the unprecedented parallelism required for peta/exascale computation on millions of cores, or the flexibility required to exploit new hardware models and features, such as transactional memory, speculative execution, and GPUs. This report describes the work of the community to prepare for the challenges of exascale computing, ultimately combining their efforts in a coordinated International Exascale Software Project.
Over the past few years resilience has become a major issue for high-performance computing (HPC) systems, particularly in view of large petascale systems and future exascale systems. These systems will typically gather from half a million to several million central processing unit (CPU) cores running up to a billion threads. From current knowledge and observations of existing large systems, it is anticipated that exascale systems will experience various kinds of faults many times per day. It is also anticipated that the current approach to resilience, which relies on automatic or application-level checkpoint/restart, will not work because the time for checkpointing and restarting will exceed the mean time to failure of a full system. This set of projections leaves the community of fault tolerance for HPC systems with a difficult challenge: finding new approaches, possibly radically disruptive ones, to run applications until their normal termination despite the essentially unstable nature of exascale systems. Yet the community has only five to six years to solve the problem. This white paper synthesizes the motivations, observations, and research issues considered determinant by several complementary HPC experts in applications, programming models, distributed systems, and system management.
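The claim that checkpoint/restart breaks down at exascale is fundamentally an arithmetic one: as the node count grows, full-system mean time to failure (MTTF) shrinks until it is comparable to, or smaller than, the time needed to write a global checkpoint. The following minimal Python sketch illustrates that argument using Young's first-order approximation for the optimal checkpoint interval; the node MTTF, node count, and checkpoint write time are assumed illustrative values, not figures from the white paper.

```python
import math

def system_mttf(node_mttf_s: float, node_count: int) -> float:
    """Full-system MTTF assuming independent, identically distributed node failures."""
    return node_mttf_s / node_count

def optimal_checkpoint_interval(checkpoint_time_s: float, mttf_s: float) -> float:
    """Young's first-order approximation: T_opt = sqrt(2 * C * MTTF)."""
    return math.sqrt(2.0 * checkpoint_time_s * mttf_s)

# Hypothetical exascale-class parameters (assumptions for illustration only):
node_mttf = 5 * 365 * 24 * 3600   # 5 years per node, in seconds
nodes = 100_000                   # number of nodes in the machine
ckpt = 30 * 60                    # 30 minutes to write one global checkpoint

mttf = system_mttf(node_mttf, nodes)                 # roughly 26 minutes
interval = optimal_checkpoint_interval(ckpt, mttf)   # roughly 40 minutes

print(f"Full-system MTTF:            {mttf / 60:.1f} min")
print(f"Checkpoint write time:       {ckpt / 60:.1f} min")
print(f"Young's optimal interval:    {interval / 60:.1f} min")
# Once the checkpoint write time approaches or exceeds the system MTTF, the
# machine spends most of its time saving state or recovering rather than
# computing -- the regime the white paper argues exascale systems will enter.
```

With these assumed numbers, a single checkpoint takes longer than the expected time between full-system failures, which is precisely why the abstract concludes that checkpoint/restart alone cannot carry applications to normal termination at exascale.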
Abstract. The Open Science Grid (OSG) provides a distributed facility where the Consortium members provide guaranteed and opportunistic access to shared computing and storage resources. The OSG project [1] is funded by the National Science Foundation and the Department of Energy Scientific Discovery through Advanced Computing program. The OSG project provides specific activities for the operation and evolution of the common infrastructure. The US ATLAS and US CMS collaborations contribute to and depend on OSG as the US infrastructure contributing to the Worldwide LHC Computing Grid, on which the LHC experiments distribute and analyze their data. Other stakeholders include the STAR RHIC experiment, the Laser Interferometer Gravitational-Wave Observatory (LIGO), the Dark Energy Survey (DES), and several Fermilab Tevatron experiments (CDF, D0, MiniBooNE, etc.). The OSG implementation architecture brings a pragmatic approach to enabling vertically integrated, community-specific distributed systems over a common horizontal set of shared resources and services. More information can be found at the OSG web site: www.opensciencegrid.org.