Akhil Langer scite author profile

The advent of petascale computing has introduced new challenges (e.g., heterogeneity and system failures) for programming scalable parallel applications. The increased complexity and dynamism of today's science and engineering applications have further exacerbated the situation. Addressing these challenges requires more emphasis on concepts that were previously of secondary importance, including migratability, adaptivity, and runtime system introspection. In this paper, we leverage our experience with these concepts to demonstrate their applicability and efficacy for real-world applications. Using the CHARM++ parallel programming framework, we present details on how these concepts can lead to the development of applications that scale irrespective of the rough landscape of supercomputing technology. The empirical evaluation presented in this paper spans many miniapplications and real applications executed on modern supercomputers including Blue Gene/Q, Cray XE6, and Stampede.

Maximizing Throughput of Overprovisioned HPC Data Centers Under a Strict Power Budget

Sarood, Langer, Gupta, et al. 2014

Building future-generation supercomputers while constraining their power consumption is one of the biggest challenges faced by the HPC community. For example, the US Department of Energy has set a goal of 20 MW for an exascale (10^18 flops) supercomputer. To realize this goal, much research is being done to revolutionize hardware design and build power-efficient computers and network interconnects. In this work, we propose a software-based online resource management system that leverages a hardware-facilitated capability to constrain the power consumption of each node in order to optimally allocate power and nodes to a job. Our scheme uses this hardware capability in conjunction with an adaptive runtime system that can dynamically change the resource configuration of a running job, allowing our resource manager to re-optimize allocation decisions for running jobs as new jobs arrive or running jobs terminate. We also propose a performance modeling scheme that estimates the essential power characteristics of a job at any scale. The proposed online resource manager uses these performance characteristics to make scheduling and resource allocation decisions that maximize the job throughput of the supercomputer under a given power budget. We demonstrate the benefits of our approach by using a mix of jobs with different power-response characteristics. We show that with a power budget of 4.75 MW, we can obtain up to a 5.2X improvement in job throughput compared with the power-unaware SLURM scheduling policy. We corroborate our results with real experiments on a relatively small-scale cluster, in which we obtain a 1.7X improvement.
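The idea of jointly allocating nodes and per-node power caps to several jobs can be sketched as a small search problem. The abstract does not give the paper's actual optimization formulation, so the following is only a brute-force illustration under stated assumptions: the function name `allocate`, the option tuples, the use of summed relative speed as a stand-in for throughput, and all numbers are hypothetical.

```python
from itertools import product


def allocate(jobs, total_nodes, power_budget_w):
    """Pick at most one configuration per job to maximize summed relative speed.

    jobs maps a job name to a list of (nodes, watts_per_node, relative_speed)
    options. All three fields are hypothetical stand-ins for a job's profile.
    """
    names = list(jobs)
    # None means "leave this job in the queue for now".
    option_sets = [[None] + list(jobs[name]) for name in names]
    best_value, best_plan = 0.0, {}
    for choice in product(*option_sets):
        picked = {n: c for n, c in zip(names, choice) if c is not None}
        nodes = sum(c[0] for c in picked.values())
        power = sum(c[0] * c[1] for c in picked.values())
        if nodes > total_nodes or power > power_budget_w:
            continue  # violates the node count or the power budget
        value = sum(c[2] for c in picked.values())
        if value > best_value:
            best_value, best_plan = value, picked
    return best_value, best_plan


# Illustrative numbers only (not taken from the paper).
jobs = {
    "A": [(4, 100, 1.0), (6, 60, 1.2)],
    "B": [(4, 100, 1.0), (4, 60, 0.8)],
}
value, plan = allocate(jobs, total_nodes=10, power_budget_w=600)
print(value, plan)
```

In this toy instance, running both jobs at 100 W per node would exceed the 600 W budget, so only one could run; capping each node at 60 W lets both jobs fit. Calling `allocate` again whenever a job arrives or finishes mimics the re-optimization step described in the abstract, although exhaustive enumeration is only practical for a handful of jobs.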

Optimizing power allocation to CPU and memory subsystems in overprovisioned HPC systems

Sarood, Langer, Kalé, et al. 2013

Energy consumption and power draw pose two major challenges to the HPC community for designing larger systems. Present-day HPC systems consume as much as 10 MW of electricity, and this is fast becoming a bottleneck. Although energy bills will increase significantly with machine size, power consumption is a hard constraint that must be addressed. Intel's Running Average Power Limit (RAPL) toolkit is a recent feature that enables power capping of CPU and memory subsystems on modern hardware. In this paper, we use RAPL to evaluate the possibility of improving the execution time of an application by capping power while adding more nodes. We profile the strong scaling of an application using different power caps for both CPU and memory subsystems. Our proposed interpolation scheme uses an application profile to optimize the number of nodes and the distribution of power between CPU and memory subsystems to minimize execution time under a strict power budget. We validate these estimates by running experiments on a 20-node (120 cores) Sandy Bridge cluster. Our experimental results closely match the model estimates and show speedups greater than 1.47X for all applications compared to not capping CPU and memory power. We demonstrate that the quality of the solution that our interpolation scheme provides closely matches the results obtained via exhaustive profiling.
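The trade-off described here, fewer nodes at full power versus more nodes under tighter CPU and memory caps, can be illustrated with a short search over candidate configurations. This is a sketch under stated assumptions, not the paper's interpolation scheme: `best_configuration` enumerates every candidate exhaustively, and `toy_time_model` is a made-up stand-in for a measured application profile.

```python
from itertools import product
from math import sqrt


def best_configuration(time_model, node_counts, cpu_caps_w, mem_caps_w, power_budget_w):
    """Return (time, nodes, cpu_cap, mem_cap) with the lowest estimated time,
    or None if no candidate fits within the power budget."""
    best = None
    for n, cpu, mem in product(node_counts, cpu_caps_w, mem_caps_w):
        if n * (cpu + mem) > power_budget_w:
            continue  # total capped power would exceed the budget
        t = time_model(n, cpu, mem)
        if best is None or t < best[0]:
            best = (t, n, cpu, mem)
    return best


def toy_time_model(nodes, cpu_w, mem_w):
    """Hypothetical profile: per-node speed falls off sub-linearly as caps drop.
    The constants (50 W CPU peak, 20 W memory peak) are illustrative only."""
    speed = sqrt(cpu_w / 50.0) * sqrt(mem_w / 20.0)
    return 1000.0 / (nodes * speed)


result = best_configuration(
    toy_time_model,
    node_counts=range(1, 21),
    cpu_caps_w=(30, 40, 50),
    mem_caps_w=(10, 15, 20),
    power_budget_w=700,
)
print(result)
```

With this toy model and a 700 W budget, the search prefers 14 nodes with the CPU capped at 30 W over 10 nodes at full power, because the per-node slowdown from capping is smaller than the gain from the extra nodes. Whether a real application behaves this way depends on its measured profile.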

Energy-efficient computing for HPC workloads on heterogeneous manycore chips

Langer, Totoni, Palekar, et al. 2015

Power and energy efficiency is one of the major challenges to achieving exascale computing in the next several years. While chips operating at low voltages have been shown to be highly energy-efficient, low-voltage operation leads to heterogeneity across cores within the microprocessor chip. In this work, we study chips with low-voltage operation and discuss programming systems and performance modeling in the presence of heterogeneity. We propose an integer linear programming based approach for selecting the optimal configuration of a chip that minimizes its energy consumption. We obtain average savings of 26% and 10.7% in the energy consumption of the chip for two HPC mini-applications, miniMD and Jacobi, respectively. We also evaluate the energy savings under execution time constraints using the proposed approach. These energy savings are significantly larger than the savings from sub-optimal configurations obtained from heuristics.

Scalable Algorithms for Distributed-Memory Adaptive Mesh Refinement

Langer, Lifflander, Miller, et al. 2012

This paper presents scalable algorithms and data structures for adaptive mesh refinement computations. We describe a novel mesh restructuring algorithm for adaptive mesh refinement computations that uses a constant number of collectives regardless of the refinement depth. To further increase scalability, we describe a localized hierarchical coordinate-based block indexing scheme in contrast to traditional linear numbering schemes, which incur unnecessary synchronization. In contrast to existing approaches, which take O(P) time and storage per process, our approach takes only constant time and has a very small memory footprint. With these optimizations as well as an efficient mapping scheme, our algorithm is scalable and suitable for large, highly refined meshes. We present strong-scaling experiments up to 2k ranks on Cray XK6 and 32k ranks on IBM Blue Gene/Q.
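A hierarchical coordinate-based block index can be sketched for a 2D quadtree mesh as follows. The abstract does not specify the paper's exact encoding, so the `BlockIndex` type, its fields, and its methods are hypothetical; the sketch only illustrates how parent, child, and neighbor identifiers can be derived locally from a block's own coordinates.

```python
from typing import List, NamedTuple, Optional


class BlockIndex(NamedTuple):
    """Hypothetical 2D block identifier: refinement depth plus integer
    coordinates within the 2**depth by 2**depth grid at that depth."""
    depth: int
    x: int
    y: int

    def parent(self) -> "BlockIndex":
        if self.depth == 0:
            raise ValueError("the root block has no parent")
        # Dropping the lowest coordinate bit moves one level up the tree.
        return BlockIndex(self.depth - 1, self.x >> 1, self.y >> 1)

    def children(self) -> List["BlockIndex"]:
        # Refining a block appends one bit to each coordinate.
        return [
            BlockIndex(self.depth + 1, 2 * self.x + dx, 2 * self.y + dy)
            for dy in (0, 1)
            for dx in (0, 1)
        ]

    def neighbor(self, dx: int, dy: int) -> Optional["BlockIndex"]:
        """Same-depth neighbor at offset (dx, dy), or None outside the domain."""
        side = 1 << self.depth
        nx, ny = self.x + dx, self.y + dy
        if not (0 <= nx < side and 0 <= ny < side):
            return None
        return BlockIndex(self.depth, nx, ny)


block = BlockIndex(depth=2, x=3, y=1)
print(block.parent())        # index of the coarser block containing this one
print(block.children())      # the four finer blocks created by refinement
print(block.neighbor(-1, 0)) # the block immediately to the left
```

Because every identifier follows from shifts and additions on the block's own coordinates, refining or coarsening one block does not require renumbering any other block, which is the property that a linear numbering scheme lacks.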

Parallel Programming with Migratable Objects: Charm++ in Practice
