This paper describes a building blocks approach to the design of scientific workflow systems. We discuss RADICAL-Cybertools as one implementation of the building blocks concept, showing how they are designed and developed in accordance with this approach. This paper offers three main contributions: (i) showing the relevance of the design principles underlying the building blocks approach to supporting scientific workflows on high-performance computing platforms; (ii) illustrating a set of building blocks that enable multiple points of integration, "unifying" conceptual reasoning across otherwise very different tools and systems; and (iii) case studies discussing how RADICAL-Cybertools are integrated with existing workflow, workload, and general-purpose computing systems and used to develop domain-specific workflow systems.
Abstract-The analysis of biomolecular computer simulations has become a challenge because the amount of output data is now routinely in the terabyte range. We evaluated whether this challenge can be met by a parallel map-reduce approach, using the Dask library for task-graph based parallel computing coupled with our MDAnalysis Python library for the analysis of molecular dynamics (MD) simulations. We performed a representative performance evaluation that accounted for the highly heterogeneous computing environments researchers typically work in, together with the diversity of existing file formats for MD trajectory data. We found that the underlying storage system (solid-state drives, parallel file systems, or simple spinning-platter disks) can be a deciding performance factor that leads to data ingestion becoming the primary bottleneck in the analysis workflow. However, the choice of data file format can mitigate the effect of the storage system; in particular, the commonly used Gromacs XTC trajectory format, which is highly compressed, can exhibit nearly ideal strong scaling because it trades a decrease in global storage access load for an increase in local, per-core CPU-intensive decompression. Scaling was tested on single and multiple nodes of national and local supercomputing resources as well as on typical workstations. Although very good strong scaling could be achieved on single nodes, good scaling across multiple nodes was hindered by the persistent occurrence of "stragglers", tasks that take much longer than all other tasks and whose ultimate cause could not be completely ascertained. In summary, we show that, thanks to the focus on high interoperability in the scientific Python ecosystem, it is straightforward to implement map-reduce with Dask in MDAnalysis, and we provide an in-depth analysis of the considerations required to obtain good parallel performance on HPC resources.
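To illustrate the map-reduce pattern described above, the following minimal sketch splits a trajectory into contiguous blocks and analyzes each block as an independent Dask task; the file names and the per-frame radius-of-gyration observable are placeholders for illustration, not the paper's benchmark code:

```python
import numpy as np
import dask
import MDAnalysis as mda

TOP, TRAJ = "system.psf", "trajectory.xtc"   # placeholder file names

def analyze_block(start, stop):
    """Map step: each task opens its own Universe and processes one
    contiguous block of frames, so no state is shared between tasks."""
    u = mda.Universe(TOP, TRAJ)
    protein = u.select_atoms("protein")
    return [protein.radius_of_gyration() for ts in u.trajectory[start:stop]]

u = mda.Universe(TOP, TRAJ)
n_frames, n_blocks = len(u.trajectory), 4
bounds = np.linspace(0, n_frames, n_blocks + 1, dtype=int)

# Build one delayed task per block; nothing is computed yet.
tasks = [dask.delayed(analyze_block)(b, e)
         for b, e in zip(bounds[:-1], bounds[1:])]

# Reduce step: run the task graph and concatenate per-block results in order.
blocks = dask.compute(*tasks)
rgyr = np.concatenate(blocks)
```

With the default Dask scheduler this runs on the cores of one machine; pointing Dask at a distributed scheduler spreads the same task graph across multiple nodes, which is where the storage and straggler effects discussed above come into play.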
Different parallel frameworks for implementing data analysis applications have been proposed by the HPC and Big Data communities. In this paper, we investigate three task-parallel frameworks, Spark, Dask, and RADICAL-Pilot, with respect to their ability to support data analytics on HPC resources, and compare them to MPI. We investigate the data analysis requirements of Molecular Dynamics (MD) simulations, which are significant consumers of supercomputing cycles and produce immense amounts of data: a typical large-scale MD simulation of a physical system of O(100k) atoms over microseconds can produce from O(10) GB to O(1000) GB of data. We propose and evaluate different approaches to the parallelization of a representative set of MD trajectory analysis algorithms, in particular the computation of path similarity and leaflet identification. We evaluate Spark, Dask, and RADICAL-Pilot with respect to the abstractions and runtime engine capabilities with which they support these algorithms. We provide a conceptual basis for comparing and understanding the different frameworks, which enables users to select the optimal system for each application, as well as a quantitative performance analysis of the different algorithms across the three frameworks.
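Path similarity, for instance, decomposes naturally into a bag of independent pairwise tasks, which is the structure all three frameworks must schedule. The sketch below distributes pairwise Hausdorff distances with Dask, using MDAnalysis's psa module for the metric; the random arrays are synthetic stand-ins for real paths, and the block is an illustration rather than the paper's implementation:

```python
import itertools
import numpy as np
import dask
from MDAnalysis.analysis.psa import hausdorff

# Synthetic stand-ins for real paths: each path is (n_frames, n_atoms, 3).
rng = np.random.default_rng(42)
paths = [rng.random((100, 50, 3)) for _ in range(6)]

@dask.delayed
def pair_distance(i, j):
    # Each pairwise Hausdorff distance is an independent task.
    return i, j, hausdorff(paths[i], paths[j])

tasks = [pair_distance(i, j)
         for i, j in itertools.combinations(range(len(paths)), 2)]

# Run all tasks and assemble the symmetric distance matrix.
D = np.zeros((len(paths), len(paths)))
for i, j, d in dask.compute(*tasks):
    D[i, j] = D[j, i] = d
```

The same all-pairs task graph could be expressed as an RDD of index pairs in Spark or as a set of tasks submitted to a RADICAL-Pilot agent; the frameworks differ in how the graph is scheduled and how data reaches the workers, not in the decomposition itself.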
Abstract-High-performance computing platforms such as "supercomputers" have traditionally been designed to meet the compute demands of scientific applications. Consequently, they have been architected as net producers, not consumers, of data. The Apache Hadoop ecosystem has evolved to meet the requirements of data processing applications and has addressed many of the traditional limitations of HPC platforms. There exists, however, a class of scientific applications that needs the collective capabilities of traditional high-performance computing environments and the Apache Hadoop ecosystem. For example, the scientific domains of biomolecular dynamics, genomics, and network science need to couple traditional computing with Hadoop/Spark-based analysis. We investigate the critical question of how to present the capabilities of both computing environments to such scientific applications. While this question needs answers at multiple levels, we focus on the design of resource management middleware that can support the needs of both. We propose extensions to the Pilot-Abstraction to provide a unifying resource management layer. This is an important step towards the interoperable use of HPC and Hadoop/Spark, and it also allows applications to integrate HPC stages (e.g., simulations) with data analytics. Many supercomputing centers have started to officially support Hadoop environments, either dedicated or in hybrid deployments using tools such as myHadoop. This typically involves many intrinsic, environment-specific details that must be mastered and that often swamp conceptual issues such as: How best to couple HPC and Hadoop application stages? How to explore runtime trade-offs (data locality vs. data movement)? This paper provides both conceptual understanding and practical solutions for the integrated use of HPC and Hadoop environments. Our experiments are performed on state-of-the-art production HPC environments and provide middleware for multiple domain sciences.
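To make the Pilot-Abstraction concrete: a pilot acquires a resource allocation once through the batch system, and application tasks are then scheduled into that allocation without further queue waits. The following minimal sketch uses a recent RADICAL-Pilot API (class names differ in older releases, and the Hadoop/Spark extensions proposed above are not shown); the resource label and executable are placeholders:

```python
import radical.pilot as rp

session = rp.Session()
try:
    pmgr = rp.PilotManager(session=session)
    tmgr = rp.TaskManager(session=session)

    # The pilot acquires a resource slot once; 'local.localhost' is a
    # placeholder for a real HPC resource label.
    pd = rp.PilotDescription({'resource': 'local.localhost',
                              'cores'   : 4,
                              'runtime' : 10})    # minutes
    pilot = pmgr.submit_pilots(pd)
    tmgr.add_pilots(pilot)

    # Many small tasks are then scheduled into the pilot without
    # touching the machine's batch queue again.
    tds = [rp.TaskDescription({'executable': '/bin/date'})
           for _ in range(8)]
    tmgr.submit_tasks(tds)
    tmgr.wait_tasks()
finally:
    session.close()
```

The proposed extensions would let an analogous description stand up a Hadoop or Spark runtime inside the pilot's allocation, so simulation and analytics stages share one resource management layer.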
MDAnalysis is an object-oriented Python library to analyze trajectories from molecular dynamics (MD) simulations in many popular formats. With the development of highly optimized MD software packages on high-performance computing (HPC) resources, simulation trajectories are growing to many terabytes in size. However, efficient use of multi-core architectures is a challenge for MDAnalysis, which does not yet provide a standard interface for parallel analysis. To address this challenge, we developed PMDA, a Python library that builds upon MDAnalysis to provide parallel analysis algorithms. PMDA parallelizes common analysis algorithms in MDAnalysis through a task-based approach with the Dask library. We implement a simple split-apply-combine scheme for parallel trajectory analysis: the trajectory is split into blocks, analysis is performed separately and in parallel on each block ("apply"), and the results from each block are then gathered and combined. PMDA allows one to perform parallel trajectory analysis with pre-defined analysis tasks. In addition, it provides a common interface that makes it easy to create user-defined parallel analysis modules. PMDA supports all schedulers in Dask, so one can run analyses in a distributed fashion on HPC machines, ad hoc clusters, a single multi-core workstation, or a laptop. We tested the performance of PMDA on single and multiple nodes of a national supercomputer. The results show that parallelization improves the performance of trajectory analysis and, depending on the analysis task, can cut the time to solution from hours to minutes. Although still in the alpha stage, PMDA is already used on resources ranging from multi-core laptops to XSEDE supercomputers to speed up the analysis of molecular dynamics trajectories. PMDA is available as open source under the GNU General Public License, version 2, and can be easily installed via the pip and conda package managers.
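As a usage sketch of a pre-defined parallel analysis task, the fragment below runs PMDA's parallel RMSD; the file names and atom selections are placeholders, and exact call signatures may differ across PMDA versions:

```python
import MDAnalysis as mda
from pmda import rms

u   = mda.Universe("topology.pdb", "trajectory.xtc")   # placeholder files
ref = mda.Universe("topology.pdb")                     # reference structure

# Split-apply-combine under the hood: the trajectory is divided into
# blocks, each block is analyzed by one Dask worker, and the per-block
# results are combined into a single result array.
rmsd_ana = rms.RMSD(u.select_atoms("backbone"),
                    ref.select_atoms("backbone")).run(n_jobs=4)

print(rmsd_ana.rmsd)   # per-frame results, following MDAnalysis conventions
```

Because PMDA delegates scheduling to Dask, the same script can run on a laptop with the local scheduler or scale out to an HPC cluster by connecting to a dask.distributed scheduler.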