Modern software-based services are implemented as distributed systems with
complex behavior and failure modes. Many large tech organizations are using
experimentation to verify the reliability of such systems. We use the term
"Chaos Engineering" to refer to this approach, and discuss the underlying
principles and how to use it to run experiments.
Computational scientists developing software for HPC systems face unique software engineering (SE) issues. Attempts to transfer SE technologies to this domain must take these issues into account.
Current cloud computing infrastructure typically assumes a homogeneous collection of commodity hardware, with details about hardware variation intentionally hidden from users. In this paper, we present our approach to extending the traditional notions of cloud computing to provide a cloud-based access model to clusters that contain heterogeneous architectures and accelerators. We describe our ongoing work extending the OpenStack cloud computing stack to support heterogeneous architectures and accelerators, and our experiences running OpenStack on our local heterogeneous cluster testbed.