We describe a high-throughput computing system for running jobs on public and private computing clouds using the HTCondor job scheduler and the cloudscheduler VM provisioning service. The distributed cloud computing system is designed to simultaneously use dedicated and opportunistic cloud resources at local and remote locations. It has been used for large-scale production particle physics workloads for many years, using thousands of cores on three continents. A decade after its initial design and implementation, cloudscheduler has been modernized to take advantage of new software designs, improved operating system capabilities, and support packages. The updated cloudscheduler is more resilient and scalable, with expanded capabilities. We present an overview of the original design and then describe the new version of the distributed cloud computing system. We conclude with a review of the current status and future plans.
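To make the provisioning model concrete, the following is a minimal sketch of a cloudscheduler-style loop, assuming the htcondor Python bindings are available; boot_worker() and the cloud names are hypothetical placeholders, not cloudscheduler's actual implementation.

```python
# Minimal sketch of a cloudscheduler-style provisioning loop (illustrative only).
# Assumes the htcondor Python bindings; boot_worker() is a hypothetical
# stand-in for cloud-specific VM lifecycle calls.
import time
import htcondor

def idle_job_count(schedd: htcondor.Schedd) -> int:
    # JobStatus == 1 means "Idle" in HTCondor's job ClassAds.
    return len(schedd.query(constraint="JobStatus == 1", projection=["ClusterId"]))

def boot_worker(cloud: str) -> None:
    # Placeholder: a real system would call the cloud API (OpenStack, EC2,
    # GCE, ...) to start a VM that joins the HTCondor pool as a worker.
    print(f"booting one worker VM on {cloud}")

def provisioning_loop(clouds: list[str], poll_seconds: int = 60) -> None:
    schedd = htcondor.Schedd()
    while True:
        # Naive policy: one new VM per idle job, round-robin over clouds.
        for i in range(idle_job_count(schedd)):
            boot_worker(clouds[i % len(clouds)])
        time.sleep(poll_seconds)

if __name__ == "__main__":
    provisioning_loop(["openstack-local", "ec2-us-west"])  # hypothetical clouds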
Input data for applications that run in cloud computing centres can be stored at distant repositories, often with multiple copies of popular data stored at many sites. Locating and retrieving remote data can be challenging, and we believe that federating the storage can address this problem. A federation locates the closest copy of the data on the basis of GeoIP information. We currently use the dynamic data federation Dynafed, a software solution developed by CERN IT. Dynafed supports several industry-standard connection protocols, including Amazon's S3 and Microsoft's Azure as well as WebDAV and HTTP. Dynafed functions as an abstraction layer that hides protocol-dependent authentication details from the user, who needs to provide only an X509 certificate. We have set up an instance of Dynafed and integrated it into the ATLAS data distribution management system. We report on the challenges faced during the installation and integration. We have tested ATLAS analysis jobs submitted by the PanDA production system, and we report on our first experiences with its operation.
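As an illustration of this access model, the sketch below reads a file through a hypothetical Dynafed endpoint over HTTPS with an X509 proxy credential; the URL and file paths are assumptions, not a real federation.

```python
# Sketch: reading a file through a Dynafed endpoint over HTTPS/WebDAV using an
# X509 proxy credential. Endpoint URL and paths below are illustrative.
import requests

DYNAFED_URL = "https://dynafed.example.org/fed/atlas/datafile.root"  # hypothetical
PROXY = "/tmp/x509up_u1000"  # proxy file containing both certificate and key

# Dynafed redirects the client to the closest replica (chosen via GeoIP);
# requests follows the redirect transparently.
resp = requests.get(DYNAFED_URL, cert=PROXY,
                    verify="/etc/grid-security/certificates")
resp.raise_for_status()
print(f"read {len(resp.content)} bytes from {resp.url}")
```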
The dynamic data federation software Dynafed, developed by CERN IT, provides a federated storage cluster on demand using the HTTP protocol with WebDAV extensions. Traditional storage sites which support an experiment can be added to Dynafed without requiring any changes at the site. Dynafed also supports direct access to cloud storage such as S3 and Azure. We report on the use of Dynafed to support Belle-II production jobs running on a distributed cloud system utilizing clouds across North America. Cloudscheduler, developed by the University of Victoria HEP Research Computing group, federates OpenStack, OpenNebula, Amazon, Google, and Microsoft cloud compute resources and provides them as a unified grid site which on average runs about 3500 Belle-II production jobs in parallel. The input data for those jobs is accessible through a single endpoint, our Dynafed instance, which unifies storage resources provided by Amazon S3, Ceph, and MinIO object stores as endpoints, as well as storage provided by traditional DPM and dCache sites. We report on our long-term experience with this setup and the implementation of a grid-mapfile based X509 authentication/authorization for Belle-II access, and we show how a federated cluster can be used by Belle-II through a single endpoint.
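A brief sketch of the grid-mapfile idea mentioned above: an X509 subject DN is mapped to a local account, following the standard Globus grid-mapfile format. The DNs and account names here are made up for illustration.

```python
# Sketch of grid-mapfile based authorization: map an X509 subject DN to a
# local account. The file format follows the standard Globus grid-mapfile
# convention, e.g.:
#   "/DC=org/DC=example/OU=belle2/CN=Jane Doe" belle2user
import shlex

def load_gridmap(path: str) -> dict[str, str]:
    mapping = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            # shlex handles the quoted DN, which contains spaces.
            dn, user = shlex.split(line)
            mapping[dn] = user
    return mapping

def authorize(dn: str, gridmap: dict[str, str]) -> str | None:
    # Return the local account for this DN, or None if access is denied.
    return gridmap.get(dn)
```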
The Simulation at Point1 (Sim@P1) project was established in 2013 to take advantage of the Trigger and Data Acquisition High Level Trigger (HLT) farm of the ATLAS experiment at the LHC. The HLT farm is a significant compute resource, which is critical to ATLAS during data taking. This large compute resource is used to generate and process simulation data for the experiment when ATLAS is not recording data. The Sim@P1 system uses virtual machines, deployed by OpenStack, in order to isolate the resources from the ATLAS technical and control network. During the upcoming long shutdown in 2019 (LS2), the HLT farm, including the Sim@P1 infrastructure, will be upgraded. A previous paper on the project emphasized the need for "simple, reliable, and efficient tools" to quickly switch between data acquisition operation and offline processing. In this contribution we assess various options for updating and simplifying the provisioning tools. Cloudscheduler is a tool for provisioning cloud resources for batch computing that has been managing cloud resources in HEP offline computing since 2012. We present the argument for choosing Cloudscheduler and describe technical details regarding optimal utilization of the Sim@P1 resources.
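As one possible shape for such a switch, the sketch below uses the openstacksdk to boot simulation workers when the detector is idle and delete them before data taking resumes; the cloud name, image, flavor, and network IDs are placeholders, and this is not the actual Sim@P1 tooling.

```python
# Illustrative sketch of switching an OpenStack-managed farm between data
# taking and offline processing. All IDs and the clouds.yaml entry name are
# placeholders.
import openstack

SIM_PREFIX = "simatp1-worker-"

def start_simulation(conn, n_vms: int, image_id: str, flavor_id: str, net_id: str):
    # Boot n_vms simulation workers on the farm.
    for i in range(n_vms):
        conn.compute.create_server(
            name=f"{SIM_PREFIX}{i:03d}",
            image_id=image_id,
            flavor_id=flavor_id,
            networks=[{"uuid": net_id}],
        )

def stop_simulation(conn):
    # Delete every simulation worker so the HLT farm is free for data taking.
    for server in conn.compute.servers():
        if server.name.startswith(SIM_PREFIX):
            conn.compute.delete_server(server)

if __name__ == "__main__":
    conn = openstack.connect(cloud="sim-at-p1")  # placeholder clouds.yaml entry
    stop_simulation(conn)
```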