Efforts in distributed computing of the CMS experiment at the LHC at CERN are now focusing on the functionality required to fulfill the projected needs for the HL-LHC era. Cloud and HPC resources are expected to be dominant relative to resources provided by traditional Grid sites, being also much more diverse and heterogeneous. Handling their special capabilities or limitations and maintaining global flexibility and efficiency, while also operating at scales much higher than the current capacity, are the major challenges being addressed by the CMS Submission Infrastructure team. These proceedings discuss the risks to the stability and scalability of the CMS HTCondor infrastructure extrapolated to such a scenario, thought to be derived mostly from its growing complexity, with multiple Negotiators and schedulers flocking work to multiple federated pools. New mechanisms for enhanced customization and control over resource allocation and usage, mandatory in this future scenario, are also described.
The CMS Submission Infrastructure Global Pool, built on Glidein-WMS andHTCondor, is a worldwide distributed dynamic pool responsible for the allocation of resources for all CMS computing workloads. Matching the continuously increasing demand for computing resources by CMS requires the anticipated assessment of its scalability limitations. In addition, the Global Plmust be able to expand in a more heterogeneous environment, in terms of resource provisioning (combining Grid, HPC and Cloud) and workload submissi.A dedicated testbed has been set up to simulate such conditions with the purpose of finding potential bottlenecks in the software or its configuration. This report provides a thorough description of the various scalabilitydimensions in size and complexity that are being explored for the future Global Pool, along with the analysis and solutions to the limitations proposed with the support of the GlideinWMS and HTCondor developer teams.
Hundreds of physicists analyze data collected by the Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider using the CMS Remote Analysis Builder and the CMS global pool to exploit the resources of the Worldwide LHC Computing Grid. Efficient use of such an extensive and expensive resource is crucial. At the same time, the CMS collaboration is committed to minimizing time to insight for every scientist, by pushing for fewer possible access restrictions to the full data sample and supports the free choice of applications to run on the computing resources. Supporting such variety of workflows while preserving efficient resource usage poses special challenges. In this paper we report on three complementary approaches adopted in CMS to improve the scheduling efficiency of user analysis jobs: automatic job splitting, automated run time estimates and automated site selection for jobs.
Scheduling multi-core workflows in a global HTCondor pool is a multi-dimensional problem whose solution depends on the requirements of the job payloads, the characteristics of available resources, and the boundary conditions such as fair share and prioritization imposed on the job matching to resources. Within the context of a dedicated task force, CMS has increased significantly the scheduling efficiency of workflows in reusable multi-core pilots by various improvements to the limitations of the GlideinWMS pilots, accuracy of resource requests, efficiency and speed of the HTCondor infrastructure, and job matching algorithms.
The CMS experiment has an HTCondor Global Pool, composed of more than 200K CPU cores available for Monte Carlo production and the analysis of da.The submission of user jobs to this pool is handled by either CRAB, the standard workflow management tool used by CMS users to submit analysis jobs requiring event processing of large amounts of data, or by CMS Connect, a service focused on final stage condor-like analysis jobs and applications that already have a workflow job manager in place. The latest scenario canbring cases in which workflows need further adjustments in order to efficiently work in a globally distributed pool of resources. For instance, the generation of matrix elements for high energy physics processes via Madgraph5_aMC@NLO and the usage of tools not (yet) fully supported by the CMS software, such as Ten-sorFlow with GPUsupport, are tasks with particular requirements. A special adaption, either at the pool factory level (advertising GPU resources) or at the execute level (e.g: to handle special parameters that describe certain needs for the remote execute nodes during submission) is needed in order to adequately work in the CMS global pool. This contribution describes the challenges and efforts performed towards adaptingsuch workflows so they can properly profit from the Global Pool via CMS Connect.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.