2021
DOI: 10.1051/epjconf/202125102055

Reaching new peaks for the future of the CMS HTCondor Global Pool

Abstract: The CMS experiment at CERN employs a distributed computing infrastructure to satisfy its data processing and simulation needs. The CMS Submission Infrastructure team manages a dynamic HTCondor pool, aggregating mainly Grid clusters worldwide, but also HPC, Cloud and opportunistic resources. This CMS Global Pool, which currently involves over 70 computing sites worldwide and peaks at 350k CPU cores, is employed to successfully manage the simultaneous execution of up to 150k tasks. While the present infrastructu…

Cited by 2 publications (3 citation statements)
References 10 publications
“…A successful scheduling is achieved by simultaneously ensuring that all available resources are efficiently used [379], a fair share of resources between users is reached, and the completion of CMS tasks follows their prioritization, minimizing job failures and manual intervention. While the SI typically manages 100k to 150k simultaneously executing tasks, recent scalability tests [380] have demonstrated the capacity of the infrastructure to sustain in excess of half a million concurrently running jobs.…”
Section: Central Processing and Production
Citation type: mentioning
confidence: 99%
“…The SI team is continuously working on detecting and solving scaling bottlenecks to the infrastructure, with the support of the HTCondor and GlideinWMS developers teams. As a consequence of the accumulated experience, our setup currently includes multiple "nonstandard" customized settings, most of them aimed at avoiding the saturation of the main collector service of the pool, as described in previous reports [24]. Some of these specialized settings are: (a) Our HTCondor Connection Broker (CCB) service is running on a separate host to the CM, and configured with an enlarged pool of available connection sockets.…”
Section: Pushing the Limits of the CMS Global Pool
Citation type: mentioning
confidence: 99%
“…The main goal for the Spring 2023 tests of the CMS SI was to assess the potential scalability of our Global Pool, considering the following updates since our latest tests (2021, see [24]). These include the evolution in HTCondor software (tested version 10.0. physical host for the CM processes (AMD EPYC 7302 at 3 GHz), the adoption of token-based authentication for HTCondor services in our SI [26], and the incrementally improved configuration of our infrastructure.…”
Section: The SI 2023 Scale Tests
Citation type: mentioning
confidence: 99%