Operational Intelligence for Distributed Computing Systems for Exascale Science

Girolamo, A. Di; Legger, F.; Paparrigopoulos, Panos; Klimentov, A.; Schovancová, J.; Kuznetsov, Valentin; Lassnig, M.; Clissa, Luca; Rinaldi, L.; Sharma, Mayank Mohan; Bakhshiansohi, H.; Zvada, M.; Bonacorsi, D.; Tisbeni, Simone Rossi; Giommi, L.; Decker, Leticia; Diotalevi, T.; Grigorieva, Maria; Padolski, S.

doi:10.1051/epjconf/202024503017

Cited by 5 publications

(3 citation statements)

References 4 publications


“…For this reason, several communities involved in the Worldwide LHC Computing Grid have started a project named Operational Intelligence 2 that aims at increasing the level of automation in computing operations, thus reducing human interventions. As a result of the joint effort, several strategies have already been proposed to support operational workflows in various ways [16][17][18][26]. Some works address anomaly detection by leveraging overall workloads, e.g.…”

Section: Related Work (mentioning)

confidence: 99%

Analyzing WLCG File Transfer Errors Through Machine Learning

Clissa, Lassnig, Rinaldi

2022

Comput Softw Big Sci


The increasingly growing scale of modern computing infrastructures solicits more ingenious and automatic solutions to their management. Our work focuses on file transfer failures within the Worldwide Large Hadron Collider Computing Grid and proposes a pipeline to support distributed data management operations by suggesting potential issues to investigate. Specifically, we adopt an unsupervised learning approach leveraging Natural Language Processing and Machine Learning tools to automatically parse error messages and group similar failures. The results are presented in the form of a summary table containing the most common textual patterns and time evolution charts. This approach has two main advantages. First, the joint elaboration of the error string and the transfer’s source/destination enables more informative and compact troubleshooting, as opposed to inspecting each site and checking unique messages separately. As a by-product, this also reduces the number of errors to check by some orders of magnitude (from unique error strings to unique categories or patterns). Second, the time evolution plots allow operators to immediately filter out secondary issues (e.g. transient or in resolution) and focus on the most serious problems first (e.g. escalating failures). As a preliminary assessment, we compare our results with the Global Grid User Support ticketing system, showing that most of our suggestions are indeed real issues (direct association), while being able to cover 89% of reported incidents (inverse relationship).
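The grouping idea in this abstract (collapsing many unique error strings into a few patterns per source/destination pair) can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract describes unsupervised NLP and Machine Learning tools, whereas the sketch below swaps in plain regular-expression masking as a stand-in. The masking rules, the function names `normalize` and `summarize`, and the site names in the usage note are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical masking rules: replace the variable parts of a message
# (URLs, IP addresses, numbers) with placeholders so that messages
# differing only in those details collapse to the same pattern.
_MASKS = [
    (re.compile(r"[a-z][a-z0-9+.-]*://\S+"), "<URL>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def normalize(message: str) -> str:
    """Lowercase a raw error message and mask its variable tokens."""
    text = message.lower()
    for pattern, placeholder in _MASKS:
        text = pattern.sub(placeholder, text)
    # Collapse runs of whitespace so spacing differences do not matter.
    return " ".join(text.split())


def summarize(failures):
    """Count failures per (source, destination, pattern).

    `failures` is an iterable of (source, destination, message) tuples.
    Returns a list of ((source, destination, pattern), count) pairs,
    most frequent first.
    """
    counts = Counter((src, dst, normalize(msg)) for src, dst, msg in failures)
    return counts.most_common()
```

For example, two timeouts between the same pair of sites that differ only in the reported number of seconds end up in a single row of the summary, which is the kind of reduction from unique strings to unique patterns the abstract refers to.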


“…Since the Tier-1 is an infrastructure dedicated to physics experiments (Di Girolamo et al, 2020), it is necessary to optimize the resources used to keep the system operational. To that end, one possible approach is to identify which log segments have processing priority, based on a higher probability of finding information useful for system maintenance.…”

Section: Preliminaries (unclassified)

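Detecção de Anomalias em Logs para Manutenção Preditiva baseada em Sistema Fuzzy Evolutivo Fracamente Supervisionado [Log Anomaly Detection for Predictive Maintenance Based on a Weakly Supervised Evolving Fuzzy System]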

Decker, Leite

2020

Anais Do Congresso Brasileiro De Automática 2020


Detecting anomalous system behavior is crucial for predictive maintenance and data security in computing centers. A computing center is any computer network that allows users to share data and computational resources. In general, logs are unstructured data (files) produced by non-stationary stochastic processes. We propose a real-time computational intelligence approach to monitor and classify system behavior based on logs, using a sliding-window scheme together with a statistical control chart to find structure and classify logs according to degrees of anomaly. The classification results are improved by the eGFC (evolving Gaussian Fuzzy Classifier), which generates and updates a granular fuzzy model from a continuous stream of weakly labeled data. The evolving fuzzy system for monitoring computing centers has produced encouraging results in terms of accuracy and efficiency in real-time processing for online machine learning applications.
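The combination of a sliding window with a statistical control chart mentioned in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's method: it uses a plain mean ± k·sigma rule over the preceding window and does not implement the eGFC classifier. The function name `control_chart_flags` and the defaults `window=5` and `k=3.0` are hypothetical choices.

```python
from collections import deque
from statistics import mean, pstdev


def control_chart_flags(values, window=5, k=3.0):
    """Flag values that fall outside mean +/- k*sigma of the preceding window.

    The first `window` values are never flagged because there is not yet
    enough history to estimate the control limits.
    """
    history = deque(maxlen=window)  # sliding window of the most recent values
    flags = []
    for x in values:
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            flags.append(abs(x - mu) > k * sigma)
        else:
            flags.append(False)
        # The new value enters the window whether or not it was flagged.
        history.append(x)
    return flags
```

Note that a flagged value still enters the window and widens the limits for the points that follow; a practical monitor might exclude flagged points from the history, which is a design choice this sketch leaves open.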


“…In addition, user logs are noticed as service-oriented unstructured data. Large volumes of data are produced by a number of system logs, which makes the implementation of a general-purpose log-based predictive maintenance solution challenging [10]. Logging activity means the rate of lines written in a log file.…”

Section: Introduction (mentioning)

confidence: 99%

Comparison of Evolving Granular Classifiers applied to Anomaly Detection for Predictive Maintenance in Computing Centers

Decker, Leite, Romano et al.

2020

Preprint

Self Cite


Log-based predictive maintenance of computing centers is a main concern regarding the worldwide computing grid that supports the CERN (European Organization for Nuclear Research) physics experiments. A log, as event-oriented ad hoc information, is quite often given as unstructured big data. Log data processing is a time-consuming computational task. The goal is to grab essential information from a continuously changeable grid environment to construct a classification model. Evolving granular classifiers are suited to learn from time-varying log streams and, therefore, perform online classification of the severity of anomalies. We formulated a 4-class online anomaly classification problem, and employed time windows between landmarks and two granular computing methods, namely, Fuzzy set-Based evolving Modeling (FBeM) and evolving Granular Neural Network (eGNN), to model and monitor logging activity rate. The results of classification are of utmost importance for predictive maintenance because priority can be given to specific time intervals in which the classifier indicates the existence of high or medium severity anomalies.
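The logging activity rate that these classifiers monitor (the rate of lines written to a log file) can be computed along the following lines. This is a sketch only: it assumes fixed-length windows rather than the landmark-based windows of the paper, takes timestamps as seconds, and says nothing about FBeM or eGNN. The function name `logging_activity_rate` and the default `window_seconds=60.0` are hypothetical.

```python
from collections import Counter


def logging_activity_rate(timestamps, window_seconds=60.0):
    """Return the number of log lines per second in each consecutive window.

    `timestamps` holds one entry per log line, in seconds. Windows are
    counted from the earliest timestamp; windows with no lines get rate 0.
    """
    if not timestamps:
        return []
    start = min(timestamps)
    # Map every line to the index of the window it falls into.
    counts = Counter(int((t - start) // window_seconds) for t in timestamps)
    n_windows = max(counts) + 1
    return [counts.get(i, 0) / window_seconds for i in range(n_windows)]
```

The resulting series of rates is the kind of signal an online classifier could then label by severity, window by window.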
