IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012)
DOI: 10.1109/dsn.2012.6263946

Assessing time coalescence techniques for the analysis of supercomputer logs

Abstract: This paper presents a novel approach to assess time coalescence techniques. These techniques are widely used to reconstruct the failure process of a system and to estimate dependability measurements from its event logs. The approach is based on the use of automatically generated logs, accompanied by the exact knowledge of the ground truth on the failure process. The assessment is conducted by comparing the presumed failure process, reconstructed via coalescence, with the ground truth. We focus on supercomputer…
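
As a rough illustration of the idea in the abstract, the sketch below groups log events into tuples with a fixed time window and then compares the result against a known ground truth. It is not the paper's method or tooling: the event layout `(timestamp, failure_id)`, the function names, and the window value are all assumptions made for this example, and "collision" and "truncation" are used in their usual sense in the coalescence literature (one tuple mixing several failures, and one failure split across several tuples).

```python
from typing import Dict, List, Tuple

# Hypothetical event layout: (timestamp in seconds, ground-truth failure id).
Event = Tuple[float, int]


def coalesce(events: List[Event], window: float) -> List[List[Event]]:
    """Group events into tuples: an event joins the current tuple when it
    occurs within `window` seconds of the previous event."""
    groups: List[List[Event]] = []
    for ev in sorted(events):
        if groups and ev[0] - groups[-1][-1][0] <= window:
            groups[-1].append(ev)
        else:
            groups.append([ev])
    return groups


def compare_with_truth(groups: List[List[Event]]) -> Tuple[int, int]:
    """Return (collisions, truncations) of the presumed failure process."""
    # Collision: a single tuple holds events from more than one real failure.
    collisions = sum(1 for g in groups if len({fid for _, fid in g}) > 1)
    # Truncation: a single real failure is spread over more than one tuple.
    tuples_per_failure: Dict[int, int] = {}
    for g in groups:
        for fid in {fid for _, fid in g}:
            tuples_per_failure[fid] = tuples_per_failure.get(fid, 0) + 1
    truncations = sum(1 for n in tuples_per_failure.values() if n > 1)
    return collisions, truncations


# Made-up events for illustration only; window of 10 seconds is arbitrary.
events = [(0.0, 1), (5.0, 1), (100.0, 1), (103.0, 2), (500.0, 3)]
groups = coalesce(events, window=10.0)
collisions, truncations = compare_with_truth(groups)
```

A larger window tends to reduce truncations at the cost of more collisions, which is why the choice of window matters when estimating dependability measurements from a log.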

Cited by 25 publications (7 citation statements)
References 34 publications
“…For the analysis of logs of large-scale systems, certain approaches apply filtering, followed by extraction and categorization of error events [14] [15] [16]. Other analyses use approaches such as time coalescing [17]. Some studies have focused on analysis of failure characteristics of specific subsystems or system components in HPC systems, such as disks [18], DRAM memory [18] [19] [20], graphical processing units (GPU) [21].…”
Section: Related Work (mentioning)
confidence: 99%
“…Prior research activities have centered on analyzing error logs [1][2][3][4][5][6] as well as some online analysis for patterns preceding a failure, and evaluated the accuracy and efficacy of anomaly detection and proactive response [12,13]. They have addressed one or more of the following issues: basic error characteristics [1,2,5], modeling and evaluation [6,14,15], failure prediction and proactive checkpointing [16,17]. There are many challenges in systematically studying large-scale systems using operational data, such as data availability, data collection/mining and fault/failure characterization.…”
Section: Related Work (mentioning)
confidence: 99%
“…Although today we understand the main characteristics of failures in supercomputing environments [1][2][3][4][5][6][7][8], the issue of job and application resiliency has been less well-studied. Modern supercomputers are equipped with fault-tolerant infrastructures that are capable of protecting job and application executions from failures due to either hardware or software problems.…”
Section: Introduction (mentioning)
confidence: 99%
“…The distributions of the transitions and the weights for the models are reported in Table I. They have been computed by means of workload analysis [14], log analysis [15], and linear regression [16]. Simulations are performed by means of SPNP [17] with the method of batch means, so achieving steady state behavior [18], with 95% confidence intervals and maximum half-width of 5%.…”
Section: Modeling a Virtualized Batch System (mentioning)
confidence: 99%