2009 International Conference on Parallel Processing, 2009
DOI: 10.1109/icpp.2009.20

CIFTS: A Coordinated Infrastructure for Fault-Tolerant Systems

Abstract: Considerable work has been done on providing fault tolerance capabilities for different software components on large-scale high-end computing systems. Thus far, however, these fault-tolerant components have worked insularly and independently and information about faults is rarely shared. Such lack of system-wide fault tolerance is emerging as one of the biggest problems on leadership-class systems. In this paper, we propose a coordinated infrastructure, named CIFTS, that enables system software components to s…

Cited by 53 publications (41 citation statements)
References 19 publications

“…In particular, we have published approximately 30 publications, presented 40 talks and 5 posters, and conducted Birds-of-a-Feather sessions and round-table discussion sessions every year for the past four years at the IEEE/ACM Supercomputing conference, and we have given more than 25 demonstrations of our research at various venues. The FTB design paper [36] has been cited two dozen times and by several independent sources, in less than two years. The community-targeted Birds-of-a-Feather sessions held at the IEEE/ACM Supercomputing conference [24][25][26][27] have been popular, with more than 50 attendees each year.…”
Section: Outreach and Education Activities
mentioning
confidence: 99%
“…In summary, the majority of the FTB logic lies with the FTB agent. Further details about the design of the FTB implementation can be found in [36].…”
Section: The FTB Software - The CIFTS FTB API Implementation
mentioning
confidence: 99%
“…The STCI runtime provides basic monitoring and failure detectors that could be useful for experiment monitoring. The publish/subscribe services in STCI and CiFTS Fault Tolerant Backplane (FTB) [2] are also good candidates for implementing these event notification channels.…”
Section: Monitoring and Event Logging
mentioning
confidence: 99%
“…We applaud the HPC community for increasing releases of failure data [2], and efforts to provide tools for measuring application efficiency [6,1,5], and recommend that simple tools be developed and made available for wide use. The required data resides within distinct communities (namely, administrators and users) -another example that solutions to the resilience issues facing exascale computing will require unified efforts to overcome [7,21]. With teamwork and open minds, the prospects of yet-more-extreme computing remains bright, despite the anticipated resilience challenges.…”
mentioning
confidence: 99%