CALYPSO: a novel software system for fault-tolerant parallel processing on distributed platforms

Baratloo, A.; Dasgupta, Partha; Kedem, Zvi M.

doi:10.1109/hpdc.1995.518702

Cited by 44 publications

(28 citation statements)

References 14 publications


“…To clarify the description of the new protocol, we distinguish between active and dormant TMUs, i.e. TMUs that handle active objects or dormant objects, respectively. We now describe the protocol for both kinds of TMUs.…”

Section: Methods (mentioning)

confidence: 99%

“…fault-tolerant networks and system reconfiguration after a fault. There has been some though, for example, FT-Linda [4], PLinda [15], Orca [16], Calypso [5], and Fail-safe PVM [17]. These systems use a combination of well known mechanisms such as replication, transactions, message logging, or checkpoints and rollbacks to provide fault-tolerance.…”

Section: Related Work (mentioning)

confidence: 99%


Fault tolerance via replication in coarse grain data-flow

Nguyen-Tuong

Grimshaw

Karpovich

1996

Parallel Symbolic Languages and Systems


“…In such DAGs all tasks at layer ℓ must be completed before any task at layer ℓ + 1 begins. Most previous bounds for firing-squad and eager scheduling apply only to these DAGs (see e.g., [27,4,3,2,5,25,30,28,26,7,6,8,31]). By including only critical tasks in the enabled pool, Level effectively transforms an arbitrary DAG into a synchronization-barrier DAG.…”

Section: Preliminaries: Firing-squad Scheduling With Synchronization (mentioning)

confidence: 99%
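The layered structure described in this statement can be illustrated with a short sketch. It is not the Level scheduler from the citing paper; it only shows, under stated assumptions, how an arbitrary DAG can be partitioned into layers so that every task in one layer finishes before any task in the next layer starts. The function name `layer_tasks` and the edge-list input format are hypothetical choices made for this example.

```python
from collections import defaultdict, deque


def layer_tasks(num_tasks, edges):
    """Group tasks 0..num_tasks-1 into layers (illustrative helper).

    A task's layer is the length of the longest path leading to it, so all
    of its predecessors sit in strictly earlier layers.
    """
    successors = defaultdict(list)
    indegree = [0] * num_tasks
    for u, v in edges:  # edge (u, v) means u must finish before v starts
        successors[u].append(v)
        indegree[v] += 1

    layer = [0] * num_tasks
    ready = deque(t for t in range(num_tasks) if indegree[t] == 0)
    visited = 0
    while ready:
        u = ready.popleft()
        visited += 1
        for v in successors[u]:
            # v must come after its deepest predecessor
            layer[v] = max(layer[v], layer[u] + 1)
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    if visited != num_tasks:
        raise ValueError("graph has a cycle")

    grouped = defaultdict(list)
    for task, depth in enumerate(layer):
        grouped[depth].append(task)
    return [grouped[i] for i in range(len(grouped))]
```

Running each returned layer to completion before starting the next gives the synchronization-barrier behaviour the statement refers to; the number of layers is one more than the number of edges on the longest path in the DAG.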

“…Much previous work in asynchronous parallel computing considers firing-squad and other eager-scheduling algorithms (see e.g., [2,3,4,5,7,6,8,25,26,27,28,30,31]). This prior work focuses on executing programs with full synchronization barriers, frequently PRAM programs.…”

Section: Firing-squad Scheduling (mentioning)

confidence: 99%

Scheduling DAGs on asynchronous processors

Bender

Phillips

2007

Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures


This paper addresses the problem of scheduling a DAG of unit-length tasks on asynchronous processors, that is, processors having different and changing speeds. The objective is to minimize the makespan, that is, the time to execute the entire DAG. Asynchrony is modeled by an oblivious adversary, which is assumed to determine the processor speeds at each point in time. The oblivious adversary may change processor speeds arbitrarily and arbitrarily often, but makes speed decisions independently of any random choices of the scheduling algorithm. This paper gives bounds on the makespan of two randomized online firing-squad scheduling algorithms, All and Level. These two schedulers are shown to have good makespan even when asynchrony is arbitrarily extreme. Let W and D denote, respectively, the number of tasks and the longest path in the DAG, and let π_ave denote the average speed of the p processors during the execution.


“…MILAN takes advantage of two execution techniques with strong theoretical foundations [5] (two-phase idempotent execution strategy and eager scheduling) to provide programmers with the view of a fault-free virtual shared memory environment, even when the underlying resources may incur faults and exhibit wide variations in processing speeds. This support is exposed to the programmer in the form of several programming systems: Calypso [1] described in further detail below, Chime [15] which supports distributed execution of CC++ [3] programs, and Charlotte [2] which provides a web-based metacomputing infrastructure. In addition, the MILAN system consists of supporting infrastructure such as ResourceBroker (a system for dynamically managing the association and integration of resources into multiple parallel computations according to user-specified policies) and Knitting Factory (a toolkit for construction of distributed applications in an unpredictable metacomputing environment).…”

Section: The Milan System (mentioning)

confidence: 99%
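The two techniques named in this statement can be illustrated with a small single-process simulation. This is a sketch under stated assumptions, not Calypso's or MILAN's actual interface: the function `run_step`, its parameters, and the simulated failure model are all hypothetical. Eager scheduling is modeled by handing unfinished tasks to every worker in each round, even when another worker already holds the same task. Two-phase idempotent execution is modeled by letting tasks read only a snapshot of shared memory and buffering their writes, which are applied once at the end of the step.

```python
import random


def run_step(tasks, shared, num_workers=3, failure_rate=0.3, seed=0):
    """Simulate one parallel step (illustrative, not a real Calypso API).

    tasks: list of functions that take a read-only snapshot (dict) and
           return a dict of updates.
    shared: dict standing in for shared memory; updated in place.
    Returns the number of task executions that were attempted.
    """
    if not 0 <= failure_rate < 1:
        raise ValueError("failure_rate must be in [0, 1)")
    rng = random.Random(seed)
    snapshot = dict(shared)  # phase 1: tasks read only this snapshot
    buffered = {}            # task index -> updates from its first completion
    attempts = 0
    while len(buffered) < len(tasks):
        unfinished = [i for i in range(len(tasks)) if i not in buffered]
        # Eager scheduling: every worker receives an unfinished task,
        # so one task may be executed by several workers.
        for w in range(num_workers):
            i = unfinished[w % len(unfinished)]
            attempts += 1
            if rng.random() < failure_rate:
                continue  # simulated crash or slow worker: result is lost
            updates = tasks[i](snapshot)
            buffered.setdefault(i, updates)  # keep the first completion only
    # Phase 2: apply each task's buffered updates exactly once.
    for i in sorted(buffered):
        shared.update(buffered[i])
    return attempts
```

For example, a single task `lambda snap: {"counter": snap["counter"] + 1}` run with three workers and no failures is executed three times, yet `shared["counter"]` increases by exactly one, because duplicate executions read the same snapshot and only one buffered result is applied.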

Exploiting Application Tunability for Efficient, Predictable Parallel Resource Management

Chang

Karamcheti

Kedem

1998


Parallel computing is becoming increasingly central and mainstream, driven both by the widespread availability of commodity SMP and high-performance cluster platforms, as well as the growing use of parallelism in general-purpose applications such as image recognition, virtual reality, and media processing. In addition to performance requirements, the latter computations impose soft real-time constraints, necessitating efficient, predictable parallel resource management. Unfortunately, traditional resource management approaches in both parallel and real-time systems are inadequate for meeting this objective; the parallel approaches focus primarily on improving application performance and/or system utilization at the cost of arbitrarily delaying a given application, while the real-time approaches are overly conservative, sacrificing system utilization in order to meet application deadlines. In this paper, we propose a novel approach for increasing parallel system utilization while meeting application soft real-time deadlines. Our approach exploits the application tunability found in several general-purpose computations. Tunability refers to an application's ability to trade off resource requirements over time, while maintaining a desired level of output quality. In other words, a large allocation of resources in one stage of the computation's lifetime may compensate, in a parameterizable manner, for a smaller allocation in another stage. We first describe language extensions to support tunability in the Calypso programming system, a component of the MILAN metacomputing project, and evaluate their expressiveness using an image processing application. We then characterize the performance benefits of tunability, using a synthetic task system to systematically identify its benefits and shortcomings. Our results are very encouraging: application tunability is convenient to express, and can significantly improve parallel system utilization for computations with predictability requirements.

