Proceedings of the Tenth European Conference on Computer Systems 2015
DOI: 10.1145/2741948.2741976

Taming uncertainty in distributed systems with help from the network

Abstract: Network and process failures cause complexity in distributed applications. When a remote process does not respond, the application cannot tell if the process or network have failed, or if they are just slow. Without this information, applications can lose availability or correctness. To address this problem, we propose Albatross, a service that quickly reports to applications the current status of a remote process: whether it is working and reachable, or not. Albatross is targeted at data centers equipped with …

Cited by 13 publications (10 citation statements)
References: 54 publications

“…Consequently, Paxos implementations rely on heartbeats and keep-alives with conservative end-to-end timeouts to ascertain the state of processes. Recent failure detectors [42][43][44] quickly and reliably detect failures and kickstart recovery mechanisms in asynchronous settings using a combination of local, host-based monitors that track the health of components across the stack, and lethal force. In cases where failures are suspected but cannot be confirmed, these detectors forcibly kill the process; the intuition behind this protocol (called STONITH or "Shoot the Other Node in the Head") is that unnecessary failures are preferable to uncertainty.…”
Section: Case Study: Paxos
Citation type: mentioning
confidence: 99%
“…Revocation is needed because compute element failures are not always failstop and the system must prevent a temporarily unavailable compute element from returning and corrupting state. The ToR switch can redirect cross-rack traffic to the new compute element using OpenFlow rules; further, it can also use these rules to fence the old compute element off from the rest of the network [43].…”
Section: Paxos Reconfiguration in DDCs
Citation type: mentioning
confidence: 99%
“…We classify existing failure detection methods into three categories: (1) basic heartbeats [30,71] in which a dedicated heartbeat process running on each component indicates the status (UP or DOWN) of the component; (2) service-aware heartbeats [23,38,44,46] in which the heartbeat process validates the liveness and functional correctness of the services; and (3) client view observations [33,77] in which failures are detected based on the observations of the clients.…”
Section: The State of the Art
Citation type: mentioning
confidence: 99%
“…Kaleidoscope is built upon the wealth body of work on failure detection [6,21,30,33,34,45,47]. In §2.1, we discuss the failure patterns that cannot be handled by the state-of-the-art failure detection methods.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
“…Albatross [35] discusses the challenges faced by distributed systems, and aims to mitigate them by leveraging SDN. The challenges such as split-brain scenarios and violations in consistency and availability that are addressed by Albatross are relevant for CPS too.…”
Section: Related Work
Citation type: mentioning
confidence: 99%