John Allspaw scite author profile

John Allspaw

5 Publications

23 Citation Statements Received

6 Citation Statements Given


Affiliations

Ansys (United States)

Publications


Resilience Engineering: Learning to Embrace Failure

Robbins

Krishnan

Allspaw

et al. 2012

Queue


In the early 2000s, Amazon created GameDay, a program designed to increase resilience by purposely injecting major failures into critical systems semi-regularly to discover flaws and subtle dependencies. Basically, a GameDay exercise tests a company’s systems, software, and people in the course of preparing for a response to a disastrous event. Widespread acceptance of the GameDay concept has taken a few years, but many companies now see its value and have started to adopt their own versions. This discussion considers some of those experiences.


Fault Injection in Production

Allspaw

2012

Queue


When we build Web infrastructures at Etsy, we aim to make them resilient. This means designing them carefully so that they can sustain their (increasingly critical) operations in the face of failure.

Thankfully, there have been a couple of decades and reams of paper spent on researching how fault tolerance and graceful degradation can be brought to computer systems. That helps the cause.

To make sure that the resilience built into Etsy systems is sound and that the systems behave as expected, we have to see the failures being tolerated in production.

Why production? Why not simulate this in a QA or staging environment? First, the existence of any differences in those environments brings uncertainty to the exercise, and second, the risk of not recovering has no consequences during testing, which can bring hidden assumptions into the fault-tolerance design and into recovery. The goal is to reduce uncertainty, not increase it.

Forcing failures to happen, or even designing systems to fail on their own, generally isn't easily sold to management. Engineers are not conditioned to embrace their ability to respond to emergencies; they aim to avoid them altogether. Taking a detailed look at how to respond better to failure is essentially accepting that failure will happen, which you might think is counter to what you want in engineering, or in business.

Take, for example, what you would normally think of as a simple case: the provisioning of a server or cloud instance from zero to production:

1. Bare metal (or cloud-compute instance) is made available.
2. Base operating system is installed via PXE (preboot execution environment) or machine image.
3. Operating-system-level configurations are put into place (via configuration management or machine image).
4. Application-level configurations are put into place (via configuration management, app deployment, or machine image).
5. Application code is put into place and underlying services are started correctly (via configuration management, app deployment, or machine image).
6. Systems integration takes place in the network (load balancers, VLANs, routing, switching, DNS, etc.).

This is probably an oversimplification, and each step or layer is likely to represent a multitude of CPU cycles; disk, network and/or memory operations; and various numbers of software mechanisms. All of these come together to bring a node into production.

Operability means that you can have confidence in this node coming into production, possibly joining a cluster, and serving live traffic seamlessly every time it happens. Furthermore, you want and expect to have confidence that if the underlying power, configuration, application, or compute resources (CPU, disk, memory, network, etc.) experience a fault, then you can survive such a fault by some means: allowing the application to degrade gracefully, rebuild itself, take itself out of production, and alert on the specifics of the fault, etc.

At a high level, production fault injection should be considered one of many approaches used to gain confidence in the safety ...
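The idea in the abstract above, injecting a fault and checking that the system degrades gracefully and alerts, can be illustrated with a minimal Python sketch. It is not taken from the article and does not describe Etsy's tooling: `FaultInjector`, `fetch_listing`, `serve`, and the "cached listing" fallback are all hypothetical names chosen for illustration, and the fault is simulated in-process rather than in a real production system.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List


class InjectedFault(Exception):
    """Raised by the hypothetical injector to simulate a failed dependency."""


@dataclass
class FaultInjector:
    """Fails a configurable fraction of calls (illustrative, not a real tool)."""
    failure_rate: float
    rng: random.Random = field(default_factory=random.Random)

    def call(self, fn: Callable[[], str]) -> str:
        # random() returns a float in [0.0, 1.0), so a rate of 1.0 always
        # fails and a rate of 0.0 never fails.
        if self.rng.random() < self.failure_rate:
            raise InjectedFault("simulated dependency failure")
        return fn()


def fetch_listing() -> str:
    # Stand-in for a real backend call (assumption for this sketch).
    return "full listing"


def serve(injector: FaultInjector, alerts: List[str]) -> str:
    """Serve a request, degrading gracefully and alerting when a fault occurs."""
    try:
        return injector.call(fetch_listing)
    except InjectedFault as exc:
        # Alert on the specifics of the fault, then return a degraded response.
        alerts.append(f"fault tolerated: {exc}")
        return "cached listing"


if __name__ == "__main__":
    alerts: List[str] = []
    print(serve(FaultInjector(failure_rate=1.0), alerts))
    print(alerts)
```

With `failure_rate=1.0` every call takes the degraded path, which makes the fallback and the alert easy to observe. A lower rate would exercise both paths, closer to a fault that is injected only occasionally.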


Revealing the critical role of human performance in software

Woods

Allspaw²

2020

Commun. ACM


Fault injection in production

Allspaw

2012

Commun. ACM


Revealing the Critical Role of Human Performance in Software

Woods

Allspaw²

2019

Queue



