Despite the best efforts of software engineers to produce high-quality software, inevitably some bugs escape even the most rigorous testing process and are first encountered by end users. When this happens, such failures must be understood quickly, the underlying bugs fixed, and deployments patched to avoid another user (or the same one) running into the same problem again. As far back as 1951, the dawn of modern computing, Stanley Gill 6 wrote that "some attention has, therefore, been given to the problem of dealing with mistakes after the programme has been tried and found to fail." Gill went on to describe the first use of "the post-mortem technique" in software, whereby the running program was modified to record important system state as it ran so that the programmer could later understand what happened and why the software failed.

Since then, postmortem debugging technology has been developed and used in many different systems, including all major consumer and enterprise operating systems, as well as the native execution environments on those systems. These environments make up much of today's core infrastructure, from the operating systems that underlie every application to core services such as DNS (Domain Name System), and thus form the building blocks of nearly all larger systems. To achieve the high levels of reliability expected from such software, these systems are designed to restore service quickly after each failure while preserving enough information that the failure itself can later be completely understood.

While such software was historically written in C and other native environments, core infrastructure is increasingly being developed in dynamic languages, from Java over the past two decades to server-side JavaScript over the past 18 months.
Dynamic languages are attractive for many reasons, not least of which is that they often accelerate the development of complex software. Conspicuously absent from many of these environments, however, are facilities for even basic postmortem debugging, which makes understanding production failures extremely difficult. Dynamic languages must bridge this gap and provide rich tools for understanding failures in deployed systems in order to match the reliability demanded from their growing role in the bedrock of software systems.

To understand the real potential for sophisticated postmortem analysis tools, we will first review the state of debugging today and the role of postmortem analysis tools in other environments. We will then examine the unique challenges around building such tools for dynamic environments and the state of such tools today.
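
As a small illustration of the post-mortem idea described above (record state at the moment of failure so it can be studied later), the following Python sketch saves the call stack and local variables of a failing program to a file before letting the failure propagate. It is not a facility described in this article or provided by any particular runtime: the names `capture_failure_state`, `run_with_postmortem`, and `average`, as well as the JSON dump format, are hypothetical and chosen only for this example.

```python
import json


def capture_failure_state(exc):
    """Build a JSON-serializable record of the stack at the point of failure."""
    frames = []
    tb = exc.__traceback__
    while tb is not None:
        frame = tb.tb_frame
        frames.append({
            "function": frame.f_code.co_name,
            "file": frame.f_code.co_filename,
            "line": tb.tb_lineno,
            # repr() keeps the record serializable for arbitrary objects.
            "locals": {name: repr(value)
                       for name, value in frame.f_locals.items()},
        })
        tb = tb.tb_next
    return {
        "error": type(exc).__name__,
        "message": str(exc),
        "frames": frames,  # outermost call first, failing frame last
    }


def run_with_postmortem(func, dump_path):
    """Run func; on an uncaught exception, save the state and re-raise."""
    try:
        return func()
    except Exception as exc:
        with open(dump_path, "w") as fh:
            json.dump(capture_failure_state(exc), fh, indent=2)
        raise


def average(values):
    """Hypothetical faulty function: fails on an empty list."""
    total = sum(values)
    return total / len(values)
```

Calling `run_with_postmortem(lambda: average([]), "dump.json")` still raises the original `ZeroDivisionError`, but it first leaves behind a file recording each frame's function name, line number, and local variables, which can be inspected after the program has exited. A sketch like this captures far less than a native core dump (only the failing stack, and only string representations of values), which hints at the gap between the two kinds of environment.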
DEBUGGING IN THE LARGE

To understand the unique value of postmortem debugging, it's worth examining the alternative. Both native and dynamic environments today provide facilities for in situ debugging, or debugging faulty programs while they're still running. This typically involves attaching a separate debugger program to the faulty program and then directing execution of the faulty program interactively,