Summary

This paper presents CoopREP, a system that supports fault replication of concurrent programs through cooperative recording and partial log combination. CoopREP uses partial logging to reduce the amount of information that each program instance must store to support deterministic replay. This substantially reduces the overhead imposed by code instrumentation, but raises the problem of finding a combination of logs capable of replaying the fault. CoopREP tackles this issue by introducing several innovative statistical analysis techniques aimed at guiding the search for the partial logs to be combined for the replay phase. CoopREP has been evaluated using both standard benchmarks for multithreaded applications and real-world applications. The results highlight that CoopREP can successfully replay concurrency bugs involving tens of thousands of memory accesses, while reducing recording overhead with respect to state-of-the-art noncooperative logging schemes by up to 13× (and by 2.4× on average).
KEYWORDS

concurrency errors, debugging, partial logging, record and replay
INTRODUCTION

Concurrent programming is of paramount importance to exploit the full potential of emerging multicore architectures. However, writing and debugging concurrent programs is notoriously difficult. Contrary to most bugs in sequential programs, which usually depend exclusively on the program input and the execution environment (and can therefore be reproduced more easily), concurrency bugs depend on nondeterministic interactions among threads. This means that even when re-executing the same code with identical inputs, on the same machine, the outcome of the program may differ from run to run [1].

Deterministic replay (or record and replay) addresses this issue by recording nondeterministic events (such as the order of accesses to shared-memory locations) during a failing execution and then using the resulting trace to support the reproduction of the error [2]. Classic approaches [3-6], also referred to as order-based, trace the relative order of all relevant events, thus allowing the bug to be replayed at the first attempt. Unfortunately, they also come with an excessively high recording cost (10×-100× slowdown), which is impractical for most settings.

Motivated by the observation that the most significant performance constraints are on production runs, more recent solutions have adopted a search-based approach [1,7-9]. Search-based solutions reduce the recording overhead at the cost of a longer reproduction time during diagnosis. To this end, they typically log incomplete information at runtime and rely on post-recording exploration techniques to complete the missing data. These techniques explore various trade-offs between recording overhead and replay efficacy (i.e., the number of replay attempts required to reproduce the bug).

The work in this paper aims at further reducing the overhead achievable using either order-based or search-based techniques, by devising cooperative logging schemes tha...
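The kind of nondeterminism discussed above can be illustrated with a minimal sketch (not taken from the paper; class and variable names are hypothetical): two threads perform unsynchronized read-modify-write updates on a shared counter, so identical inputs can yield different final values across runs. Reproducing a particular faulty outcome requires knowing the interleaving of the shared-memory accesses, which is exactly what order-based recorders log.

```java
// Hypothetical illustration of a concurrency bug: the final value of
// `counter` depends on the thread interleaving, not on the program input.
public class RacyCounter {
    static int counter = 0;                 // shared, unsynchronized state
    static final int ITERATIONS = 100_000;

    static void work() {
        for (int i = 0; i < ITERATIONS; i++) {
            counter++;                      // non-atomic: load, add, store
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(RacyCounter::work);
        Thread t2 = new Thread(RacyCounter::work);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        // Lost updates typically make the result smaller than 200000;
        // the exact value varies from run to run with identical input.
        System.out.println("final counter = " + counter);
    }
}
```

Replaying a specific outcome of this program deterministically requires a trace of the relative order of the accesses to `counter`, which is why full order-based logging is expensive: every shared access must be instrumented.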