This paper presents an analysis of how Linux's performance has evolved over the past seven years. Unlike recent works that focus on OS performance in terms of scalability or service of a particular workload, this study goes back to basics: the latency of core kernel operations (e.g., system calls, context switching, etc.). To our surprise, the study shows that the performance of many core operations has worsened or fluctuated significantly over the years. For example, the select system call is 100% slower than it was just two years ago. An in-depth analysis shows that over the past seven years, core kernel subsystems have been forced to accommodate an increasing number of security enhancements and new features. These additions steadily add overhead to core kernel operations, but also frequently introduce extreme slowdowns of more than 100%. In addition, simple misconfigurations have also severely impacted kernel performance. Overall, we find that most of the slowdowns can be attributed to 11 changes. Some forms of slowdown are avoidable with more proactive engineering: we show that it is possible to patch two of these 11 changes (both security enhancements) to eliminate most of their overheads. In fact, several features have been introduced to the kernel unoptimized or insufficiently tested, and were only improved or disabled long after their release. Our findings also highlight both the feasibility and the importance of Linux users actively configuring their systems to achieve an optimal balance between performance, functionality, and security: we find that 8 of the 11 changes can be avoided by reconfiguring the kernel, and the other 3 can be disabled through simple patches. By disabling all 11 changes with the goal of optimizing performance, we speed up Redis, Apache, and Nginx benchmark workloads by as much as 56%, 33%, and 34%, respectively.
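To make the kind of measurement behind these numbers concrete, below is a minimal, hypothetical sketch of timing a core kernel operation from user space. It is not the paper's benchmark (the authors measure latency with a dedicated microbenchmark across many kernel versions); it simply illustrates the idea of repeatedly invoking the select system call and averaging the per-call latency.

```python
import select
import time

def measure_select_latency(iterations: int = 100_000) -> float:
    """Approximate the per-call latency of the select system call.

    Each call waits on no file descriptors with a zero timeout, so it
    returns immediately and the loop is dominated by syscall cost plus
    interpreter overhead (illustrative only; a real study would use a
    C microbenchmark pinned to a core).
    """
    start = time.perf_counter()
    for _ in range(iterations):
        select.select([], [], [], 0)
    elapsed = time.perf_counter() - start
    return elapsed / iterations

if __name__ == "__main__":
    latency = measure_select_latency()
    print(f"approx. select() latency: {latency * 1e6:.2f} us per call")
```

Comparing the same loop across kernel versions, with hardware and configuration held fixed, is what surfaces regressions like the select slowdown mentioned above.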
When systems fail in production environments, log data is often the only information available to programmers for postmortem debugging. Consequently, programmers' decisions about where to place log printing statements are of crucial importance, as they directly affect how effective and efficient postmortem debugging can be. This paper presents Log20, a tool that determines a near-optimal placement of log printing statements under the constraint of adding less than a specified amount of performance overhead. Log20 does this in an automated way, without any human involvement. Guided by information theory, the core of our algorithm measures how effective each log printing statement is at disambiguating code paths. To do so, it uses the frequencies of different execution paths, collected from a production environment by a low-overhead tracing library. We evaluated Log20 on HDFS, HBase, Cassandra, and ZooKeeper, and observed that Log20 is substantially more efficient at code path disambiguation than the developers' manually placed log printing statements. Log20 can also output a curve showing the trade-off between the informativeness of the logs and the performance slowdown, so that a developer can choose the right balance.
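The information-theoretic core can be illustrated with a small, hypothetical sketch (not Log20's actual implementation, which instruments Java systems and searches for placements under an overhead budget): paths that produce the same log output are indistinguishable, so a candidate placement is scored by the expected entropy of the path distribution that remains after observing the logs.

```python
import math
from collections import defaultdict

def remaining_path_entropy(path_freqs, logged_blocks):
    """Expected entropy (bits) of the executed path given the log output.

    path_freqs:    {path: frequency}, each path a tuple of basic-block ids
                   observed by a production tracing library.
    logged_blocks: set of basic-block ids where a log statement would be placed.
    Lower is better: 0 means the logs fully disambiguate the paths.
    (Hypothetical sketch of the scoring idea only.)
    """
    total = sum(path_freqs.values())
    groups = defaultdict(dict)
    for path, freq in path_freqs.items():
        log_output = tuple(b for b in path if b in logged_blocks)
        groups[log_output][path] = freq

    expected = 0.0
    for members in groups.values():
        group_total = sum(members.values())
        entropy = -sum((f / group_total) * math.log2(f / group_total)
                       for f in members.values())
        expected += (group_total / total) * entropy
    return expected

# Hypothetical path profile: logging block "C" separates the rare path,
# but the two common paths remain indistinguishable (~0.69 bits left).
freqs = {("A", "B", "D"): 70, ("A", "B", "E"): 20, ("A", "C", "D"): 10}
print(remaining_path_entropy(freqs, {"C"}))
print(remaining_path_entropy(freqs, {"C", "E"}))  # 0.0: fully disambiguated
```

A placement algorithm can then favor the logging points that reduce this entropy the most per unit of added runtime cost, which is the trade-off curve the tool exposes.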
The end goal of failure diagnosis is to locate the root cause. Prior root cause localization approaches almost all rely on statistical analysis. This paper proposes a different approach, based on the observation that if we model an execution as a totally ordered sequence of instructions, then the root cause can be identified by the first instruction at which the failure execution deviates from the non-failure execution that shares the longest instruction-sequence prefix with it. Root cause analysis is thus transformed into a principled search problem: identify the non-failure execution with the longest common prefix. We present Kairux, a tool that does just that. It is, in most cases, capable of pinpointing the root cause of a failure in a distributed system in a fully automated way. Kairux uses tests from the system's rich unit test suite as building blocks to construct the non-failure execution that has the longest common prefix with the failure execution, in order to locate the root cause. By evaluating Kairux on some of the most complex, real-world failures from HBase, HDFS, and ZooKeeper, we show that Kairux can accurately pinpoint each failure's root cause.
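Viewed as a search problem, the core comparison can be sketched in a few lines. This is a hypothetical illustration only, not Kairux itself, which constructs candidate non-failure executions from unit tests and compares dynamic traces of a distributed system rather than simple instruction lists.

```python
def locate_root_cause(failure_trace, non_failure_traces):
    """Return the instruction at which the failure run first deviates from
    the non-failure run sharing the longest common prefix with it.

    Each trace is a list of instruction identifiers in total order.
    (Illustrative sketch of the search formulation only.)
    """
    def common_prefix_len(a, b):
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n

    # The non-failure execution with the longest common prefix is the most
    # informative comparison point.
    best = max(non_failure_traces,
               key=lambda trace: common_prefix_len(failure_trace, trace))
    divergence = common_prefix_len(failure_trace, best)
    # The first instruction past the shared prefix is the candidate root cause.
    return failure_trace[divergence] if divergence < len(failure_trace) else None

# Hypothetical traces with made-up instruction identifiers.
failing = ["init", "openRegion", "readConfig", "useStaleValue", "abort"]
passing = [
    ["init", "openRegion", "readConfig", "refreshValue", "reply"],
    ["init", "openRegion", "close"],
]
print(locate_root_cause(failing, passing))  # -> "useStaleValue"
```

The hard part in practice, and the focus of the tool, is producing non-failure executions whose prefixes overlap the failure as much as possible; the unit test suite supplies the building blocks for that construction.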