The performance and behavior of large-scale distributed applications are highly influenced by network properties such as latency, bandwidth, packet loss, and jitter. For instance, an engineer might need to answer questions such as: What is the impact of an increase in network latency on application response time? How does moving a cluster between geographical regions affect application throughput? How do network dynamics affect application stability? Answering these questions in a systematic and reproducible way is very hard, given the variability and lack of control over the underlying network. Unfortunately, state-of-the-art network emulators and testbeds scale poorly (e.g., MiniNet), focus exclusively on the control plane (e.g., CrystalNet), or ignore network dynamics (e.g., EmuLab).

Kollaps is a fully distributed network emulator that addresses these limitations. Kollaps hinges on two key observations. First, from an application's perspective, what matters are the emergent end-to-end properties (e.g., latency, bandwidth, packet loss, and jitter) rather than the internal state of the routers and switches that produce them. This premise allows us to build a simpler, dynamically adaptable emulation model that avoids maintaining the full network state. Second, this simplified model can be maintained in a fully decentralized way, allowing the emulation to scale with the number of application machines.

Kollaps is fully decentralized, agnostic of the application's language and transport protocol, scales to thousands of processes, and is accurate when compared against a bare-metal deployment or state-of-the-art approaches that emulate the full state of the network.
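To make the first observation concrete, the following minimal sketch (not Kollaps's actual implementation; all names are illustrative) shows how emergent end-to-end properties of a path can be derived from per-link parameters without tracking any router or switch state: latencies add, bandwidth is bounded by the bottleneck link, and losses compound multiplicatively.

```python
# Sketch of the simplified end-to-end emulation model described above.
# Link parameters and function names are hypothetical, for illustration only.
from dataclasses import dataclass
from math import prod

@dataclass
class Link:
    latency_ms: float      # one-way delay of this link
    bandwidth_mbps: float  # link capacity
    loss: float            # independent drop probability in [0, 1]

def end_to_end(path):
    """Collapse a path of links into its emergent end-to-end properties."""
    return {
        "latency_ms": sum(l.latency_ms for l in path),            # delays add
        "bandwidth_mbps": min(l.bandwidth_mbps for l in path),    # bottleneck
        "loss": 1.0 - prod(1.0 - l.loss for l in path),           # compounded
    }

# Example: a three-hop path through a congested middle link.
path = [Link(10, 1000, 0.001), Link(40, 100, 0.01), Link(5, 1000, 0.0)]
props = end_to_end(path)
```

Under this model, only these aggregate values need to be recomputed when the topology changes, which is what makes a decentralized, dynamically adaptable emulation tractable.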
We showcase how Kollaps can accurately reproduce results from the literature and predict the behaviour of a complex, unmodified distributed key-value store (i.e., Cassandra) under different deployments.

These difficulties also arise in the reproducibility crisis currently plaguing the systems community [29, 64]. Different results for the same system emerge not only because systems are evaluated under different, uncontrollable conditions, but also because research testbeds such as Emulab [46], CloudLab [72], or PlanetLab [35], commonly used to conduct experiments, tend to get overloaded right before systems conference deadlines [51]. We therefore need tools to systematically assess and reproduce the evaluation of large-scale applications. Notably, similar tools could be used to test for and thus prevent costly incidents due to misconfigurations, as in recent events [4, 20].

One approach to systematically evaluating a large-scale distributed system is to resort to simulation, which relies on models that capture the key properties of the target system and environment [25]. Simulation provides full control of the system and environment, achieving full reproducibility, and allows one to study the model of the system in a variety of scenarios. However, simulation suffers from several well-known problems. In fact, there is a large gap between the...