New congestion control algorithms are rapidly improving datacenters by reducing latency, overcoming incast, increasing throughput and improving fairness. Ideally, the operating system in every server and virtual machine is updated to support new congestion control algorithms. However, legacy applications often cannot be upgraded to a new operating system version, which means the advances are offlimits to them. Worse, as we show, legacy applications can be squeezed out, which in the worst case prevents the entire network from adopting new algorithms. Our goal is to make it easy to deploy new and improved congestion control algorithms into multitenant datacenters, without having to worry about TCP-friendliness with nonparticipating virtual machines. This paper presents a solution we call virtualized congestion control. The datacenter owner may introduce a new congestion control algorithm in the hypervisors. Internally, the hypervisors translate between the new congestion control algorithm and the old legacy congestion control, allowing legacy applications to enjoy the benefits of the new algorithm. We have implemented proof-of-concept systems for virtualized congestion control in the Linux kernel and in VMware's ESXi hypervisor, achieving improved fairness, performance, and control over guest bandwidth allocations.
Nowadays, the efficiency and even the feasibility of traditional load-balancing policies are challenged by the rapid growth of cloud infrastructure and the increasing levels of server heterogeneity. In such heterogeneous systems with many loadbalancers, traditional solutions, such as JSQ, incur a prohibitively large communication overhead and detrimental incast effects due to herd behavior. Alternative low-communication policies, such as JSQ(d) and the recently proposed JIQ, are either unstable or provide poor performance.We introduce the Local Shortest Queue (LSQ) family of load balancing algorithms. In these algorithms, each dispatcher maintains its own, local, and possibly outdated view of the server queue lengths, and keeps using JSQ on its local view. A small communication overhead is used infrequently to update this local view. We formally prove that as long as the error in these local estimates of the server queue lengths is bounded in expectation, the entire system is strongly stable. Finally, in simulations, we show how simple and stable LSQ policies exhibit appealing performance and significantly outperform existing low-communication policies, while using an equivalent communication budget. In particular, our simple policies often outperform even JSQ due to their reduction of herd behavior. We further show how, by relying on smart servers (i.e., advanced pull-based communication), we can further improve performance and lower communication overhead.
A parallel server system is considered in which a dispatcher routes incoming jobs to a fixed number of heterogeneous servers, each with its own queue. Much effort has been previously made to design policies that use limited state information (e.g., the queue lengths in a small subset of the set of servers, or the identity of the idle servers). However, existing policies either do not achieve the stability region or perform poorly in terms of job completion time. We introduce Persistent-Idle (PI), a new, perhaps counterintuitive, load-distribution policy that is designed to work with limited state information. Roughly speaking, PI always routes to the server that has last been idle. Our main result is that this policy achieves the stability region. Because it operates quite differently from existing policies, our proof method differs from standard arguments in the literature. Specifically, large time properties of reflected random walk, along with a careful choice of a Lyapunov function, are combined to obtain a Lyapunov condition over sufficiently long-time intervals. We also provide simulation results that indicate that job completion times under PI are low for different choices of system parameters, compared with several state-of-the-art load-distribution schemes.
No abstract
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.