Shay Vargaftik scite author profile

New congestion control algorithms are rapidly improving datacenters by reducing latency, overcoming incast, increasing throughput and improving fairness. Ideally, the operating system in every server and virtual machine is updated to support new congestion control algorithms. However, legacy applications often cannot be upgraded to a new operating system version, which means the advances are offlimits to them. Worse, as we show, legacy applications can be squeezed out, which in the worst case prevents the entire network from adopting new algorithms. Our goal is to make it easy to deploy new and improved congestion control algorithms into multitenant datacenters, without having to worry about TCP-friendliness with nonparticipating virtual machines. This paper presents a solution we call virtualized congestion control. The datacenter owner may introduce a new congestion control algorithm in the hypervisors. Internally, the hypervisors translate between the new congestion control algorithm and the old legacy congestion control, allowing legacy applications to enjoy the benefits of the new algorithm. We have implemented proof-of-concept systems for virtualized congestion control in the Linux kernel and in VMware's ESXi hypervisor, achieving improved fairness, performance, and control over guest bandwidth allocations.

show abstract

LSQ: Load Balancing in Large-Scale Heterogeneous Systems With Multiple Dispatchers

Vargaftik

Keslassy

Orda

2020

IEEE/ACM Trans. Networking

View full text Add to dashboard Cite

Nowadays, the efficiency and even the feasibility of traditional load-balancing policies are challenged by the rapid growth of cloud infrastructure and the increasing levels of server heterogeneity. In such heterogeneous systems with many loadbalancers, traditional solutions, such as JSQ, incur a prohibitively large communication overhead and detrimental incast effects due to herd behavior. Alternative low-communication policies, such as JSQ(d) and the recently proposed JIQ, are either unstable or provide poor performance.We introduce the Local Shortest Queue (LSQ) family of load balancing algorithms. In these algorithms, each dispatcher maintains its own, local, and possibly outdated view of the server queue lengths, and keeps using JSQ on its local view. A small communication overhead is used infrequently to update this local view. We formally prove that as long as the error in these local estimates of the server queue lengths is bounded in expectation, the entire system is strongly stable. Finally, in simulations, we show how simple and stable LSQ policies exhibit appealing performance and significantly outperform existing low-communication policies, while using an equivalent communication budget. In particular, our simple policies often outperform even JSQ due to their reduction of herd behavior. We further show how, by relying on smart servers (i.e., advanced pull-based communication), we can further improve performance and lower communication overhead.

show abstract

Persistent-Idle Load-Distribution

et al. 2020

View full text Add to dashboard Cite

A parallel server system is considered in which a dispatcher routes incoming jobs to a fixed number of heterogeneous servers, each with its own queue. Much effort has been previously made to design policies that use limited state information (e.g., the queue lengths in a small subset of the set of servers, or the identity of the idle servers). However, existing policies either do not achieve the stability region or perform poorly in terms of job completion time. We introduce Persistent-Idle (PI), a new, perhaps counterintuitive, load-distribution policy that is designed to work with limited state information. Roughly speaking, PI always routes to the server that has last been idle. Our main result is that this policy achieves the stability region. Because it operates quite differently from existing policies, our proof method differs from standard arguments in the literature. Specifically, large time properties of reflected random walk, along with a careful choice of a Lyapunov function, are combined to obtain a Lyapunov condition over sufficiently long-time intervals. We also provide simulation results that indicate that job completion times under PI are low for different choices of system parameters, compared with several state-of-the-art load-distribution schemes.

show abstract

DRIVE: One-bit Distributed Mean Estimation

Vargaftik¹,

Basat²,

Portnoy³

et al. 2021

Preprint

View full text Add to dashboard Cite

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

customersupport@researchsolutions.com

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Shay Vargaftik

Virtualized Congestion Control

LSQ: Load Balancing in Large-Scale Heterogeneous Systems With Multiple Dispatchers

Persistent-Idle Load-Distribution

DRIVE: One-bit Distributed Mean Estimation

Contact Info

Product

Resources

About