A microservice architecture features hundreds or even thousands of small loosely coupled services with multiple instances. Because microservice performance depends on many factors including the workload, inter-service traffic management is complex in such dynamic environments. Service meshes aim to handle this complexity and to facilitate management, observability, and communication between microservices. Service meshes provide various traffic management policies such as circuit breaking and retry mechanisms, which are claimed to protect microservices against overload and increase the robustness of communication between microservices. However, there have been no systematic studies on the effects of these mechanisms on microservice performance and robustness. Furthermore, the exact impact of various tuning parameters for circuit breaking and retries are poorly understood. This work presents a large set of experiments conducted to investigate these issues using a representative microservice benchmark in a Kubernetes testbed with the widely used Istio service mesh. Our experiments reveal effective configurations of circuit breakers and retries. The findings presented will be useful to engineers seeking to configure service meshes more systematically and also open up new areas of research for academics in the area of service meshes for (autonomic) microservice resource management.
CCS CONCEPTS• Computer systems organization → Cloud computing; Reliability; • General and reference → Empirical studies.
Site Reliability Engineers are at the center of two tensions: On one hand, they need to respond to alerts within a short time, to restore a non-functional system. On the other hand, short response times is disruptive to everyday life and lead to alert fatigue. To alleviate this tension, many resource management mechanisms are proposed handle overload and mitigate the faults. One recent such mechanism is circuit breaking in service meshes. Circuit breaking rejects incoming requests to protect latency at the expense of availability (successfully answered requests), but in many scenarios achieve neither due to the difficulty of knowing when to trigger circuit breaking in highly dynamic microservice environments.We propose an adaptive circuit breaking mechanism, implemented through an adaptive controller, that not only avoids overload and mitigate failure, but keeps the tail response time below a given threshold while maximizing service throughput. Our proposed controller is experimentally compared with a static circuit breaker across a wide set of overload scenarios in a testbed based on Istio and Kubernetes. The results show that our controller maintains tail response time below the given threshold 98% of the time (including cold starts) on average with an availability of 70% with 29% of requests circuit broken. This compares favorably to a static circuit breaker configuration, which features a 63% availability, 30% circuit broken requests, and more than 5% of requests timing out.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.