Serverless computing has seen rapid adoption due to its high scalability and flexible, pay-as-you-go billing model. In serverless, developers structure their services as a collection of functions, sporadically invoked by various events such as clicks. The high variability in the inter-arrival times of function invocations drives providers to start a new function instance upon each invocation, leading to significant cold-start delays that degrade user experience. To reduce cold-start latency, the industry has turned to snapshotting, whereby an image of a fully-booted function is stored on disk, enabling a faster invocation compared to booting a function from scratch. This work introduces vHive, an open-source framework for serverless experimentation with the goal of enabling researchers to study and innovate across the entire serverless stack. Using vHive, we characterize a state-of-the-art snapshot-based serverless infrastructure built on the industry-leading Containerd orchestration framework and the Firecracker hypervisor. We find that the execution time of a function started from a snapshot is 95% higher, on average, than when the same function is memory-resident. We show that the high latency is attributable to frequent page faults as the function's state is brought from disk into guest memory one page at a time. Our analysis further reveals that functions access the same stable working set of pages across different invocations of the same function. Leveraging this insight, we build REAP, a lightweight software mechanism for serverless hosts that records a function's stable working set of guest-memory pages and proactively prefetches it from disk into memory. Compared to baseline snapshotting, REAP slashes cold-start delays by 3.7×, on average.
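To make the record-and-prefetch idea concrete, here is a minimal sketch in Python. It assumes a snapshot file holding the guest-memory image and a hypothetical per-fault callback; REAP itself intercepts faults on the host (e.g., via userfaultfd), so the names, file format, and callback interface below are illustrative assumptions, not vHive's actual implementation.

```python
# Minimal sketch of the record-and-prefetch idea, assuming a snapshot
# file that holds the guest-memory image. Hypothetical helper names.

PAGE_SIZE = 4096

class WorkingSetRecorder:
    """Records which guest-memory pages fault during the first
    invocation of a function restored from a snapshot."""
    def __init__(self):
        self.offsets = set()   # page-aligned offsets into the snapshot

    def on_page_fault(self, guest_offset):
        # Invoked once per first-touch fault on a guest page.
        self.offsets.add(guest_offset & ~(PAGE_SIZE - 1))

    def save(self, ws_path):
        # Persist the recorded working set alongside the snapshot.
        with open(ws_path, "w") as f:
            f.writelines(f"{off}\n" for off in sorted(self.offsets))

def prefetch_working_set(snapshot_path, ws_path, guest_mem):
    """On later invocations, read the recorded pages from disk in one
    near-sequential pass and install them into guest memory up front,
    instead of demand-paging them one fault at a time."""
    with open(ws_path) as f:
        offsets = [int(line) for line in f if line.strip()]
    with open(snapshot_path, "rb") as snap:
        for off in offsets:               # sorted => mostly sequential I/O
            snap.seek(off)
            page = snap.read(PAGE_SIZE)
            guest_mem[off:off + PAGE_SIZE] = page  # eager population
```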
With explosive growth in dataset sizes and increasing machine memory capacities, per-application memory footprints commonly reach into hundreds of GBs. Such huge datasets pressure the TLB, resulting in frequent misses that must be resolved through a page walk: a long-latency pointer chase through multiple levels of the in-memory, radix-tree-based page table. Anticipating further growth in dataset sizes and their adverse
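To illustrate why a TLB miss is so costly, the sketch below walks a toy four-level radix page table the way x86-64 hardware does for 4 KB pages: each level contributes a 9-bit index, and each level's entry must be loaded before the next level's address is known. The table layout and helper names are hypothetical, for illustration only.

```python
# Toy four-level x86-64 page walk for 4 KB pages: a 48-bit virtual
# address = four 9-bit radix-tree indices + a 12-bit page offset.
# `page_tables` maps a table id to its 512 entries (hypothetical layout).

def walk_indices(vaddr):
    """Split a 48-bit virtual address into its PML4/PDPT/PD/PT indices
    and the page offset."""
    offset = vaddr & 0xFFF
    indices = [(vaddr >> shift) & 0x1FF for shift in (39, 30, 21, 12)]
    return indices, offset

def simulated_walk(page_tables, root, vaddr):
    """Four dependent loads: each level's entry must arrive before the
    next level's address is known, so the chase cannot be overlapped."""
    indices, offset = walk_indices(vaddr)
    node = root
    for idx in indices:                 # one memory access per level
        node = page_tables[node][idx]   # pointer chase through DRAM
    return node + offset                # physical frame base + offset

# Tiny usage: map vaddr 0x1000 to frame 0x9000 through four table levels.
pt = {"L4": [0] * 512, "L3": [0] * 512, "L2": [0] * 512, "L1": [0] * 512}
pt["L4"][0] = "L3"
pt["L3"][0] = "L2"
pt["L2"][0] = "L1"
pt["L1"][1] = 0x9000
assert simulated_walk(pt, "L4", 0x1000) == 0x9000
```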
Serverless computing has seen rapid adoption because of its instant scalability, flexible billing model, and economies of scale. In serverless, developers structure their applications as a collection of functions invoked by various events such as clicks, and cloud providers take responsibility for cloud infrastructure management. As with other cloud services, serverless deployments require responsiveness and performance predictability, manifested through low average and tail latencies. While average end-to-end latency has been extensively studied in prior work, existing papers lack a detailed characterization of tail latency and its root causes in real-world serverless scenarios. In response, we introduce STeLLAR, an open-source serverless benchmarking framework that enables an accurate performance characterization of serverless deployments. STeLLAR is provider-agnostic and highly configurable, allowing the analysis of both end-to-end and per-component performance with minimal instrumentation effort. Using STeLLAR, we study three leading serverless clouds and reveal that storage accesses and bursty function-invocation traffic are key factors impacting tail latency in modern serverless systems. Finally, we identify important factors that do not contribute to latency variability, such as the choice of language runtime.
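As a rough illustration of the kind of measurement such a benchmarking framework automates, the Python sketch below invokes a deployed function endpoint at a low, steady rate and reports median versus 99th-percentile latency. The endpoint URL, sample count, and pacing are placeholders, not STeLLAR's actual methodology.

```python
# Hedged sketch of an end-to-end latency measurement with tail
# percentiles; all parameters below are illustrative placeholders.
import time
import urllib.request
from statistics import quantiles

def measure(endpoint, n=200, delay_s=1.0):
    samples_ms = []
    for _ in range(n):
        start = time.perf_counter()
        urllib.request.urlopen(endpoint).read()      # one invocation
        samples_ms.append((time.perf_counter() - start) * 1000)
        time.sleep(delay_s)  # steady rate; bursty traffic stresses tails
    pct = quantiles(samples_ms, n=100)               # percentile cuts
    print(f"p50={pct[49]:.1f} ms  p99={pct[98]:.1f} ms")

measure("https://example.com/function-endpoint")     # placeholder URL
```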
Modern in-memory services rely on large distributed object stores to achieve the high scalability essential to serving thousands of requests concurrently. The independent and unpredictable nature of incoming requests results in random accesses to the object store, triggering frequent remote memory accesses. State-of-the-art distributed memory frameworks leverage the one-sided operations offered by RDMA technology to mitigate the traditionally high cost of remote memory access. Unfortunately, the limited semantics of RDMA one-sided operations bound remote memory access atomicity to a single cache block; therefore, atomic remote object access relies on software mechanisms. Emerging highly integrated rack-scale systems that reduce the latency of one-sided operations to a small multiple of DRAM latency expose the overhead of these software mechanisms as a major latency contributor. This technology-triggered paradigm shift calls for new one-sided operations with stronger semantics. We take a step in that direction by proposing SABRes, a new one-sided operation that provides atomic remote object reads in hardware. We then present LightSABRes, a lightweight hardware accelerator for SABRes that removes all atomicity-associated software overheads. Compared to a state-of-the-art software atomicity mechanism, LightSABRes improves the throughput of a microbenchmark atomically accessing 128B-8KB objects from remote memory by 15-97%, and the throughput of a modern in-memory distributed object store by 30-60%.
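For intuition about the software atomicity mechanisms that one-sided reads require today, the sketch below shows a seqlock-style, version-based validation scheme in Python. The object layout, version encoding, and retry policy are illustrative assumptions, not the paper's exact protocol.

```python
# Seqlock-style software atomicity check for one-sided reads: a writer
# bumps the object's version word to odd before mutating and back to
# even afterwards, so a reader can detect a torn or in-progress read.

def remote_read(mem, addr, size):
    # Stand-in for a one-sided RDMA read of mem[addr : addr + size).
    return bytes(mem[addr:addr + size])

def atomic_object_read(mem, obj_addr, obj_size, retries=10):
    """Optimistically read an object prefixed by an 8-byte version word;
    accept only if the version is even (no active writer) and unchanged
    across the read. This is the kind of validation SABRes moves into
    hardware."""
    for _ in range(retries):
        v1 = int.from_bytes(remote_read(mem, obj_addr, 8), "little")
        data = remote_read(mem, obj_addr + 8, obj_size)
        v2 = int.from_bytes(remote_read(mem, obj_addr, 8), "little")
        if v1 == v2 and v1 % 2 == 0:  # stable: no concurrent writer seen
            return data
    raise RuntimeError("object kept changing; fall back to locking")
```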
With mainstream technologies that tightly couple logic with memory on the horizon, near-memory processing has re-emerged as a promising approach to improving the performance and energy efficiency of data-centric computing. DRAM, however, is primarily designed for density and low cost, with a rigid internal organization that favors coarse-grain streaming rather than byte-level random access. This paper makes the case that treating DRAM as a block-oriented streaming device yields significant efficiency and performance benefits, which motivate algorithm/architecture co-design in favor of streaming access patterns, even at the price of higher algorithmic complexity. We present the Mondrian Data Engine, which drastically improves the runtime and energy efficiency of basic in-memory analytic operators, despite doing more work than traditional CPU-optimized algorithms, which rely heavily on random accesses and deep cache hierarchies.
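A toy example of the underlying trade-off: a sort-merge join does asymptotically more work than a hash join, yet after sorting it touches both inputs in one sequential pass, which is the access pattern a streaming memory device handles efficiently. The Python sketch below only contrasts the two access patterns; it does not model the Mondrian Data Engine itself.

```python
# Toy contrast: hash join does O(n) random probes into a hash table;
# sort-merge join does O(n log n) work but then streams sequentially.
from itertools import takewhile

def hash_join(r, s):
    index = {}                      # build side: random-access hash table
    for k, v in r:
        index.setdefault(k, []).append(v)
    return [(k, v, w) for k, w in s for v in index.get(k, [])]

def sort_merge_join(r, s):
    r, s = sorted(r), sorted(s)     # extra O(n log n) work up front
    out, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        if r[i][0] < s[j][0]:
            i += 1
        elif r[i][0] > s[j][0]:
            j += 1
        else:                       # equal keys: emit the cross product
            k = r[i][0]
            rs = [v for _, v in takewhile(lambda p: p[0] == k, r[i:])]
            ss = [w for _, w in takewhile(lambda p: p[0] == k, s[j:])]
            out.extend((k, v, w) for v in rs for w in ss)
            i += len(rs)
            j += len(ss)
    return out                      # both scans are purely sequential

assert hash_join([(1, "a")], [(1, "x")]) == sort_merge_join([(1, "a")], [(1, "x")])
```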