Deep learning recommendation models have grown to the terabyte scale. Traditional serving schemes, which load an entire model onto a single server, are unable to support this scale. One approach to supporting this scale is distributed serving, or distributed inference, which divides the memory requirements of a single large model across multiple servers. This work is a first step for the systems research community toward developing novel model-serving solutions, given the huge system design space. Large-scale deep recommender systems are a novel workload and vital to study, as they consume up to 79% of all inference cycles in the data center. To that end, this work describes and characterizes scale-out deep learning recommendation inference on data-center serving infrastructure. It specifically explores latency-bounded inference systems, in contrast to the throughput-oriented training systems of other recent works. We find that the latency and compute overheads of distributed inference are largely a result of a model's static embedding table distribution and the sparsity of input inference requests. We further evaluate three embedding table mapping strategies across three DLRM-like models and identify challenging design trade-offs in terms of end-to-end latency, compute overhead, and resource efficiency. Overall, we observe only marginal latency overhead when data-center-scale recommendation models are served in a distributed manner: P99 latency increases by only 1% in the best-case configuration. The latency overheads are largely a result of the commodity infrastructure used and the sparsity of the embedding tables. Even more encouragingly, we also show how distributed inference can yield efficiency improvements in data-center-scale recommendation serving.
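As a rough illustration of how an embedding-table mapping strategy divides a model's memory across servers, the sketch below greedily places the largest tables first onto the least-loaded server. The table names, sizes, and server count are hypothetical, and this greedy balancer is not one of the specific strategies evaluated in the work above.

# Illustrative sketch only: capacity-balanced placement of embedding tables
# across inference servers. All names and sizes below are hypothetical.

def shard_embedding_tables(table_sizes_gb, num_servers):
    """Map each embedding table to a server, placing the largest tables first."""
    load = [0.0] * num_servers            # memory (GB) currently assigned per server
    placement = {}
    for table, size in sorted(table_sizes_gb.items(), key=lambda kv: -kv[1]):
        target = min(range(num_servers), key=lambda s: load[s])  # least-loaded server
        placement[table] = target
        load[target] += size
    return placement, load

tables = {"user_id": 120.0, "item_id": 95.0, "category": 4.0, "geo": 1.5}
placement, load = shard_embedding_tables(tables, num_servers=3)
print(placement)   # {'user_id': 0, 'item_id': 1, 'category': 2, 'geo': 2}
print(load)        # per-server memory in GB: [120.0, 95.0, 5.5]

In practice the mapping also has to account for access frequency and request sparsity, since the abstract above observes that these, not just table size, drive the latency and compute overheads of distributed inference.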
As semiconductor technology scales towards ever-smaller transistor sizes, hardware fault rates are increasing. Since important application classes (e.g., multimedia, streaming workloads) are data-error-tolerant, recent research has proposed techniques that seek to save energy or improve yield by exploiting error tolerance at the architecture/microarchitecture level. Even seemingly error-tolerant applications, however, will crash or hang due to control-flow or memory-addressing errors. In parallel computation, errors involving inter-thread communication can have equally catastrophic effects. Our work explores techniques that mitigate the impact of potentially catastrophic errors in parallel computation, while still garnering power, cost, or yield benefits from data error tolerance. Our proposed CommGuard solution uses FSM-based checkers to pad and discard data in order to maintain semantic alignment between program control flow and the data communicated between processors. CommGuard techniques incur low overhead and exploit application information already provided by some parallel programming languages (e.g., StreamIt). By converting potentially catastrophic communication errors into potentially tolerable data errors, CommGuard allows important streaming applications like JPEG and MP3 decoding to execute without crashing and to sustain good output quality, even for errors as frequent as one every 500μs.
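To make the padding/discarding idea concrete, here is a minimal sketch of a checker that realigns one communication frame to a StreamIt-like static rate, assuming each downstream firing consumes a fixed number of items. The function name, the pad value, and the frame-based structure are illustrative assumptions, not CommGuard's actual FSM implementation.

# Illustrative sketch of the padding/discarding idea behind a stream checker,
# assuming a StreamIt-like static rate: each downstream firing consumes exactly
# `items_per_firing` items. Names and the pad value are hypothetical.

PAD_VALUE = 0  # neutral filler: a corrupted count becomes a data error, not a crash

def guard_frame(received_items, items_per_firing):
    """Realign one communication frame to the consumer's expected length."""
    if len(received_items) < items_per_firing:
        # Too few items arrived (e.g., a corrupted loop bound): pad the frame
        # so the consumer's control flow stays semantically aligned.
        return received_items + [PAD_VALUE] * (items_per_firing - len(received_items))
    # Too many items arrived: discard the excess rather than shifting later frames.
    return received_items[:items_per_firing]

print(guard_frame([3, 1, 4], items_per_firing=5))        # padded -> [3, 1, 4, 0, 0]
print(guard_frame([3, 1, 4, 1, 5, 9], items_per_firing=5))  # truncated -> [3, 1, 4, 1, 5]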