Codes are widely used in many engineering applications to offer robustness against noise. In large-scale systems there are several types of noise that can affect the performance of distributed machine learning algorithms (straggler nodes, system failures, or communication bottlenecks), but there has been little interaction cutting across codes, machine learning, and distributed systems. In this work, we provide theoretical insights on how coded solutions can achieve significant gains compared to uncoded ones. We focus on two of the most basic building blocks of distributed learning algorithms: matrix multiplication and data shuffling. For matrix multiplication, we use codes to alleviate the effect of stragglers, and show that if the number of homogeneous workers is n, and the runtime of each subtask has an exponential tail, coded computation can speed up distributed matrix multiplication by a factor of log n. For data shuffling, we use codes to reduce communication bottlenecks, exploiting the excess in storage. We show that when a constant fraction α of the data matrix can be cached at each worker, and n is the number of workers, coded shuffling reduces the communication cost by a factor of (α + 1/n)γ(n) compared to uncoded shuffling, where γ(n) is the ratio of the cost of unicasting n messages to n users to that of multicasting a common message (of the same size) to n users. For instance, γ(n) ≈ n if multicasting a message to n users is as cheap as unicasting a message to one user. We also provide experimental results corroborating the theoretical gains of the coded algorithms.
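To make the coded-computation idea above concrete, below is a minimal sketch (an illustration of the MDS-coding principle, not necessarily the paper's exact construction) of a distributed matrix-vector product protected by a (3, 2) MDS code: the results of any two of the three workers suffice to recover A·x, so the slowest worker can simply be ignored. The helper names encode and decode are hypothetical.

import numpy as np

# Sketch of (n, k) = (3, 2) MDS-coded matrix-vector multiplication A @ x.
# Split A into two row blocks A1, A2 and form a parity block A1 + A2.
# Three workers compute A1 @ x, A2 @ x, and (A1 + A2) @ x; the results of
# ANY two workers are enough to recover A @ x, so one straggler is tolerated.

def encode(A):
    """Return the three coded row blocks assigned to the workers."""
    A1, A2 = np.vsplit(A, 2)          # assumes an even number of rows
    return [A1, A2, A1 + A2]

def decode(results):
    """Recover A @ x from any two of {A1 x, A2 x, (A1 + A2) x}.

    `results` maps worker index (0, 1, 2) to its partial product.
    """
    if 0 in results and 1 in results:
        y1, y2 = results[0], results[1]
    elif 0 in results:                 # worker 1 straggled
        y1 = results[0]
        y2 = results[2] - results[0]
    else:                              # worker 0 straggled
        y2 = results[1]
        y1 = results[2] - results[1]
    return np.concatenate([y1, y2])

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
x = rng.standard_normal(4)

blocks = encode(A)
# Pretend worker 1 is a straggler: only workers 0 and 2 return in time.
partial = {i: blocks[i] @ x for i in (0, 2)}
assert np.allclose(decode(partial), A @ x)

Note that the encoding here is a plain linear combination of float blocks, in line with the paper's point that coding can be performed over the representation field of the input data rather than over a finite field.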
Fig. 1: Conceptual diagram of the phases of distributed computation. The algorithmic workflow of distributed (potentially iterative) tasks can be seen as receiving input data, storing it across distributed nodes, communicating data around the distributed network, and then locally computing a function at each distributed node. The main bottlenecks in this execution (communication, stragglers, system failures) can all be abstracted away by incorporating a notion of delays between these phases, denoted by the ∆ boxes.

…reduced communication cost for data shuffling in parallel machine learning algorithms. We show that when a constant fraction of the data matrix can be cached at each worker, and n is the number of workers, coded shuffling reduces the communication cost by a factor of Θ(γ(n)) compared to uncoded shuffling, where γ(n) is the ratio of the cost of unicasting n messages to n users to that of multicasting a common message (of the same size) to n users. For instance, γ(n) ≈ n if multicasting a message to n users is as cheap as unicasting a message to one user. We would like to remark that a major innovation of our coding solutions is that they are woven into the fabric of the algorithmic design, and coding/decoding is performed over the representation field of the input data (e.g., floats or doubles). In sharp contrast to most coding ap...
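The remark about coding directly over floats can be illustrated with a toy two-worker shuffle (a minimal sketch of the multicast-with-cancellation idea, not the paper's general scheme; the block shapes are made up): instead of unicasting two blocks, the master multicasts their sum, and each worker subtracts the block it already caches.

import numpy as np

# Two-worker sketch of coded shuffling over floats.  Worker 1 has block B
# cached but needs A for the next epoch; worker 2 has A cached but needs B.
# Rather than unicasting A and B separately (2 transmissions), the master
# multicasts the single coded block A + B, and each worker cancels what it
# already holds.

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))   # data block needed by worker 1
B = rng.standard_normal((4, 3))   # data block needed by worker 2

coded = A + B                     # one multicast instead of two unicasts

A_at_worker1 = coded - B          # worker 1 subtracts its cached block B
B_at_worker2 = coded - A          # worker 2 subtracts its cached block A

assert np.allclose(A_at_worker1, A)
assert np.allclose(B_at_worker2, B)

With n workers and a cached fraction α of the data, the same cache-and-cancel idea lets a single multicast serve many workers at once, which is the intuition behind the (α + 1/n)γ(n) reduction stated in the abstract.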
Astronomical catalogs derived from wide-field imaging surveys are an important tool for understanding the Universe. We construct an astronomical catalog from 55 TB of imaging data using Celeste, a Bayesian variational inference code written entirely in the high-productivity programming language Julia. Using over 1.3 million threads on 650,000 Intel Xeon Phi cores of the Cori Phase II supercomputer, Celeste achieves a peak rate of 1.54 DP PFLOP/s. Celeste is able to jointly optimize parameters for 188M stars and galaxies, loading and processing 178 TB across 8192 nodes in 14.6 minutes. To achieve this, Celeste exploits parallelism at multiple levels (cluster, node, and thread) and accelerates I/O through Cori's Burst Buffer. Julia's native performance enables Celeste to employ high-level constructs without resorting to handwritten or generated low-level code (C/C++/Fortran), and yet achieve petascale performance.
Broadening access to both computational and educational resources is critical to diffusing machine learning (ML) innovation. However, today, most ML resources and experts are siloed in a few countries and organizations. In this article, we describe our pedagogical approach to increasing access to applied ML through a massive open online course (MOOC) on Tiny Machine Learning (TinyML). We suggest that TinyML, applied ML on resource-constrained embedded devices, is an attractive means to widen access because TinyML leverages low-cost and globally accessible hardware and encourages the development of complete, self-contained applications, from data collection to deployment. To this end, a collaboration between academia and industry produced a four-part MOOC that provides application-oriented instruction on how to develop solutions using TinyML. The series is openly available on the edX MOOC platform, has no prerequisites beyond basic programming, and is designed for global learners from a variety of backgrounds. It introduces real-world applications, ML algorithms, data-set engineering, and the ethical considerations of these technologies through hands-on programming and deployment of TinyML applications both in the cloud and on learners' own microcontrollers. To facilitate continued learning, community building, and collaboration beyond the courses, we launched a standalone website, a forum, a chat, and an optional course-project competition. We also open-sourced the course materials, hoping they will inspire the next generation of ML practitioners and educators and further broaden access to cutting-edge ML technologies.