Lianmin Zheng scite author profile

There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms, such as mobile phones, embedded devices, and accelerators (e.g., FPGAs, ASICs), requires significant manual effort. We propose TVM, a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends. TVM solves optimization challenges specific to deep learning, such as high-level operator fusion, mapping to arbitrary hardware primitives, and memory latency hiding. It also automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations. Experimental results show that TVM delivers performance across hardware back-ends that is competitive with state-of-the-art, hand-tuned libraries for low-power CPUs, mobile GPUs, and server-class GPUs. We also demonstrate TVM's ability to target new accelerator back-ends, such as the FPGA-based generic deep learning accelerator. The system is open sourced and in production use inside several major companies.

A Hardware–Software Blueprint for Flexible Deep Learning Specialization

Moreau

et al. 2019

Specialized Deep Learning (DL) acceleration stacks, designed for a specific set of frameworks, model architectures, operators, and data types, offer the allure of high performance while sacrificing flexibility. Changes in algorithms, models, operators, or numerical systems threaten the viability of specialized hardware accelerators. We propose VTA, a programmable deep learning architecture template designed to be extensible in the face of evolving workloads. VTA achieves this flexibility via a parametrizable architecture, a two-level ISA, and a JIT compiler. The two-level ISA is based on (1) a task-ISA that explicitly orchestrates concurrent compute and memory tasks and (2) a microcode-ISA that implements a wide variety of operators with single-cycle tensor-tensor operations. Next, we propose a runtime system equipped with a JIT compiler for flexible code generation and heterogeneous execution that enables effective use of the VTA architecture. VTA is integrated and open-sourced into Apache TVM, a state-of-the-art deep learning compilation stack that provides flexibility for diverse models and divergent hardware backends. We propose a flow that performs design space exploration to generate a customized hardware architecture and software operator library that can be leveraged by mainstream learning frameworks. We demonstrate our approach by deploying optimized deep learning models used for object classification and style transfer on edge-class FPGAs.

MAgent: A Many-Agent Reinforcement Learning Platform for Artificial Collective Intelligence

Zheng

Yang

Cai

et al. 2018

AAAI

We introduce MAgent, a platform to support research and development of many-agent reinforcement learning. Unlike previous research platforms on single or multi-agent reinforcement learning, MAgent focuses on supporting the tasks and the applications that require hundreds to millions of agents. Within the interactions among a population of agents, it enables not only the study of learning algorithms for agents' optimal policies, but more importantly, the observation and understanding of individual agents' behaviors and social phenomena emerging from the AI society, including communication languages, leadership, and altruism. MAgent is highly scalable and can host up to one million agents on a single GPU server. MAgent also provides flexible configurations for AI researchers to design their customized environments and agents. In this demo, we present three environments designed on MAgent and show emerged collective intelligence by learning from scratch.

Tunable High-Intensity Electron Bunch Train Production Based on Nonlinear Longitudinal Space Charge Oscillation

Zhang

Liu

et al. 2016

Phys. Rev. Lett.

High-intensity trains of electron bunches with tunable picosecond spacing are produced and measured experimentally with the goal of generating terahertz (THz) radiation. By imposing an initial density modulation on a relativistic electron beam and controlling the charge density over the beam propagation, density spikes of several-hundred-ampere peak current in the temporal profile, which are several times higher than the initial amplitudes, have been observed for the first time. We also demonstrate that the periodic spacing of the bunch train can be varied continuously either by tuning the launching phase of a radio-frequency gun or by tuning the compression of a downstream magnetic chicane. Narrow-band coherent THz radiation from the bunch train was also measured with μJ-level energies and a tunable central frequency of the spectrum in the range of ∼0.5 to 1.6 THz. Our results pave the way towards generating mJ-level narrow-band coherent THz radiation and driving high-gradient wakefield-based acceleration.

Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

Zheng

Li

Zhang

et al. 2022

Preprint

Alpa automates model-parallel training of large deep learning (DL) models by generating execution plans that unify data, operator, and pipeline parallelism. Existing model-parallel training systems either require users to manually create a parallelization plan or automatically generate one from a limited space of model parallelism configurations, which does not suffice to scale out complex DL models on distributed compute devices. Alpa distributes the training of large DL models by viewing parallelisms as two hierarchical levels: inter-operator and intra-operator parallelisms. Based on this view, Alpa constructs a new hierarchical space for massive model-parallel execution plans. Alpa designs a number of compilation passes to automatically derive the optimal parallel execution plan in each independent parallelism level and implements an efficient runtime to orchestrate the two-level parallel execution on distributed compute devices. Our evaluation shows Alpa generates parallelization plans that match or outperform hand-tuned model-parallel training systems even on models they are designed for. Unlike specialized systems, Alpa also generalizes to models with heterogeneous architectures and models without manually designed plans.

* Lianmin, Zhuohan, and Hao contributed equally. Part of the work was done when Lianmin interned at Amazon and Zhuohan interned at Google.


Lianmin Zheng

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
