Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis 2017
DOI: 10.1145/3126908.3126939
Input-aware auto-tuning of compute-bound HPC kernels

Abstract: Efficient implementations of HPC applications for parallel architectures generally rely on external software packages (e.g., BLAS, LAPACK, CUDNN). While these libraries provide highly optimized routines for certain characteristics of inputs (e.g., square matrices), they generally do not retain optimal performance across the wide range of problems encountered in practice. In this paper, we present an input-aware auto-tuning framework for matrix multiplications and convolutions, ISAAC, which uses predictive mode…

Cited by 28 publications (23 citation statements)
References 18 publications
“…Table III summarizes the results of co-running Conv2DBackpropFilter and Conv2DBackpropInput. The input size for the operations is par_input (32,8,8,2048). Given this input size, the number of threads to achieve the best performance for the two operations is 68.…”

Section: Motivation Examples
confidence: 99%
“…To address the second challenge, we can replace some time-consuming operations (e.g. convolutions) in cuDNN with operations from an open-source library with better or comparable performance (e.g., ISAAC [32]) such that we can control intra-op parallelism at runtime.…”

Section: B. Future Work
confidence: 99%
“…Our approach is closely related to autotuning that searches for the best-performing optimization configuration [58], [59]. This technique is demonstrated to be effective for choosing algorithmic choices [60], tuning GPU code [61], [62], [63], optimizing structured parallel programs [64], [65], [66] and non-uniform memory access (NUMA) architectures [67], and more recently for deep neural networks [68]. Many of the prior works in this area employ an evolutionary-based approach by applying and profiling candidate optimization options to choose a good option to use.…”
Section: Domain-specific Optimizations
confidence: 99%
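The profile-and-select loop the quoted survey describes can be sketched as follows (a toy illustration, not any cited system's implementation; `autotune_threads` and the cost model are invented for this example): generate candidate configurations, "profile" each with a cost function, keep the best, and mutate them, as an evolutionary tuner would.

```python
import random

def autotune_threads(cost, candidates=range(1, 129), generations=3, seed=0):
    """Evolutionary-style search over thread counts.

    `cost` stands in for profiling a candidate on hardware; lower is better.
    """
    rng = random.Random(seed)
    pool = rng.sample(list(candidates), 8)   # initial random population
    for _ in range(generations):
        pool.sort(key=cost)                  # "profile" and rank candidates
        survivors = pool[:4]                 # keep the best half
        children = [max(1, t + rng.choice((-2, -1, 1, 2)))
                    for t in survivors]      # mutate the survivors
        pool = survivors + children
    return min(pool, key=cost)

# Toy cost model with a sweet spot at 68 threads, echoing the quoted example
# in which 68 threads gave the best performance for both convolution ops.
cost = lambda t: abs(t - 68)
```

The seed makes the search reproducible; a real tuner would replace `cost` with wall-clock measurements of the kernel under each candidate configuration.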
“…For the sake of generality, we take both implementations into account. Then, we implement and run Apollo's object detection module using NVIDIA's CUTLASS [9], an open-source collection of CUDA C++ template abstractions for implementing high-performance matrix-multiplication, and ISAAC [13], an input-aware auto-tuning framework and code-generator for compute-bound HPC kernels.…”
Section: Other Challenges
confidence: 99%