2020
DOI: 10.48550/arxiv.2007.06775
Preprint

Analyzing and Mitigating Data Stalls in DNN Training

Abstract: Training Deep Neural Networks (DNNs) is resource-intensive and time-consuming. While prior research has explored many different ways of reducing DNN training time spanning different layers of the system stack, the impact of input data pipeline, i.e., fetching raw data items from storage and performing data pre-processing in memory, has been relatively unexplored. This paper makes the following contributions: (1) We present the first comprehensive analysis of how the input data pipeline affects the training time…

Cited by 7 publications (11 citation statements)
References 39 publications
“…An input bottleneck occurs when the input pipeline is not able to generate batches of training examples as fast as the training computation can consume them. If the time spent waiting for the input pipeline exceeds tens of microseconds on average, the input pipeline is not keeping up with model training, causing a data stall (Mohan et al, 2020). The current practice of pipeline tuning, which optimizes the throughput (rate) of the pipeline, is explained below.…”
Section: Understanding Input Bottlenecks
Citation type: mentioning
confidence: 99%
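
The definition quoted above (time spent blocked waiting for the next batch, compared with time spent computing) can be illustrated with a rough timing loop. This is only a minimal sketch, not the measurement method of Mohan et al. or of the citing paper: `measure_stall`, `slow_loader`, and the sleep durations are hypothetical names and values chosen for illustration, and the loader here is synchronous with no prefetching.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_stall(batches: Iterable, train_step: Callable[[object], None]) -> Tuple[float, float]:
    """Return (seconds blocked waiting for input, seconds spent in compute)."""
    wait = 0.0
    compute = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # blocks until the input pipeline yields a batch
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)  # stand-in for the forward/backward pass
        t2 = time.perf_counter()
        wait += t1 - t0
        compute += t2 - t1
    return wait, compute


def slow_loader(n: int, delay: float) -> Iterator[int]:
    """Hypothetical input pipeline that needs `delay` seconds per batch."""
    for i in range(n):
        time.sleep(delay)
        yield i


# Assumed numbers: 20 ms to produce a batch, 5 ms to consume it.
wait, compute = measure_stall(slow_loader(5, 0.02), lambda batch: time.sleep(0.005))
stall_fraction = wait / (wait + compute)
print(f"wait={wait:.3f}s compute={compute:.3f}s stall fraction={stall_fraction:.0%}")
```

With these assumed delays the loop spends most of its time waiting on input, which is the situation the quoted passage calls a data stall. In a real pipeline with background workers or prefetching, the same timing around `next(it)` would capture only the time the consumer is actually blocked.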
“…Dataset Echoing (Choi et al, 2019) repeats input pipeline operations to match the rate of input pipeline with compute steps. DS-Analyzer predicts how much file cache memory is necessary to match the compute steps (Mohan et al, 2020). Progressive Compressed Records (Kuchnik et al, 2019) match compression levels to likewise minimize I/O.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
“…This pipeline overlaps with the training itself, aiming to improve resource utilization and hide the overhead of assembling the minibatches. However, despite this overlap, data loading and preprocessing often become a bottleneck [8,19,26,30,39], with reports of overheads of up to 72% of end-to-end training time [26,30]. With ever-increasing accumulation of training data, data loading is likely to become yet more costly, prompting the need for scalable solutions to mitigate these overheads.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…Storage capacity and bandwidth requirements (bytes moved from storage to compute device per image) also scale quadratically with image resolution, affecting the monetary cost (both storage and network usage are billed) of inference in real-world datacenter or cloud deployments where a separate storage cluster is usually used to store and forward input data through the network [21]. As a result, DNN training is frequently dominated by data stall time, which happens both remotely and locally, and can be due to CPU decoding overhead [20].…”
Section: Introduction
Citation type: mentioning
confidence: 99%
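
The quadratic scaling mentioned in the last statement can be checked with simple arithmetic. The sketch below assumes uncompressed, square, 8-bit RGB images, which is an illustrative simplification and not something the citing paper states; `raw_image_bytes` is a hypothetical helper.

```python
def raw_image_bytes(side: int, channels: int = 3, bytes_per_channel: int = 1) -> int:
    """Bytes for one uncompressed square image (assumed 8-bit RGB by default)."""
    return side * side * channels * bytes_per_channel


small = raw_image_bytes(224)
large = raw_image_bytes(448)
# Doubling the side length multiplies the pixel count, and so the bytes, by four.
ratio = large / small
print(small, large, ratio)
```

Compressed formats such as JPEG will not follow this ratio exactly, but the pixel count that must be decoded on the CPU still grows with the square of the side length.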