2019
DOI: 10.1145/3331526

Scalable Deep Learning via I/O Analysis and Optimization

Abstract: Scalable deep neural network training has been gaining prominence because of the increasing importance of deep learning in a multitude of scientific and commercial domains. Consequently, a number of researchers have investigated techniques to optimize deep learning systems. Much of the prior work has focused on runtime and algorithmic enhancements to optimize the computation and communication. Despite these enhancements, however, deep learning systems still suffer from scalability limitations, particularly wit…

Cited by 34 publications (10 citation statements)
References 19 publications
“…For example, the lightweight Lightning Memory-Mapped Database (LMDB) maps content directly into memory (thus taking advantage of OS-level I/O optimizations) and uses B+trees to index it (thus reducing metadata overheads). However, Pumma et al. [17] have shown that this solution does not mitigate the problem sufficiently, as I/O overheads still dominate training (up to 90%) even for only a small degree of parallelism. Other approaches such as FanStore [26] provide a global cache layer on node-local burst buffers in a compressed format, allowing POSIX-compliant file access to the compressed data in user space.…”
Section: Related Work (mentioning)
confidence: 99%
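For readers unfamiliar with the access pattern described in the statement above, a minimal sketch of memory-mapped, read-only sample iteration with the py-lmdb bindings might look as follows. The database path and the decoding step are placeholder assumptions, not details from the cited works.

```python
# Sketch: iterating training samples stored in LMDB via memory-mapped reads.
# "train.lmdb" and decode_sample() are hypothetical placeholders.
import lmdb

env = lmdb.open("train.lmdb", readonly=True, lock=False, readahead=False)
with env.begin(buffers=True) as txn:       # buffers=True avoids copying values
    cursor = txn.cursor()
    for key, value in cursor:              # keys/values come straight from the mmap
        sample = bytes(value)              # decode_sample(sample) would go here
```

Because the values are served from the OS page cache, repeated epochs over a dataset that fits in memory avoid re-reading from disk; the B+tree index keeps per-sample lookup overhead small.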
“…In other words, it becomes an input-bound application if the I/O system does not keep up with the high computational performance. Previous studies have shown that I/O can account for as much as 90% of the total training time [6]. Unlike traditional HPC collective I/O, where processes rearrange I/O operations through a communicator to maximize bandwidth and minimize metadata operations, ML I/O uses an independent I/O strategy.…”
Section: Background and Motivation (mentioning)
confidence: 99%
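To make the independent-versus-collective distinction concrete, here is a hedged sketch using mpi4py's MPI-IO bindings. The file name, record size, and offsets are illustrative assumptions rather than anything taken from the cited paper.

```python
# Sketch: independent vs. collective reads of fixed-size records with MPI-IO.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
RECORD_BYTES = 4096                                  # hypothetical sample size
fh = MPI.File.Open(comm, "samples.bin", MPI.MODE_RDONLY)
buf = bytearray(RECORD_BYTES)

# Independent I/O (typical ML data loading): each rank reads whichever record
# its shuffled index points to, with no coordination among ranks.
my_offset = rank * RECORD_BYTES                      # stands in for a shuffled index
fh.Read_at(my_offset, buf)

# Collective I/O (traditional HPC): all ranks call together, letting the MPI
# library merge the requests into large, contiguous file accesses.
fh.Read_at_all(rank * RECORD_BYTES, buf)

fh.Close()
```

The independent pattern gives the data pipeline full freedom to shuffle, but forfeits the request aggregation that makes collective I/O bandwidth-efficient on parallel file systems.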
“…The high-paced data ingestion heavily stresses the I/O system. Previous works [6]–[9] have characterized I/O performance in large-scale ML workloads and shown that, without an efficient data preprocessing pipeline, ML workloads are highly input-bound.…”
Section: Introduction (mentioning)
confidence: 99%
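A simple way to check whether a training job is input-bound in the sense described above is to time how long each step waits on the input pipeline versus how long it computes. The sketch below assumes a generic `data_loader` iterable and a `train_step` callable; both names are hypothetical.

```python
# Sketch: measuring the fraction of step time spent waiting on input data.
import time

def profile_epoch(data_loader, train_step):
    io_time = compute_time = 0.0
    it = iter(data_loader)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)          # blocks while the input pipeline catches up
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)             # forward/backward/update
        t2 = time.perf_counter()
        io_time += t1 - t0
        compute_time += t2 - t1
    frac = io_time / (io_time + compute_time)
    print(f"I/O fraction of step time: {frac:.1%}")   # values near 90% indicate input-bound
```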
“…Our work attempts to reduce the storage bottleneck altogether, such that a single disk could potentially service many GPUs. A separate line of work shows that I/O is a significant bottleneck for certain tasks and proposes optimizing I/O via a set of deep-learning-specific optimizations to LMDB (Pumma et al., 2019). In contrast, our focus is more on data representation, which is agnostic of the storage system.…”
Section: Related Work (mentioning)
confidence: 99%