2019
DOI: 10.1145/3331526

Scalable Deep Learning via I/O Analysis and Optimization

Abstract: Scalable deep neural network training has been gaining prominence because of the increasing importance of deep learning in a multitude of scientific and commercial domains. Consequently, a number of researchers have investigated techniques to optimize deep learning systems. Much of the prior work has focused on runtime and algorithmic enhancements to optimize the computation and communication. Despite these enhancements, however, deep learning systems still suffer from scalability limitations, particularly wit…

Cited by 34 publications (10 citation statements)
References 19 publications
“…For example, the lightweight Lightning Memory-Mapped Database (LMDB) maps content directly into memory (thus taking advantage of OS-level I/O optimizations) and uses B+trees to index it (thus reducing metadata overheads). However, Pumma et al. [17] have shown that this solution does not mitigate the problem sufficiently, as I/O overheads still dominate training (up to 90%) even for only a small degree of parallelism. Other approaches such as FanStore [26] provide a global cache layer on node-local burst buffers in a compressed format, allowing POSIX-compliant file access to the compressed data in user space.…”
Section: Related Work (mentioning)
confidence: 99%
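For readers unfamiliar with the access pattern described in the statement above, a minimal sketch of memory-mapped, read-only sample iteration with the py-lmdb bindings might look as follows. The database path and the decoding step are placeholder assumptions, not details from the cited works.

```python
# Sketch: iterating training samples stored in LMDB via memory-mapped reads.
# "train.lmdb" and decode_sample() are hypothetical placeholders.
import lmdb

env = lmdb.open("train.lmdb", readonly=True, lock=False, readahead=False)
with env.begin(buffers=True) as txn:       # buffers=True avoids copying values
    cursor = txn.cursor()
    for key, value in cursor:              # keys/values come straight from the mmap
        sample = bytes(value)              # decode_sample(sample) would go here
```

Because the values are served from the OS page cache, repeated epochs over a dataset that fits in memory avoid re-reading from disk; the B+tree index keeps per-sample lookup overhead small.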
“…In other words, it becomes an input-bound application if the I/O system does not keep up with the high computational performance. Previous studies have shown that I/O can account for as much as 90% of the total training time [6]. Unlike traditional HPC collective I/O, where processes rearrange I/O operations through a communicator to maximize bandwidth and minimize metadata operations, ML I/O uses an independent I/O strategy.…”
Section: Background and Motivation (mentioning)
confidence: 99%
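To make the independent-versus-collective distinction concrete, here is a hedged sketch using mpi4py's MPI-IO bindings. The file name, record size, and offsets are illustrative assumptions rather than anything taken from the cited paper.

```python
# Sketch: independent vs. collective reads of fixed-size records with MPI-IO.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
RECORD_BYTES = 4096                                  # hypothetical sample size
fh = MPI.File.Open(comm, "samples.bin", MPI.MODE_RDONLY)
buf = bytearray(RECORD_BYTES)

# Independent I/O (typical ML data loading): each rank reads whichever record
# its shuffled index points to, with no coordination among ranks.
my_offset = rank * RECORD_BYTES                      # stands in for a shuffled index
fh.Read_at(my_offset, buf)

# Collective I/O (traditional HPC): all ranks call together, letting the MPI
# library merge the requests into large, contiguous file accesses.
fh.Read_at_all(rank * RECORD_BYTES, buf)

fh.Close()
```

The independent pattern gives the data pipeline full freedom to shuffle, but forfeits the request aggregation that makes collective I/O bandwidth-efficient on parallel file systems.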
“…The high-paced data ingestion heavily stresses the I/O system. Previous works [6]–[9] have characterized I/O performance in large-scale ML workloads and shown that, without an efficient data preprocessing pipeline, ML workloads are highly input-bound.…”
Section: Introduction (mentioning)
confidence: 99%
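A simple way to check whether a training job is input-bound in the sense described above is to time how long each step waits on the input pipeline versus how long it computes. The sketch below assumes a generic `data_loader` iterable and a `train_step` callable; both names are hypothetical.

```python
# Sketch: measuring the fraction of step time spent waiting on input data.
import time

def profile_epoch(data_loader, train_step):
    io_time = compute_time = 0.0
    it = iter(data_loader)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)          # blocks while the input pipeline catches up
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)             # forward/backward/update
        t2 = time.perf_counter()
        io_time += t1 - t0
        compute_time += t2 - t1
    frac = io_time / (io_time + compute_time)
    print(f"I/O fraction of step time: {frac:.1%}")   # values near 90% indicate input-bound
```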
“…Our work attempts to reduce the storage bottleneck altogether, such that a single disk could potentially service many GPUs. A separate line of work shows that I/O is a significant bottleneck for certain tasks and proposes optimizing I/O via a set of deep-learning-specific optimizations to LMDB (Pumma et al., 2019). In contrast, our focus is more on data representation, which is agnostic of the storage system.…”
Section: Related Work (mentioning)
confidence: 99%