2019 IEEE International Conference on Cluster Computing (CLUSTER)
DOI: 10.1109/cluster.2019.8891023
Efficient User-Level Storage Disaggregation for Deep Learning

Cited by 30 publications (14 citation statements)
References 32 publications
“…Quiver [29] implements a distributed cache on cloud VMs' local SSD storage, optimizing for random ordering for hyperparameter tuning jobs. Similarly, DeepIO [52] and DLFS [53] leverage hardware support in the form of RDMA and NVMeOF to provide randomized minibatches from storage. DIESEL [47] co-designs storage and caching to provide efficient randomized minibatches for small files.…”
Section: Related Work (mentioning; confidence: 99%)
“…At Facebook, our scale presents a host of complex and novel challenges for ML data ingestion. While recent work has explored various isolated components of the data ingestion pipeline such as preprocessing [38], filesystems [53], reading [24, 52], or caching [29, 36], there is still relatively little understanding of the end-to-end challenges and requirements in industry-scale environments like ours. This paper presents an in-depth analysis of ML data ingestion requirements at scale, and how we architect and optimize our end-to-end data ingestion pipeline for these requirements.…”
Section: Introduction (mentioning; confidence: 99%)
“…Regarding the issue of I/O and storage for deep learning, both the HPC and deep learning communities have, so far, dedicated most efforts to accessing large training datasets efficiently [10], [11], [12], [13], while leaving the problem of optimized checkpointing of learning models largely ignored. TensorFlow checkpoints models to files in its SavedModel format, or in HDF5 files through Keras.…”
Section: Related Work (mentioning; confidence: 99%)
“…DNN model checkpointing: The problem of checkpointing DNN models efficiently is beginning to emerge in deep learning, where most efforts so far focus on efficient access of training batches [28], [29], [30], [31]. TensorFlow checkpoints models to files in its SavedModel format, or in HDF5 files through Keras.…”
Section: Background and Problem Formulation (mentioning; confidence: 99%)