We present MONARCH, a framework-agnostic storage middleware that transparently employs storage tiering to accelerate Deep Learning (DL) training. It leverages existing storage tiers of modern supercomputers (i.e., compute node's local storage and shared parallel file system (PFS)), while considering the I/O patterns of DL frameworks to improve data placement across tiers. MONARCH aims at accelerating DL training and decreasing the I/O pressure imposed over the PFS. We apply MONARCH to TensorFlow and PyTorch, while validating its performance and applicability under different models and dataset sizes. Results show that, even when the training dataset can only be partially stored at local storage, MONARCH reduces TensorFlow's and PyTorch's training time by up to 28% and 37% for I/O-intensive models, respectively. Furthermore, MONARCH decreases the number of I/O operations submitted to the PFS by up to 56%.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.