Parallel applications running across thousands of processors must protect themselves from inevitable system failures. Many applications insulate themselves from failures by checkpointing. For many applications, checkpointing into a single shared file is most convenient. With such an approach, writes are often small and not aligned with file system boundaries. Unfortunately for these applications, this preferred data layout results in pathologically poor performance from the underlying file system, which is optimized for large, aligned writes to non-shared files. To address this fundamental mismatch, we have developed a virtual parallel log-structured file system, PLFS. PLFS remaps an application's preferred data layout into one that is optimized for the underlying file system. Through testing on PanFS, Lustre, and GPFS, we have seen that this layer of indirection and reorganization can reduce checkpoint time by an order of magnitude for several important benchmarks and real applications without any application modification.
Figure 1: Summary of our results. This graph summarizes our results, which are explained in detail in Section 4. The key observation is that our technique has improved checkpoint bandwidths for all seven studied benchmarks and applications by up to several orders of magnitude.
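The remapping idea in the PLFS abstract can be illustrated with a small in-memory sketch. It is not the PLFS implementation: the abstract gives no internal details, so the class and field names here (`LogStructuredFile`, `IndexEntry`, `seq`) are hypothetical, and the per-writer append-only log plus a shared index is only one plausible reading of "log-structured" remapping. Each writer appends its small, unaligned writes sequentially to its own log, and an index records where each piece belongs in the shared logical file so that reads can reassemble it.

```python
from dataclasses import dataclass, field


@dataclass
class IndexEntry:
    logical_offset: int  # where the bytes belong in the shared logical file
    length: int          # number of bytes written
    writer: int          # which per-writer log holds the bytes
    log_offset: int      # position of the bytes inside that log
    seq: int             # global write order, so later writes win on overlap


@dataclass
class LogStructuredFile:
    """Hypothetical sketch: one append-only log per writer plus an index."""
    logs: dict = field(default_factory=dict)   # writer id -> bytearray
    index: list = field(default_factory=list)  # list of IndexEntry

    def write(self, writer: int, logical_offset: int, data: bytes) -> None:
        # Every write becomes a sequential append to the writer's own log,
        # regardless of how small or unaligned the logical offset is.
        log = self.logs.setdefault(writer, bytearray())
        self.index.append(
            IndexEntry(logical_offset, len(data), writer, len(log), len(self.index))
        )
        log.extend(data)

    def read(self, offset: int, length: int) -> bytes:
        # Rebuild the logical view by replaying index entries in write order.
        out = bytearray(length)  # unwritten holes read back as zero bytes
        for e in sorted(self.index, key=lambda entry: entry.seq):
            start = max(offset, e.logical_offset)
            end = min(offset + length, e.logical_offset + e.length)
            if start < end:
                src = e.log_offset + (start - e.logical_offset)
                out[start - offset:end - offset] = self.logs[e.writer][src:src + (end - start)]
        return bytes(out)


# Two writers checkpoint into one logical file with a strided pattern.
f = LogStructuredFile()
f.write(0, 0, b"AA")
f.write(1, 2, b"BB")
f.write(0, 4, b"CC")
f.write(1, 6, b"DD")
logical_view = f.read(0, 8)
```

In this sketch the application still sees a single shared file through `read`, while the storage underneath receives only sequential appends to non-shared logs, which is the access pattern the abstract says the underlying file systems are optimized for.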
There is high demand for I/O tracing in High Performance Computing (HPC). It enables in-depth analysis of distributed applications and file system performance tuning. It also aids distributed application debugging. Finally, it facilitates collaboration within and between government, industrial, and academic institutions by enabling the generation of replayable I/O traces, which can be easily distributed and anonymized as necessary to protect confidential or sensitive information. In response to this demand for tracing tools, various means of I/O trace generation exist. We first survey the I/O Tracing Framework landscape, exploring three popular frameworks: LANLTrace [3], Tracefs [1], and //TRACE [2]. We next develop an I/O Tracing Framework taxonomy. The purpose of this taxonomy is to assist I/O Tracing Framework users in formalizing their tracing requirements, and to give developers of I/O Tracing Frameworks a language for categorizing the functionality and performance of their frameworks. The taxonomy categorizes I/O Tracing Framework features such as the type of data captured, trace replayability, and anonymization. The taxonomy also considers elapsed-time overhead and performance overhead. Finally, we provide a case study in the use of our new taxonomy, revisiting all three I/O Tracing Frameworks explored in our survey to formally classify the features of each.
Current, emerging, and future NVM (non-volatile memory) technologies give us hope that we will be able to architect HPC (high performance computing) systems that initially use them in a memory and storage hierarchy, and eventually use them as the memory and storage for the system, complete with the ownership and protections that an HDD-based (hard-disk-drive-based) file system provides today.