Parallel applications running across thousands of processors must protect themselves from inevitable system failures. Many applications insulate themselves from failures by checkpointing. For many applications, checkpointing into a single shared file is most convenient. With such an approach, writes are often small and not aligned with file system boundaries. Unfortunately for these applications, this preferred data layout results in pathologically poor performance from the underlying file system, which is optimized for large, aligned writes to non-shared files. To address this fundamental mismatch, we have developed a virtual parallel log-structured file system, PLFS. PLFS remaps an application's preferred data layout into one that is optimized for the underlying file system. Through testing on PanFS, Lustre, and GPFS, we have seen that this layer of indirection and reorganization can reduce checkpoint time by an order of magnitude for several important benchmarks and real applications without any application modification.
Figure 1: Summary of our results. This graph summarizes our results, which will be explained in detail in Section 4. The key observation here is that our technique has improved checkpoint bandwidths for all seven studied benchmarks and applications by up to several orders of magnitude.
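The remapping idea can be illustrated with a toy model. The sketch below is not PLFS code and does not use its API; the class name `LogStructuredFile`, the `data.<writer_id>` file naming, and the in-memory index are all assumptions made for illustration. It shows only the general technique the abstract names: each writer appends to its own log, so small, unaligned writes to a shared logical file become sequential writes to non-shared files, while an index records where each logical byte range landed.

```python
import os
import tempfile


class LogStructuredFile:
    """Toy model of a shared logical file backed by per-writer append-only logs.

    Hypothetical illustration only; names and layout are assumptions.
    """

    def __init__(self, directory):
        self.directory = directory
        # Index entries in write order:
        # (logical_offset, length, writer_id, physical_offset)
        self.index = []

    def _log_path(self, writer_id):
        return os.path.join(self.directory, f"data.{writer_id}")

    def write(self, writer_id, logical_offset, data):
        # Every write is an append to the writer's private log,
        # regardless of where it falls in the logical file.
        with open(self._log_path(writer_id), "ab") as log:
            physical_offset = log.seek(0, os.SEEK_END)
            log.write(data)
        self.index.append((logical_offset, len(data), writer_id, physical_offset))

    def read(self, logical_offset, length):
        # Replay the index in write order so later writes win on overlap.
        # Ranges never written are returned as zero bytes.
        buf = bytearray(length)
        for off, n, writer_id, phys in self.index:
            start = max(off, logical_offset)
            end = min(off + n, logical_offset + length)
            if start >= end:
                continue  # this entry does not intersect the requested range
            with open(self._log_path(writer_id), "rb") as log:
                log.seek(phys + (start - off))
                chunk = log.read(end - start)
            buf[start - logical_offset:end - logical_offset] = chunk
        return bytes(buf)


def demo():
    with tempfile.TemporaryDirectory() as d:
        f = LogStructuredFile(d)
        # Two writers interleave small strided writes into one logical file.
        for i in range(4):
            f.write(writer_id=i % 2, logical_offset=i * 3, data=bytes([65 + i]) * 3)
        return f.read(0, 12)
```

Calling `demo()` reassembles the logical file from the two logs. In this sketch the cost of the reorganization is paid at read time, when the index must be consulted to locate each range.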
PSC has architected and delivered the TCS-1 machine, a Terascale Computing System for use in unclassified research. PSC has enhanced the effective usability and utilization of this resource by providing custom I/O solutions in four key areas: high-performance communication, high-performance file migration, checkpoint/recovery, and an updated hierarchical storage management system. These I/O solutions have a synergistic effect that is leveraged in their design, implementation, and integration. Each successive enhancement builds on its predecessors, thereby extracting the highest performance (e.g., multi-GB/sec file transfers) from the available hardware. This paper presents a technical overview of these solutions from design to integration to application.