As a core component of high-performance computing (HPC) platforms, parallel file systems (PFSes) are growing quickly in scale and complexity, which makes them vulnerable to various failures and anomalies. Identifying PFS anomalies at runtime is thus critically helpful for HPC users and administrators. Analyzing runtime logs to detect anomalies in large-scale systems has proven effective in many recent studies. However, applying existing log analysis to PFSes faces significant challenges due to the large volume and irregularity of PFS logs. This study proposes SentiLog, a new approach to analyzing PFS logs for detecting anomalies. Unlike existing solutions, SentiLog works by training a general natural language sentiment model on the logging-relevant source code collected from a set of PFSes. In this way, SentiLog learns the implicit semantic information that developers embed in PFSes. Our preliminary results show that SentiLog can accurately predict anomalies and performs better than state-of-the-art log analysis solutions on two representative PFSes (i.e., Lustre and BeeGFS). To the best of our knowledge, this is the first work demonstrating that sentiment analysis could be a promising method for analyzing complex and irregular system logs.

CCS CONCEPTS: • Software and its engineering → Software maintenance tools; • Computing methodologies → Natural language processing; • Computer systems organization → Dependable and fault-tolerant systems and networks.
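The idea of learning sentiment from logging statements in source code and then applying it to runtime logs can be illustrated with a deliberately small sketch. It is not SentiLog's model: it substitutes a bag-of-words naive Bayes classifier, and the training messages, the labels, and the names `TRAINING`, `train`, and `classify` are all hypothetical. The assumption is that each message template is labeled by the severity of the logging call it came from.

```python
import math
import re
from collections import Counter

# Hypothetical (message template, label) pairs, assumed to be extracted from
# logging calls in PFS source code and labeled by the call's severity.
TRAINING = [
    ("failed to connect to target, error %d", "negative"),
    ("request timed out, resending", "negative"),
    ("cannot allocate memory for buffer", "negative"),
    ("connection lost, node unreachable", "negative"),
    ("connection established with server", "positive"),
    ("recovery completed successfully", "positive"),
    ("mounted file system on client", "positive"),
    ("request completed, status ok", "positive"),
]


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def train(pairs):
    """Count words per label and documents per label."""
    word_counts = {"negative": Counter(), "positive": Counter()}
    doc_counts = Counter()
    for text, label in pairs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["negative"]) | set(word_counts["positive"])
    return word_counts, doc_counts, vocab


def classify(model, text):
    """Return the label with the highest naive Bayes log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen in the source code
            # Laplace smoothing keeps unseen (label, word) pairs non-zero.
            score += math.log((counts[token] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAINING)
runtime_line = "write failed, connection lost to node"
print(runtime_line, "->", classify(model, runtime_line))
```

Because the classifier is trained on message templates rather than on runtime logs, it needs no labeled log data from the target system, which is the property the abstract emphasizes.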
Large-scale parallel file systems (PFSes) play an essential role in high-performance computing (HPC). However, despite their importance, their reliability is much less studied and understood than that of local storage systems or cloud storage systems. Recent failure incidents at real HPC centers have exposed latent defects in PFS clusters as well as the urgent need for a systematic analysis. To address this challenge, we study the failure recovery and logging mechanisms of PFSes in this paper. First, to trigger the failure recovery and logging operations of the target PFS, we introduce a black-box fault injection tool called PFault, which is transparent to PFSes and easy to deploy in practice. PFault emulates the failure state of individual storage nodes in the PFS based on a set of pre-defined fault models, and enables systematic examination of PFS behavior under faults. Next, we apply PFault to study two widely used PFSes: Lustre and BeeGFS. Our analysis reveals the unique failure recovery and logging patterns of the target PFSes, and identifies multiple cases where the PFSes are imperfect in terms of failure handling. For example, Lustre includes a recovery component called LFSCK to detect and fix PFS-level inconsistencies, but we find that LFSCK itself may hang or trigger kernel panics when scanning a corrupted Lustre file system. Even after the recovery attempt of LFSCK, subsequent workloads applied to Lustre may still behave abnormally (e.g., hang or report I/O errors). Similar issues have also been observed in BeeGFS and its recovery component BeeGFS-FSCK. We analyze in depth the root causes of the observed abnormal symptoms, which has led to a new patch set to be merged into the upcoming Lustre release. In addition, we characterize in detail the extensive logs generated in the experiments, and identify the unique patterns and limitations of PFSes in terms of failure logging.
We hope this study and the resulting tool and dataset can facilitate follow-up research in the community and help improve PFSes for reliable high-performance computing.
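The notion of emulating the failure state of an individual storage node from a pre-defined fault model can be sketched with a toy in-memory cluster. Nothing here reflects PFault's actual implementation or its real fault models: `StorageNode`, the three fault functions, `inject`, `check`, and `build_cluster` are hypothetical names chosen for illustration, and the "cluster" is just a dictionary of Python objects.

```python
from dataclasses import dataclass, field


@dataclass
class StorageNode:
    """Toy stand-in for one storage server (hypothetical, not a PFault type)."""
    name: str
    objects: dict = field(default_factory=dict)  # object id -> bytes
    reachable: bool = True


# Hypothetical fault models; each one mutates the state of a single node.
def device_failure(node):
    node.objects.clear()  # all data on the node is lost


def network_partition(node):
    node.reachable = False  # data is intact but cannot be reached


def data_corruption(node):
    for obj_id in node.objects:
        node.objects[obj_id] = b"\x00" * len(node.objects[obj_id])


FAULT_MODELS = {
    "device_failure": device_failure,
    "network_partition": network_partition,
    "data_corruption": data_corruption,
}


def inject(cluster, node_name, fault_name):
    """Apply one named fault model to one node of the cluster."""
    FAULT_MODELS[fault_name](cluster[node_name])


def check(cluster, layout, expected):
    """Report every stripe that can no longer be read back intact."""
    problems = []
    for obj_id, node_name in sorted(layout.items()):
        node = cluster[node_name]
        if not node.reachable:
            problems.append((obj_id, "unreachable"))
        elif obj_id not in node.objects:
            problems.append((obj_id, "missing"))
        elif node.objects[obj_id] != expected[obj_id]:
            problems.append((obj_id, "corrupted"))
    return problems


def build_cluster():
    """Create two nodes holding one stripe each of a single file."""
    cluster = {name: StorageNode(name) for name in ("oss0", "oss1")}
    layout = {"file.s0": "oss0", "file.s1": "oss1"}
    expected = {"file.s0": b"abcd", "file.s1": b"efgh"}
    for obj_id, node_name in layout.items():
        cluster[node_name].objects[obj_id] = expected[obj_id]
    return cluster, layout, expected


cluster, layout, expected = build_cluster()
inject(cluster, "oss1", "device_failure")
print(check(cluster, layout, expected))
```

In this sketch the faults are applied to node state from the outside and the checker only observes the result, which loosely mirrors the black-box stance described in the abstract; a real tool would act on actual storage devices and network links rather than Python objects.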