Abstract-Today's high-performance computing (HPC) systems are heavily instrumented, generating logs that record abnormal events (such as critical conditions, faults, errors, and failures), system resource utilization, and the resource usage of user applications. These logs, once fully analyzed and correlated, can yield detailed information about system health, the root causes of failures, and an application's interactions with the system, providing valuable insights to domain scientists and system administrators. However, processing HPC logs requires a deep understanding of hardware and software components at multiple layers of the system stack. Moreover, most log data is unstructured and voluminous, making it difficult for system users and administrators to inspect the data manually. With rapid increases in the scale and complexity of HPC systems, log data processing is becoming a big data challenge. This paper introduces an HPC log data analytics framework based on a distributed NoSQL database technology, which provides scalability and high availability, and on the Apache Spark framework for rapid in-memory processing of the log data. The analytics framework enables the extraction of a range of information about the system so that system administrators and end users alike can obtain the insights they need. We describe our experience using this framework to glean insights about system behavior from the log data of the Titan supercomputer at the Oak Ridge National Laboratory.
Users were asked to rate their satisfaction on a 5-point scale, where a score of 5 indicates "very satisfied" and a score of 1 indicates "very dissatisfied." The metrics were agreed on by the Department of Energy (DOE) and OLCF program manager, who defined 3.5/5.0 as satisfactory. Overall ratings for the OLCF were positive, with 96% of users responding that they were satisfied or very satisfied with the OLCF overall. Key indicators from the survey, including overall satisfaction, are shown in Table 1.3, summarized and presented by program respondents. The data show that satisfaction is similar across all allocation programs for the four key satisfaction indicators.
Table 1.7. Applications in the Center for Accelerated Application Readiness (CAAR). Columns: Application, Principal investigator, CAAR liaison, Scientific discipline.