Abstract. Modern scientific computing generates petabytes of data spread across billions of files, all of which must be managed. These files are usually organized by name in the hierarchical directory tree common to most file systems. As the scale of data has increased, this has proven to be a poor method of file organization. Recent tools allow users to navigate files by their metadata attributes, which provides a more meaningful organization. To make this metadata searchable, it is often stored on separate metadata servers. This solution has drawbacks, however, because many large-scale storage systems have a multi-tiered architecture. As data is modified or moved between storage tiers, the overhead of maintaining consistency between those tiers and the metadata server becomes very large. As scientific systems continue to push towards exascale, this problem will become more pronounced. A simpler option is to bypass the metadata server and use the metadata storage inherent to the file system. Few tools currently exist, however, for performing such operations at large scale. This paper introduces the prototype of Pantheon, a file system search tool designed to use the metadata storage within the file system itself, avoiding the overhead of metadata servers. Pantheon is also designed with the scientific community's push towards exascale computing in mind. It combines hierarchical partitioning, query optimization, and indexing to perform efficient metadata searches over large-scale file systems.
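As a rough illustration of what using the metadata storage inherent to the file system can mean in practice, the Python sketch below walks a directory tree and filters files on the attributes returned by stat. It is not Pantheon's implementation: the function name `search_metadata` and its predicate interface are hypothetical, and the sketch leaves out the hierarchical partitioning, query optimization, and indexing that the abstract describes.

```python
import os
from typing import Callable, Iterator


def search_metadata(
    root: str,
    predicate: Callable[[os.stat_result], bool],
) -> Iterator[str]:
    """Yield paths of regular files under `root` whose stat metadata
    satisfies `predicate`.

    Hypothetical helper for illustration only; it reads metadata directly
    from the file system instead of querying a separate metadata server.
    """
    stack = [root]  # explicit stack avoids recursion limits on deep trees
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
                    elif entry.is_file(follow_symlinks=False):
                        info = entry.stat(follow_symlinks=False)
                        if predicate(info):
                            yield entry.path
        except (PermissionError, FileNotFoundError):
            # Skip directories that vanish or cannot be read mid-scan.
            continue
```

A query such as "files larger than 1 MiB" would then be `search_metadata("/data", lambda st: st.st_size > 1 << 20)`, where `/data` is a placeholder path. A full scan like this touches every directory on every query, which is the cost that partitioning and indexing are meant to reduce.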
As the rate at which scientific computing generates data continues to increase, managing this data, in all its facets, is quickly becoming more challenging. In many facilities with large-scale storage needs, this massive amount of data is stored in distributed, multi-tiered storage systems. It has become imperative to support efficient and effective search within these environments. For some search problems, specifically search over file system metadata, traditional relational databases have been used with good initial results. As the scale of supercomputing has grown, however, it is becoming increasingly difficult for databases to keep pace with the volume of metadata that they manage. In this paper, we propose a new direction for database management techniques in the context of high performance computing, specifically search within ultrascale storage systems. Instead of using databases as a layer sitting above the storage system, we suggest moving database components into the storage system itself. By taking this approach, we aim to leverage the decades of research and tuning that have made relational database technology successful. At the same time, this integration lets us maintain a better view of the storage system for search optimization. Through this effort, we can position these techniques to scale to the degree required by the high performance computing community, both now and in the future.