Catalyst discovery and optimization are key to solving many societal and energy challenges including solar fuel synthesis, long-term energy storage, and renewable fertilizer production. Despite considerable effort by the catalysis community to apply machine learning models to the computational catalyst discovery process, it remains an open challenge to build models that can generalize across both elemental compositions of surfaces and adsorbate identity/configurations, perhaps because datasets have been smaller in catalysis than in related fields. To address this, we developed the OC20 dataset, consisting of 1,281,040 density functional theory (DFT) relaxations (∼264,890,000 single-point evaluations) across a wide swath of materials, surfaces, and adsorbates (nitrogen, carbon, and oxygen chemistries). We supplemented this dataset with randomly perturbed structures, short timescale molecular dynamics, and electronic structure analyses. The dataset comprises three central tasks indicative of day-to-day catalyst modeling and comes with predefined train/validation/test splits to facilitate direct comparisons with future model development efforts. We applied three state-of-the-art graph neural network models (CGCNN, SchNet, and DimeNet++) to each of these tasks as baseline demonstrations for the community to build on. In almost every task, no upper limit on model size was identified, suggesting that even larger models are likely to improve on initial results. The dataset and baseline models are both provided as open resources, together with a public leaderboard, to encourage community contributions to solve these important tasks.
The surface energy of inorganic crystals is important in understanding experimentally relevant surface properties and designing materials for many applications. Predictive methods and data sets exist for surface energies of monometallic crystals. However, predicting these properties for bimetallic or more complicated surfaces is an open challenge. Computing cleavage energy is the first step in calculating surface energy across a large space. Here, we present a workflow to predict cleavage energies ab initio using high-throughput DFT and a machine learning framework. We calculated the cleavage energy of 3033 intermetallic alloys with combinations of 36 elements and 47 space groups. This high-throughput workflow was used to seed a database of cleavage energies. The database was used to train a crystal graph convolutional neural network (CGCNN). The CGCNN model provides an accurate prediction of cleavage energy with a mean absolute test error of 0.0071 eV/Å². It can also qualitatively reproduce nanoparticle surface distributions (Wulff constructions). Our workflow provides quantitative insights into unexplored chemical space by predicting which surfaces are relatively stable and therefore more realistic. The insights allow us to down-select interesting candidates that we can study with robust theoretical and experimental methods for applications such as catalyst screening and nanomaterials synthesis.
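To make the quantities in the abstract above concrete, the following is a minimal sketch of how a cleavage energy and a mean absolute error might be computed. It assumes the commonly used definition of cleavage energy as the slab energy minus the energy of the same number of bulk atoms, divided by twice the surface area (a slab exposes two surfaces); the abstract does not spell out the formula, so this is an assumption. The function names and all numbers are hypothetical and purely illustrative.

```python
def cleavage_energy(e_slab, e_bulk_per_atom, n_atoms, area):
    """Cleavage energy in eV/Å^2 (assumed definition, see lead-in).

    e_slab:          total energy of the slab (eV)
    e_bulk_per_atom: bulk energy per atom (eV)
    n_atoms:         number of atoms in the slab
    area:            area of one face of the slab (Å^2)
    """
    # The factor of 2 accounts for the top and bottom surfaces of the slab.
    return (e_slab - n_atoms * e_bulk_per_atom) / (2.0 * area)


def mean_absolute_error(predicted, reference):
    """Mean absolute error between two equal-length sequences."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)


# Illustrative numbers only, not taken from the paper.
example_energy = cleavage_energy(e_slab=-38.0, e_bulk_per_atom=-4.0,
                                 n_atoms=10, area=10.0)
example_mae = mean_absolute_error([0.10, 0.12], [0.11, 0.10])
```

In a real workflow the reference values would come from DFT and the predicted values from a trained model such as the CGCNN mentioned above; here both are made-up inputs.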
The rising application of informatics and data science tools for studying inorganic crystals and small molecules has revolutionized approaches to materials discovery and driven the development of accurate machine learning structure/property relationships. We discuss how informatics tools can accelerate research, and we present various combinations of workflows, databases, and surrogate models in the literature. This paradigm has been slower to infiltrate the catalysis community due to larger configuration spaces, difficulty in describing necessary calculations, and thermodynamic/kinetic quantities that require many interdependent calculations. We present our own informatics tool that uses dynamic dependency graphs to share, organize, and schedule calculations to enable new, flexible research workflows in surface science. This approach is illustrated for the large-scale screening of intermetallic surfaces for electrochemical catalyst activity. Similar approaches will be important to bring the benefits of informatics and data science to surface science research. Lastly, we provide our perspective on when to use these tools and considerations when creating them.
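The idea of scheduling calculations through a dependency graph, described in the abstract above, can be sketched with Python's standard-library `graphlib`. This is not the authors' tool: the task names, the `dependencies` mapping, and the `schedule` helper are hypothetical, chosen only to illustrate how interdependent surface-science calculations might be ordered.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks; each key maps to the set of tasks it depends on.
dependencies = {
    "bulk_relaxation": set(),
    "slab_generation": {"bulk_relaxation"},
    "bare_slab_relaxation": {"slab_generation"},
    "adsorbate_placement": {"slab_generation"},
    "adslab_relaxation": {"adsorbate_placement"},
    "adsorption_energy": {"bare_slab_relaxation", "adslab_relaxation"},
}


def schedule(graph, done=frozenset()):
    """Return the tasks still to run, in an order that respects dependencies.

    Tasks listed in `done` (e.g. results already stored in a database)
    are skipped. Raises graphlib.CycleError if the graph has a cycle.
    """
    order = TopologicalSorter(graph).static_order()
    return [task for task in order if task not in done]


plan = schedule(dependencies)
remaining = schedule(dependencies, done={"bulk_relaxation", "slab_generation"})
```

Skipping tasks that are already finished is what lets results be shared: a calculation requested by several downstream quantities only needs to run once.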
The development of machine-learned potentials for catalyst discovery has predominantly been focused on very specific chemistries and material compositions. While they are effective in interpolating between available materials, these approaches struggle to generalize across chemical space. The recent curation of large-scale catalyst data sets has offered the opportunity to build a universal machine-learning potential, spanning chemical and composition space. If accomplished, said potential could accelerate the catalyst discovery process across a variety of applications (CO₂ reduction, NH₃ production, etc.) without the additional specialized training efforts that are currently required. The release of the Open Catalyst 2020 Data set (OC20) has begun just that, pushing the heterogeneous catalysis and machine-learning communities toward building more accurate and robust models. In this Perspective, we discuss some of the challenges and findings of recent developments on OC20. We examine the performance of current models across different materials and adsorbates to identify notably underperforming subsets. We then discuss some of the modeling efforts surrounding energy conservation, approaches to finding and evaluating the local minima, and augmentation of off-equilibrium data. To complement the community’s ongoing developments, we end with an outlook to some of the important challenges that have yet to be thoroughly explored for large-scale catalyst discovery.
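The energy conservation point in the abstract above can be illustrated with a small sketch. A force field is energy-conserving when its forces are the negative gradient of a single energy function. The example below assumes a toy harmonic pair potential (not a model from the Perspective) and obtains forces by central finite differences; the function names and parameters are hypothetical.

```python
import math


def pair_energy(positions, k=1.0, r0=1.0):
    """Toy energy: harmonic springs between all atom pairs (illustrative only)."""
    energy = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            r = math.dist(positions[i], positions[j])
            energy += 0.5 * k * (r - r0) ** 2
    return energy


def forces_from_energy(energy_fn, positions, h=1e-5):
    """Forces as the negative gradient of energy_fn, by central differences."""
    forces = []
    for i, atom in enumerate(positions):
        atom_force = []
        for axis in range(len(atom)):
            # Displace one coordinate of one atom by +h and -h.
            plus = [list(p) for p in positions]
            minus = [list(p) for p in positions]
            plus[i][axis] += h
            minus[i][axis] -= h
            gradient = (energy_fn(plus) - energy_fn(minus)) / (2 * h)
            atom_force.append(-gradient)
        forces.append(atom_force)
    return forces


# Two atoms stretched beyond the equilibrium distance r0 = 1.0.
positions = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]]
forces = forces_from_energy(pair_energy, positions)
```

Because the forces derive from one energy, they pull the stretched pair together and sum to zero. A model that predicts forces directly, without differentiating an energy, carries no such guarantee.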
Scalable and cost-effective solutions to renewable energy storage are essential to addressing the world's rising energy needs while reducing climate change. As we increase our reliance on renewable energy sources such as wind and solar, which produce intermittent power, storage is needed to transfer power from times of peak generation to peak demand. This may require the storage of power for hours, days, or months. One solution that offers the potential of scaling to nation-sized grids is the conversion of renewable energy to other fuels, such as hydrogen or methane. To be widely adopted, this process requires cost-effective solutions to running electrochemical reactions. An open challenge is finding low-cost electrocatalysts to drive these reactions at high rates. Through the use of quantum mechanical simulations (density functional theory), new catalyst structures can be tested and evaluated. Unfortunately, the high computational cost of these simulations limits the number of structures that may be tested. The use of machine learning may provide a method to efficiently approximate these calculations, leading to new approaches in finding effective electrocatalysts. In this paper, we provide an introduction to the challenges in finding suitable electrocatalysts, how machine learning may be applied to the problem, and the use of the Open Catalyst Project OC20 dataset for model training.