The development of machine learning models for electrocatalysts requires a broad set of training data to enable their use across a wide variety of materials. One class of materials that currently lacks sufficient training data is oxides, which are critical for the development of Oxygen Evolution Reaction (OER) catalysts. To address this, we developed the Open Catalyst 2022 (OC22) dataset, consisting of 62,331 Density Functional Theory (DFT) relaxations (∼9,854,504 single point calculations) across a range of oxide materials, coverages, and adsorbates. We define generalized total energy tasks that enable property prediction beyond adsorption energies; we test baseline performance of several graph neural networks; and we provide predefined dataset splits to establish clear benchmarks for future efforts. In the most general task, GemNet-OC sees a ∼36% improvement in energy predictions when combining the chemically dissimilar Open Catalyst 2020 Dataset (OC20) and OC22 datasets via fine-tuning. Similarly, we achieved a ∼19% improvement in total energy predictions on OC20 and a ∼9% improvement in force predictions on OC22 when using joint training. We demonstrate the practical utility of a top performing model by capturing literature adsorption energies and important OER scaling relationships. We expect OC22 to provide an important benchmark for models seeking to incorporate intricate long-range electrostatic and magnetic interactions in oxide surfaces. The dataset and baseline models are open sourced, and a public leaderboard is available to encourage continued community developments on the total energy tasks and data.
The development of machine-learned potentials for catalyst discovery has predominantly been focused on very specific chemistries and material compositions. While they are effective in interpolating between available materials, these approaches struggle to generalize across chemical space. The recent curation of large-scale catalyst data sets has offered the opportunity to build a universal machine-learning potential, spanning chemical and composition space. If accomplished, said potential could accelerate the catalyst discovery process across a variety of applications (CO2 reduction, NH3 production, etc.) without the additional specialized training efforts that are currently required. The release of the Open Catalyst 2020 Data set (OC20) has begun to do just that, pushing the heterogeneous catalysis and machine-learning communities toward building more accurate and robust models. In this Perspective, we discuss some of the challenges and findings of recent developments on OC20. We examine the performance of current models across different materials and adsorbates to identify notably underperforming subsets. We then discuss some of the modeling efforts surrounding energy conservation, approaches to finding and evaluating local minima, and augmentation of off-equilibrium data. To complement the community's ongoing developments, we end with an outlook on some of the important challenges that have yet to be thoroughly explored for large-scale catalyst discovery.
Recent advances in Graph Neural Networks (GNNs) have transformed the space of molecular and catalyst discovery. Even though the underlying physics across these domains remains the same, most prior work has focused on building domain-specific models for either small molecules or materials. However, building large datasets across domains is computationally expensive, so transfer learning (TL) is a promising but under-explored approach for generalizing to new domains. To evaluate this hypothesis, we take a model pre-trained on the Open Catalyst 2020 Dataset (OC20) and study its behavior when fine-tuned on a set of different datasets and tasks, including MD17, a CO adsorbate dataset, and other tasks within OC20. Through extensive TL experiments, we demonstrate that the initial layers of GNNs learn more general representations that hold across domains, whereas the final layers learn more task-specific features. Moreover, these well-known strategies give significant improvements over non-pretrained models for in-domain tasks, with gains of 53% for the CO dataset and 17% across OCP tasks, respectively. TL approaches also give up to a 4x speedup in model training, depending on the target data and task. However, TL does not perform well for MD17, giving worse performance than the non-pretrained model for a few molecules. Based on these observations, we propose TAAG, an attention-based approach that adapts to transfer important features from the interaction layers of GNNs. This method outperforms the best TL approach for MD17 and gives a mean improvement of 6% over the non-pretrained model.
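The layer-wise transfer behavior described above can be sketched as a toy fine-tuning step in which the early, transferable layers of a pretrained model are frozen and only the later, task-specific layers are updated. The layer names and scalar "weights" below are illustrative stand-ins, not the actual OC20 or TAAG model code:

```python
# Toy sketch of layer-wise fine-tuning: freeze early layers of a pretrained
# model (which hold transferable features) and update only later layers.
# Layer names and scalar weights are hypothetical, for illustration only.

pretrained = {"embed": 0.5, "interaction_1": 1.2, "interaction_2": -0.7, "output": 0.1}
frozen = {"embed", "interaction_1"}  # early layers: general, cross-domain features

def sgd_step(params, grads, lr=0.1):
    """One gradient step that skips frozen parameters."""
    return {
        name: (w if name in frozen else w - lr * grads[name])
        for name, w in params.items()
    }

grads = {name: 1.0 for name in pretrained}  # stand-in gradients
updated = sgd_step(pretrained, grads)
# Frozen layers keep their pretrained values; unfrozen layers move by -lr * grad.
```

In a real framework the same effect is achieved by disabling gradient tracking on the early layers before optimization; the sketch only shows the selective-update logic.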
Computational catalysis and machine learning communities have made considerable progress in developing machine learning models for catalyst discovery and design. Yet, a general machine learning potential that spans the chemical space of catalysis is still out of reach. A significant hurdle is obtaining access to training data across a wide range of materials. One important class of materials where data is lacking is oxides, which inhibits models from studying the Oxygen Evolution Reaction (OER) and oxide electrocatalysis more generally. To address this, we developed the Open Catalyst 2022 (OC22) dataset, consisting of 62,521 Density Functional Theory (DFT) relaxations (∼9,884,504 single point calculations) across a range of oxide materials, coverages, and adsorbates (*H, *O, *N, *C, *OOH, *OH, *OH2, *O2, *CO). We define generalized tasks to predict the total system energy that are applicable across catalysis, develop baseline performance of several graph neural networks (SchNet, DimeNet++, ForceNet, SpinConv, PaiNN, GemNet-dT, GemNet-OC), and provide pre-defined dataset splits to establish clear benchmarks for future efforts. For all tasks, we study whether combining datasets leads to better results, even if they contain different materials or adsorbates. Specifically, we jointly train models on the Open Catalyst 2020 Dataset (OC20) and OC22, or fine-tune pretrained OC20 models on OC22. In the most general task, GemNet-OC sees a ∼32% improvement in energy predictions through fine-tuning and a ∼9% improvement in force predictions via joint training. Surprisingly, joint training on both the OC20 and much smaller OC22 datasets also improves total energy predictions on OC20 by ∼19%. The dataset and baseline models are open sourced, and a public leaderboard will follow to encourage continued community developments on the total energy tasks and data.
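Because OC22's generalized tasks predict total system energies, derived quantities such as adsorption energies follow by subtracting slab and per-atom reference energies. A minimal sketch of that bookkeeping, with made-up total energies and hypothetical reference values (not the actual OC22 referencing scheme):

```python
# Hypothetical per-atom reference energies in eV; illustrative values only.
REF = {"H": -3.48, "O": -7.20}

def adsorption_energy(e_total, e_slab, adsorbate_counts, ref=REF):
    """Referenced adsorption energy from DFT total energies:
    E_ads = E_total - E_slab - sum_i n_i * E_ref_i
    where n_i counts each element in the adsorbate."""
    return e_total - e_slab - sum(n * ref[el] for el, n in adsorbate_counts.items())

# e.g. an *OH adsorbate on an oxide slab (made-up numbers for illustration)
e_ads = adsorption_energy(e_total=-312.74, e_slab=-301.90,
                          adsorbate_counts={"O": 1, "H": 1})
```

The point of predicting total energies directly is that a single model supports many such derived properties, rather than being tied to one adsorption-energy definition.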
Modeling the energy and forces of atomic systems is a fundamental problem in computational chemistry with the potential to help address many of the world's most pressing problems, including those related to energy scarcity and climate change. These calculations are traditionally performed using Density Functional Theory, which is computationally very expensive. Machine learning has the potential to dramatically improve the efficiency of these calculations from days or hours to seconds. We propose the Spherical Channel Network (SCN) to model atomic energies and forces. The SCN is a graph neural network where nodes represent atoms and edges connect neighboring atoms. The atom embeddings are a set of spherical functions, called spherical channels, represented using spherical harmonics. We demonstrate that, by rotating the embeddings based on the 3D edge orientation, more information may be utilized while maintaining the rotational equivariance of the messages. While equivariance is a desirable property, we find that by relaxing this constraint in both message passing and aggregation, improved accuracy may be achieved. We demonstrate state-of-the-art results on the large-scale Open Catalyst 2020 dataset in both energy and force prediction for numerous tasks and metrics.
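Rotational equivariance, the property SCN selectively relaxes, means that rotating the input coordinates rotates the predicted forces by the same rotation: F(Rx) = R F(x). A minimal numerical check of this property on a toy pairwise force model (pure Python, not the SCN architecture):

```python
import math

def matvec(R, v):
    """Apply a 3x3 rotation matrix to a 3-vector."""
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]

def pairwise_forces(positions, k=1.0):
    """Toy rotation-equivariant force field: harmonic attraction between all pairs."""
    forces = []
    for i, ri in enumerate(positions):
        f = [0.0, 0.0, 0.0]
        for j, rj in enumerate(positions):
            if j != i:
                for d in range(3):
                    f[d] += k * (rj[d] - ri[d])
        forces.append(f)
    return forces

# Rotation about the z-axis by 30 degrees.
t = math.radians(30)
R = [[math.cos(t), -math.sin(t), 0.0],
     [math.sin(t),  math.cos(t), 0.0],
     [0.0,          0.0,         1.0]]

pos = [[0.0, 0.0, 0.0], [1.0, 0.2, -0.3], [-0.5, 0.9, 0.4]]
f = pairwise_forces(pos)
f_rot = pairwise_forces([matvec(R, p) for p in pos])

# Equivariance check: F(R x) == R F(x) for every atom.
for fi, fri in zip(f, f_rot):
    assert all(abs(a - b) < 1e-9 for a, b in zip(matvec(R, fi), fri))
```

An exactly equivariant network satisfies this identity by construction; SCN's finding is that trading some of this exactness inside message passing and aggregation can improve accuracy.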