Catalyst discovery and optimization are key to solving many societal and energy challenges, including solar fuel synthesis, long-term energy storage, and renewable fertilizer production. Despite considerable effort by the catalysis community to apply machine learning models to the computational catalyst discovery process, it remains an open challenge to build models that can generalize across both the elemental compositions of surfaces and adsorbate identities/configurations, perhaps because datasets in catalysis have been smaller than those in related fields. To address this, we developed the OC20 dataset, consisting of 1,281,040 density functional theory (DFT) relaxations (∼264,890,000 single-point evaluations) across a wide swath of materials, surfaces, and adsorbates (nitrogen, carbon, and oxygen chemistries). We supplemented this dataset with randomly perturbed structures, short-timescale molecular dynamics, and electronic structure analyses. The dataset comprises three central tasks indicative of day-to-day catalyst modeling and comes with predefined train/validation/test splits to facilitate direct comparisons with future model development efforts. We applied three state-of-the-art graph neural network models (CGCNN, SchNet, and DimeNet++) to each of these tasks as baseline demonstrations for the community to build on. In almost every task, no upper limit on model size was identified, suggesting that even larger models are likely to improve on the initial results. The dataset and baseline models are provided as open resources, along with a public leaderboard, to encourage community contributions to solve these important tasks.
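To make the three central tasks concrete, the sketch below shows how a single DFT relaxation trajectory, read with ASE, yields training pairs for the structure-to-energy-and-forces (S2EF), initial-structure-to-relaxed-energy (IS2RE), and initial-structure-to-relaxed-structure (IS2RS) tasks defined in OC20. The file name is hypothetical, and the snippet assumes the trajectory file stores the DFT energies and forces alongside each frame.

```python
# A hedged sketch of how one relaxation trajectory feeds the OC20-style tasks.
from ase.io import read

# Hypothetical trajectory file; assumes energies/forces were stored per frame.
frames = read("relaxation.traj", index=":")

# S2EF: structure -> energy and forces, one training sample per frame.
s2ef = [(atoms, atoms.get_potential_energy(), atoms.get_forces())
        for atoms in frames]

# IS2RE: initial structure -> relaxed (final-frame) energy.
is2re = (frames[0], frames[-1].get_potential_energy())

# IS2RS: initial structure -> relaxed atomic positions.
is2rs = (frames[0], frames[-1].get_positions())
```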
The development of machine learning models for electrocatalysts requires a broad set of training data to enable their use across a wide variety of materials. One class of materials that currently lacks sufficient training data is oxides, which are critical for the development of Oxygen Evolution Reaction (OER) catalysts. To address this, we developed the Open Catalyst 2022 (OC22) dataset, consisting of 62,331 density functional theory (DFT) relaxations (∼9,854,504 single-point calculations) across a range of oxide materials, coverages, and adsorbates. We define generalized total energy tasks that enable property prediction beyond adsorption energies; we test the baseline performance of several graph neural networks; and we provide predefined dataset splits to establish clear benchmarks for future efforts. In the most general task, GemNet-OC sees a ∼36% improvement in energy predictions when the chemically dissimilar Open Catalyst 2020 (OC20) and OC22 datasets are combined via fine-tuning. Similarly, we achieved a ∼19% improvement in total energy predictions on OC20 and a ∼9% improvement in force predictions on OC22 when using joint training. We demonstrate the practical utility of a top-performing model by capturing literature adsorption energies and important OER scaling relationships. We expect OC22 to provide an important benchmark for models seeking to incorporate intricate long-range electrostatic and magnetic interactions in oxide surfaces. The dataset and baseline models are open sourced, and a public leaderboard is available to encourage continued community development on the total energy tasks and data.
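As an illustration of the joint-training recipe, the following minimal PyTorch sketch draws batches from two datasets so a single energy model trains on the combined data. Random tensors stand in for featurized OC20 and OC22 structures; this is a schematic of the idea, not the GemNet-OC pipeline.

```python
# A minimal joint-training sketch: batches are drawn from both datasets so the
# model sees samples from each distribution within every epoch.
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Stand-ins for featurized structures; each sample is (features, total energy).
oc20_like = TensorDataset(torch.randn(512, 16), torch.randn(512, 1))
oc22_like = TensorDataset(torch.randn(256, 16), torch.randn(256, 1))

loader = DataLoader(ConcatDataset([oc20_like, oc22_like]),
                    batch_size=32, shuffle=True)

model = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.SiLU(),
                            torch.nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for features, energy in loader:  # one joint-training epoch
    loss = torch.nn.functional.l1_loss(model(features), energy)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```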
Machine learning surrogate models for quantum mechanical simulations have enabled the field to study material and molecular systems efficiently and accurately. Such models typically rely either on a substantial amount of data to make reliable predictions of the potential energy landscape or on careful active learning (AL) with uncertainty estimates. When starting from small datasets, convergence of AL approaches is a major outstanding challenge that has limited most demonstrations to online AL. In this work we demonstrate a Δ-machine learning (ML) approach that enables stable convergence in offline AL strategies by avoiding unphysical configurations, with initial datasets as small as a single data point. We demonstrate our framework's capabilities on a structural relaxation, a transition-state calculation, and a molecular dynamics simulation, cutting the number of first-principles calculations by 70%–90%. The approach is incorporated into and developed alongside AMPtorch, an open-source ML potential package, together with interactive Google Colab notebook examples.
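The following is a minimal sketch of the Δ-ML idea itself, not the AMPtorch implementation: a cheap baseline potential (ASE's EMT here) supplies a physically reasonable energy surface, and a regressor is trained only on the residual between the baseline and the reference method. The "DFT" energies below are placeholders (a constant per-atom shift), and the descriptor is purely illustrative.

```python
# A hedged Δ-ML sketch: learn only the residual E_DFT - E_baseline, then
# predict E = E_baseline + correction, so the model never starts from zero.
import numpy as np
from ase.build import bulk
from ase.calculators.emt import EMT  # stand-in for a cheap baseline potential
from sklearn.gaussian_process import GaussianProcessRegressor

def descriptor(atoms):
    # Toy global descriptor: sorted interatomic distances (illustrative only).
    d = atoms.get_all_distances(mic=True)
    return np.sort(d[np.triu_indices(len(atoms), k=1)])

X, y_delta = [], []
for eps in (-0.02, 0.0, 0.02, 0.04):
    atoms = bulk("Cu", "fcc", a=3.6 * (1 + eps)).repeat((2, 2, 2))
    atoms.calc = EMT()
    e_baseline = atoms.get_potential_energy()
    e_dft = e_baseline + 0.1 * len(atoms)  # placeholder for a real DFT energy
    X.append(descriptor(atoms))
    y_delta.append(e_dft - e_baseline)     # the Δ-ML target: the residual

gp = GaussianProcessRegressor().fit(np.array(X), np.array(y_delta))

# Prediction: baseline energy plus the learned correction.
test = bulk("Cu", "fcc", a=3.65).repeat((2, 2, 2))
test.calc = EMT()
e_pred = test.get_potential_energy() + gp.predict([descriptor(test)])[0]
print(f"Δ-ML energy estimate: {e_pred:.3f} eV")
```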
Recent advances in Graph Neural Networks (GNNs) have transformed the space of molecular and catalyst discovery. Even though the underlying physics across these domains remains the same, most prior work has focused on building domain-specific models for either small molecules or materials. Because building large datasets in every domain is computationally expensive, transfer learning (TL) is a promising but under-explored way to generalize across domains. To evaluate this hypothesis, we take a model pre-trained on the Open Catalyst Dataset (OC20) and study its behavior when fine-tuned on a set of different datasets and tasks: MD17, a CO adsorbate dataset, and other tasks within OC20. Through extensive TL experiments, we demonstrate that the initial layers of GNNs learn more basic representations that hold across domains, whereas the final layers learn more task-specific features. Moreover, these well-known strategies yield significant improvements over non-pretrained models for in-domain tasks, with improvements of 53% on the CO dataset and 17% across OC20 tasks, respectively. TL also gives up to a 4x speedup in model training, depending on the target data and task. However, it does not perform well for MD17, giving worse performance than non-pretrained models for some molecules. Based on these observations, we propose TAAG, an attention-based approach that adaptively transfers important features from the interaction layers of GNNs. This method outperforms the best TL approach for MD17 and gives a mean improvement of 6% over the non-pretrained model.
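A hedged sketch of the layer-freezing strategy suggested by these observations follows; the toy model and checkpoint name are hypothetical. The early embedding and interaction blocks, which carry the transferable representations, are frozen, and only the later blocks and the output head are fine-tuned.

```python
# Layer-freezing sketch for GNN transfer learning (toy model, not TAAG).
import torch
import torch.nn as nn

class ToyGNN(nn.Module):
    def __init__(self, dim=64, n_layers=4):
        super().__init__()
        self.embed = nn.Linear(8, dim)                      # node featurizer
        self.interactions = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(n_layers)]  # stand-in blocks
        )
        self.head = nn.Linear(dim, 1)                       # energy head

    def forward(self, x):
        h = self.embed(x)
        for layer in self.interactions:
            h = torch.relu(layer(h)) + h                    # residual update
        return self.head(h).sum(dim=-2)                     # sum over nodes

model = ToyGNN()
# model.load_state_dict(torch.load("oc20_pretrained.pt"))  # hypothetical ckpt

# Freeze the embedding and the first two interaction blocks ...
for module in [model.embed, *model.interactions[:2]]:
    for p in module.parameters():
        p.requires_grad = False

# ... and fine-tune only the parameters that remain trainable.
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
```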
The development of machine-learned potentials for catalyst discovery has predominantly been focused on very specific chemistries and material compositions. While effective at interpolating between available materials, these approaches struggle to generalize across chemical space. The recent curation of large-scale catalyst datasets has offered the opportunity to build a universal machine-learning potential spanning chemical and composition space. If accomplished, such a potential could accelerate the catalyst discovery process across a variety of applications (CO2 reduction, NH3 production, etc.) without the additional specialized training efforts that are currently required. The release of the Open Catalyst 2020 Dataset (OC20) has begun to do just that, pushing the heterogeneous catalysis and machine-learning communities toward building more accurate and robust models. In this Perspective, we discuss some of the challenges and findings of recent developments on OC20. We examine the performance of current models across different materials and adsorbates to identify notably underperforming subsets. We then discuss some of the modeling efforts surrounding energy conservation, approaches to finding and evaluating local minima, and augmentation with off-equilibrium data. To complement the community's ongoing developments, we end with an outlook on some of the important challenges that have yet to be thoroughly explored for large-scale catalyst discovery.
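On the energy-conservation point: one common approach is to obtain forces as the negative gradient of a predicted energy, which makes them conservative by construction, unlike a separately trained direct force head. A toy autograd sketch, with an illustrative energy function standing in for a learned potential:

```python
# Conservative forces by construction: F = -dE/dR via automatic differentiation.
import torch

def predicted_energy(positions):
    # Placeholder for a learned potential E(R); a smooth toy function here.
    return (positions ** 2).sum()

positions = torch.randn(8, 3, requires_grad=True)    # 8 atoms in 3D
energy = predicted_energy(positions)
forces = -torch.autograd.grad(energy, positions)[0]  # exact negative gradient
```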