Metal–organic
frameworks (MOFs) are a class of crystalline
materials composed of metal nodes or clusters connected via semi-rigid
organic linkers. Owing to their high surface area, porosity, and tunability,
MOFs have received significant attention for numerous applications
such as gas separation and storage. Atomistic simulations and data-driven
methods [e.g., machine learning (ML)] have been successfully employed
to screen large databases and to develop new MOFs for CO2 capture that
were subsequently synthesized and validated experimentally. To enable
data-driven materials discovery for any application, the first (and
arguably most crucial) step is database curation. This work introduces
the ab initio REPEAT charge MOF (ARC–MOF) database, which contains
∼280,000 MOFs that have been either experimentally
characterized or computationally generated, spanning all publicly
available MOF databases. A key feature of ARC–MOF is that it
contains density functional theory-derived electrostatic potential
fitted partial atomic charges for each MOF. Additionally, ARC–MOF
contains pre-computed descriptors for out-of-the-box ML applications.
An in-depth analysis of the diversity of ARC–MOF with respect
to the currently mapped design space of MOFs was performed, a
critical yet commonly overlooked aspect of previously reported MOF
databases. Using this analysis, balanced subsets from ARC–MOF
for various ML purposes have been identified, along with a case study of
the effect of the training set on ML performance. Other chemical and
geometric diversity analyses are presented, together with an analysis of
the effect of the charge-assignment method on atomistic simulations of
gas uptake in MOFs.