2023
DOI: 10.1038/s41597-022-01882-6

SPICE, A Dataset of Drug-like Molecules and Peptides for Training Machine Learning Potentials

Abstract: Machine learning potentials are an important tool for molecular simulation, but their development is held back by a shortage of high quality datasets to train them on. We describe the SPICE dataset, a new quantum chemistry dataset for training potentials relevant to simulating drug-like small molecules interacting with proteins. It contains over 1.1 million conformations for a diverse set of small molecules, dimers, dipeptides, and solvated amino acids. It includes 15 elements, charged and uncharged molecules,…

Citations: cited by 67 publications (86 citation statements)
References: 47 publications
“…This chemical space is incomplete, as many drug molecules contain phosphorus, sulfur, and halogen atoms, and some contain metal ions. The ANI-2x model was extended to include S, F, and Cl, but the full data set, including the important reference energies and forces at the ωB97X/6-31G* level, to our knowledge, has not yet become publicly available. Currently, there are a number of recent data sets that include compounds that contain phosphorus, sulfur, and halogens at various levels of theory as well as metal ions. Among them, only the SPICE data set includes forces at the ωB97M-D3BJ/def2-TZVPPD level and currently includes over 420K phosphorus, 520K sulfur, 750K halogen, and 8K metal-containing structures.…”
Section: Results (mentioning)
confidence: 99%
“…Currently, there are a number of recent data sets that include compounds that contain phosphorus, sulfur, and halogens at various levels of theory as well as metal ions. Among them, only the SPICE data set includes forces at the ωB97M-D3BJ/def2-TZVPPD level and currently includes over 420K phosphorus, 520K sulfur, 750K halogen, and 8K metal-containing structures.…”
Section: Results (mentioning)
confidence: 99%
“…To curate a dataset representing the chemical space of interest for biophysical modeling of biomolecules and drug-like small molecules, we use the SPICE [12] dataset, enumerating reasonable protonation and tautomeric states with the OpenEye Toolkit. We generated AM1-BCC ELF10 charges for each of these molecules using the OpenEye Toolkit, and trained EspalomaCharge (Figure 1) to reproduce the partial atomic charges with a squared loss function.…”
Section: The Spice Dataset Covers Biochemically and Biophysically Int... (mentioning)
confidence: 99%
“…In this paper, we use the continuous embedding atom representation scheme from Espaloma in conjunction with analytical constrained charge assignment inspired by charge equilibration to come up with an ultra-fast machine learning surrogate for partial charge assignment (EspalomaCharge). We train EspalomaCharge on an expanded set of protonation states and tautomers of representative biomolecules and druglike molecules (the SPICE dataset [12]) to assign high-quality AM1-BCC ELF10 charges [26]. The resulting EspalomaCharge model accurately reproduces AM1-BCC ELF10 charges to an error well within the discrepancy between AmberTools sqm and OpenEye oequacpac implementations on average 2,000 times faster than AmberTools on the SPICE dataset, can utilize either CPU or GPU, and scales as O(N) with number of atoms, allowing even entire proteins to be assigned AM1-BCC equivalent charges.…”
mentioning
confidence: 99%
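
The "analytical constrained charge assignment inspired by charge equilibration" mentioned in the statement above can be illustrated with a minimal sketch. It assumes each atom carries two scalar parameters, here called `electronegativity` and `hardness` (illustrative names, not taken from the quoted text), and that charges minimize a per-atom quadratic energy subject to a fixed total charge, which has a closed-form solution via a Lagrange multiplier. The `squared_loss` helper mirrors the squared loss against reference charges mentioned in the earlier statement. This is not the EspalomaCharge implementation; all function names are hypothetical.

```python
from typing import List, Sequence


def equilibrate_charges(
    electronegativity: Sequence[float],
    hardness: Sequence[float],
    total_charge: float,
) -> List[float]:
    """Minimize sum_i (e_i * q_i + 0.5 * s_i * q_i**2) subject to sum_i q_i = Q.

    Setting the gradient of the Lagrangian to zero gives q_i = (lam - e_i) / s_i,
    and the constraint fixes lam = (Q + sum_j e_j / s_j) / sum_j (1 / s_j).
    """
    if not hardness or len(electronegativity) != len(hardness):
        raise ValueError("need one electronegativity and one hardness per atom")
    if any(s <= 0 for s in hardness):
        raise ValueError("hardness must be positive for a unique minimum")

    inv_s = [1.0 / s for s in hardness]
    e_over_s = [e / s for e, s in zip(electronegativity, hardness)]
    # Lagrange multiplier that enforces the total-charge constraint.
    lam = (total_charge + sum(e_over_s)) / sum(inv_s)
    return [(lam - e) / s for e, s in zip(electronegativity, hardness)]


def squared_loss(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Sum of squared differences between predicted and reference charges."""
    return sum((p - r) ** 2 for p, r in zip(predicted, reference))


if __name__ == "__main__":
    # Toy three-atom anion with made-up parameters and reference charges.
    q = equilibrate_charges([0.3, -0.1, 0.2], [1.0, 2.0, 4.0], total_charge=-1.0)
    print(q, sum(q), squared_loss(q, [-0.6, -0.2, -0.2]))
```

Because the solution is a single pass over the atoms, the cost of this step grows linearly with molecule size, and the predicted charges always sum exactly to the requested total charge.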
“…In order to produce quantitative results and stable dynamics in this context, next iterations of the FENNIX model will need to explicitly handle charged systems, which will require a more thorough dataset. The recently introduced SPICE dataset [81] could be a good complement to the ones already used in this work as it provides reference data for many charged molecules and focuses on biological systems (however at the lower quality DFT level). Furthermore, it will require finer description of molecular interactions through the inclusion of more advanced energy terms (such as multipolar electrostatics and polarization).…”
Section: Perspective: Gas-phase Simulation Of a Protein (mentioning)
confidence: 99%