A Step Towards Generalisability: Training a Machine Learning Scoring Function for Structure-Based Virtual Screening

Scantlebury, Jack; Vost, Lucy; Carbery, A.; Hadfield, Thomas E.; Turnbull, Oliver M; Brown, Nathan; Chenthamarakshan, Vijil; Das, Payel; Grosjean, Harold; Delft, Frank von; Deane, Charlotte M.

doi:10.1101/2022.10.28.511712

Cited by 8 publications

(25 citation statements)

References 53 publications

“…Finally we compared the performance of the fingerprint-based models to a recently proposed equivariant graph neural network (EGNN), PointVS. 22 We found that although all models were able to accurately predict binding in the presence of ligand-specific biases, their ability to attribute binding to the correct functional groups was substantially degraded, indicating they were less able to generalise than models which were not susceptible to ligand-specific biases. We also found that the attribution performance of the fingerprint-based models was heavily dependent on the parameters used to define the fingerprint, and that they were less able to identify the most important functional groups compared to the EGNN method, PointVS.…”

Section: Introduction (mentioning)

confidence: 82%

“…It must instead learn to identify important interactions from the atomic coordinates and atom types. Scantlebury et al 22 also applied a further distance cutoff, where any receptor atom which was not within 6 Å of any ligand atom was ignored; this reduced the dimensionality of the input graph by ignoring residues which were not part of the binding pocket. As we constrained each synthetic protein to be within a box defined as $[x_{\min} - 5,\, x_{\max} + 5] \times [y_{\min} - 5,\, y_{\max} + 5] \times [z_{\min} - 5,\, z_{\max} + 5]$, where $x_{\min}$ was the smallest ligand atom x-coordinate and the other values were defined similarly, the vast majority of synthetic residues would be within 6 Å of at least one ligand atom and so this cutoff should have minimal impact on the performance of PointVS.…”

Section: Contribution-based Generative Process (mentioning)

confidence: 99%
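
The distance cutoff and padded box described in the statement above can be illustrated with a minimal sketch. This is not the code from either paper: the function names `pocket_atoms` and `padded_box`, the toy coordinates, and the choice to treat "within 6 Å" as less than or equal to the cutoff are all assumptions made for illustration.

```python
import math

def pocket_atoms(receptor, ligand, cutoff=6.0):
    """Keep receptor atoms within `cutoff` angstroms of at least one ligand atom.

    Assumption: "within" is read as distance <= cutoff.
    """
    return [r for r in receptor
            if any(math.dist(r, l) <= cutoff for l in ligand)]

def padded_box(ligand, pad=5.0):
    """Return (min - pad, max + pad) for each of x, y, z over the ligand atoms."""
    xs, ys, zs = zip(*ligand)
    return [(min(c) - pad, max(c) + pad) for c in (xs, ys, zs)]

# Hypothetical toy coordinates in angstroms, not taken from any dataset.
ligand = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
receptor = [(3.0, 0.0, 0.0), (0.0, 5.9, 0.0), (10.0, 0.0, 0.0)]

kept = pocket_atoms(receptor, ligand)  # drops the atom at x = 10.0
box = padded_box(ligand)               # x range is (-5.0, 6.5)
```

A point at a corner of the padded box is offset by 5 Å along each axis from the ligand's bounding box, so it lies at least 5√3 ≈ 8.66 Å from every ligand atom. That is consistent with the statement's wording that the vast majority, rather than all, of the synthetic residues fall inside the 6 Å cutoff.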

“…We trained 8 different RF PLEC models, varying the PLEC distance cutoff by 0.5 Å from 2.5 Å to 6 Å. To train PointVS, we used the default hyperparameters outlined by Scantlebury et al 22.…”

Section: Contribution-based Generative Process (mentioning)

confidence: 99%
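
As a quick check of the arithmetic in the statement above, stepping from 2.5 Å to 6 Å in increments of 0.5 Å gives exactly eight cutoff values, one per RF PLEC model. The variable name `cutoffs` is illustrative only; the sketch does not reproduce the authors' training code.

```python
# Eight PLEC distance cutoffs in angstroms: 2.5, 3.0, ..., 6.0.
# Built from an integer index to avoid accumulating floating-point error.
cutoffs = [2.5 + 0.5 * i for i in range(8)]
```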

“…As higher scoring molecules are classified as active examples, masking assigns a high level of importance to atoms whose omission drastically reduces the model's confidence that an example has an active label. Whilst Scantlebury et al 22 used an attention mechanism to score the relative importance of different atomic interactions, we used masking for the experiments in this paper as it can be used to generate attributions for any predictive model, allowing a closer comparison between different models.…”

Section: Contribution-based Generative Process (mentioning)

confidence: 99%
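
The masking idea in the statement above can be sketched as follows, assuming the attribution for an atom is the drop in the model's score when that atom is omitted. The helper `masking_attributions` and the scorer `toy_score` are hypothetical stand-ins, not the models or code used in the cited work; any callable that maps a list of atoms to a score could be passed in, which is the model-agnostic property the statement highlights.

```python
def masking_attributions(score, atoms):
    """Attribution of atom i = score(all atoms) - score(all atoms except i)."""
    full = score(atoms)
    return [full - score(atoms[:i] + atoms[i + 1:])
            for i in range(len(atoms))]

def toy_score(atoms):
    """Hypothetical scorer: 'active' only if both an N and an O are present."""
    return 1.0 if "N" in atoms and "O" in atoms else 0.0

atoms = ["C", "N", "C", "O"]
attributions = masking_attributions(toy_score, atoms)
```

In this toy case, removing either carbon leaves the score unchanged, while removing the nitrogen or the oxygen drops it from 1.0 to 0.0, so only those two atoms receive a non-zero attribution.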

“…Whilst several authors have used attribution techniques on real-world data to uncover important functional groups, 16,21,22 it is often difficult to ascertain the precise contribution of each atom in an experimentally obtained protein-ligand complex. Combined with the difficulty in manually curating a large-scale test set, it is currently infeasible to objectively assess the attribution performance of ML algorithms on real-world virtual screening tasks.…”

Section: Introduction (mentioning)

confidence: 99%

Exploring The Ability Of Machine Learning-Based Virtual Screening Models To Identify The Functional Groups Responsible For Binding

Hadfield, Scantlebury, Deane. 2023. Preprint.

Many recently proposed structure-based virtual screening models appear to be able to accurately distinguish high affinity binders from non-binders. However, several recent studies have shown that they often do so by exploiting ligand-specific biases in the dataset, rather than identifying favourable intermolecular interactions in the input protein-ligand complex. In this work we propose a novel approach for assessing the extent to which machine learning-based virtual screening models are able to identify the functional groups responsible for binding. To sidestep the difficulty in establishing the ground truth importance of each atom of a large scale set of protein-ligand complexes, we propose a protocol for generating synthetic data where the label of an example is assigned by a 3-dimensional deterministic binding rule. This allows us to precisely quantify the ground truth importance of each atom and compare it to the model generated attributions. Using our generated datasets, we demonstrate that a recently proposed deep learning-based virtual screening model, PointVS, identified the most important functional groups with 39% more efficiency than a fingerprint-based random forest, suggesting that it would generalise more effectively to new examples. In addition, we found that ligand-specific biases, such as those present in widely used virtual screening datasets, substantially impaired the ability of all ML models to identify the most important functional groups. We have made our synthetic data generation framework available to facilitate the benchmarking of new virtual screening models. Code is available at https://github.com/tomhadfield95/synthVS.

Modern machine‐learning for binding affinity estimation of protein–ligand complexes: Progress, opportunities, and challenges

Harren, Gutermuth, Grebner et al. 2024. WIREs Comput Mol Sci.

Structure‐based drug design is a widely applied approach in the discovery of new lead compounds for known therapeutic targets. In most structure‐based drug design applications, the docking procedure is considered the crucial step. Here, a potential ligand is fitted into the binding site, and a scoring function assesses its binding capability. With the rise of modern machine‐learning in drug discovery, novel scoring functions using machine‐learning techniques achieved significant performance gains in virtual screening and ligand optimization tasks on retrospective data. However, real‐world applications of these methods are still limited. Missing success stories in prospective applications are one reason for this. Additionally, the fast‐evolving nature of the field makes it challenging to assess the advantages of each individual method. This review will highlight recent strides toward improved real world applicability of machine‐learning based scoring, enabling a better understanding of the potential benefits and pitfalls of these functions on a project. Furthermore, a systematic way of classifying machine‐learning based scoring that facilitates comparisons will be presented. This article is categorized under: Data Science > Chemoinformatics; Data Science > Artificial Intelligence/Machine Learning; Software > Molecular Modeling.

A High-Quality Data Set of Protein–Ligand Binding Interactions Via Comparative Complex Structure Modeling

Li, Shen, Zhu et al. 2024. J. Chem. Inf. Model.

High-quality protein−ligand complex structures provide the basis for understanding the nature of noncovalent binding interactions at the atomic level and enable structure-based drug design. However, experimentally determined complex structures are scarce compared with the vast chemical space. In this study, we addressed this issue by constructing the BindingNet data set via comparative complex structure modeling, which contains 69,816 modeled high-quality protein−ligand complex structures with experimental binding affinity data. BindingNet provides valuable insights into investigating protein−ligand interactions, allowing visual inspection and interpretation of structural analogues' structure−activity relationships. It can also be used for evaluating machine-learning-based scoring functions. Our results indicate that machine learning models trained on BindingNet could reduce the bias caused by buried solvent-accessible surface area, as we previously found for models trained on the PDBbind data set. We also discussed strategies to improve BindingNet and its potential utilization for benchmarking the molecular docking methods and ligand binding free energy calculation approaches. The BindingNet complements PDBbind in constructing a sufficient and unbiased protein−ligand binding data set and is freely available at http://bindingnet.huanglab.org.cn.
