Because larger training volumes improve predictive model quality, leveraging existing external data sources promises savings in time and cost. In a drug discovery setting, pharmaceutical companies each own substantial but confidential datasets. The MELLODDY project develops a privacy-preserving federated machine learning solution and deploys it at an unprecedented scale (more than 100,000 tasks across ten major pharmaceutical companies), while ensuring the security and privacy of each partner’s sensitive data. Each partner builds models for its own private assays that benefit from a shared representation. Established predictive performance metrics such as AUC ROC or AUC PR can only be evaluated on unseen labelled chemical space; they cannot gauge performance gains in unlabelled chemical space. Federated learning indirectly extends the labelled space, but in a privacy-preserving context a partner cannot use this label extension for performance assessment. Metrics that estimate the uncertainty of a prediction, by contrast, can be calculated even where no label is known. In practice, the chemical space covered by predictions of sufficient confidence reflects the applicability domain of a model. After establishing a link to established performance metrics, we propose the efficiency from the conformal prediction framework (‘conformal efficiency’) as a proxy for the size of the applicability domain. A documented extension of the applicability domain would qualify as a tangible benefit of federated learning. In interim assessments, MELLODDY partners report a median increase in conformal efficiency of the federated over the single-partner model of 5.5% (with increases of up to 9.7%). Subject to distributional conditions, that efficiency increase can be interpreted directly as the expected increase in conformal, i.e. high-confidence, predictions.
In conclusion, we present the first evidence that privacy-preserving federated machine learning across massive drug-discovery datasets from ten pharma partners indeed extends the applicability domain of property prediction models.
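To make the notion of conformal efficiency concrete, the following is a minimal sketch and not the MELLODDY implementation. It assumes a binary active/inactive task, class-conditional (Mondrian) inductive conformal prediction, and the common definition of efficiency as the fraction of prediction sets that contain exactly one label. The function names, the nonconformity score (one minus the predicted probability of the class) and the toy numbers are all illustrative assumptions. The efficiency is computed without any labels for the test compounds, which is the property the abstract relies on.

```python
import numpy as np

def mondrian_p_values(cal_scores, cal_labels, test_scores):
    """Class-conditional conformal p-values for a binary task.

    Scores are predicted probabilities of the active class (label 1).
    Assumed nonconformity: 1 - probability assigned to the candidate class.
    """
    cal_scores = np.asarray(cal_scores, dtype=float)
    cal_labels = np.asarray(cal_labels)
    test_scores = np.asarray(test_scores, dtype=float)

    p = np.zeros((len(test_scores), 2))
    for c in (0, 1):
        # Probability the model assigns to class c.
        cal_prob_c = cal_scores if c == 1 else 1.0 - cal_scores
        test_prob_c = test_scores if c == 1 else 1.0 - test_scores
        # Nonconformity of calibration compounds whose true label is c.
        cal_nc = 1.0 - cal_prob_c[cal_labels == c]
        test_nc = 1.0 - test_prob_c
        # Count calibration compounds at least as nonconforming as each test compound.
        n_ge = (cal_nc[None, :] >= test_nc[:, None]).sum(axis=1)
        p[:, c] = (n_ge + 1) / (len(cal_nc) + 1)
    return p

def conformal_efficiency(p_values, epsilon):
    """Fraction of test compounds whose prediction set holds exactly one label."""
    in_set = p_values > epsilon          # label c is kept when p_c > epsilon
    return float((in_set.sum(axis=1) == 1).mean())

# Toy calibration set (labels known) and unlabelled test compounds.
cal_scores = [0.1, 0.2, 0.8, 0.9]
cal_labels = [0, 0, 1, 1]
test_scores = [0.05, 0.5, 0.95]

p_vals = mondrian_p_values(cal_scores, cal_labels, test_scores)
print(conformal_efficiency(p_vals, epsilon=0.4))
```

Under this definition, comparing the efficiency of a federated model with that of a single-partner model on the same unlabelled compounds gives the kind of relative increase reported above.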
Motivation: In silico prediction of protein-ligand binding is a hot topic in computational chemistry and machine learning-based drug discovery, as an accurate prediction model could reduce the time and resources required to detect, identify and prioritize potential drug candidates. Proteochemometric modelling (PCM) is a promising approach for in silico protein-ligand binding prediction that utilises both compound and target descriptors. However, in its original form, a PCM model cannot separate multiple assays associated with the same target. Therefore, a practitioner applying the PCM approach to experimental data must either select only one assay for each target, and thus exclude a potentially significant amount of data, or pool measurements from different assays, effectively mixing possibly very different functional dependencies between (protein, ligand) pairs and experimental measurements. Results: We describe two modifications of PCM models that increase their flexibility by allowing multiple assays associated with the same target to be separated. Evaluated on a subset of internal Bayer dose-response data and on ChEMBL, these approaches improve performance compared to standard PCM models. Our results demonstrate the importance of disentangling multiple assays associated with the same target when using PCM methodology in a pharmaceutical environment. Availability: Source code is made publicly available on GitHub for non-commercial usage after publication.
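The limitation described in the motivation can be illustrated with a small sketch. It assumes that a standard PCM input is simply the compound descriptor concatenated with the target descriptor. The descriptor values, assay names and measurements are hypothetical, and the sketch does not reproduce either of the two modifications the authors propose.

```python
import numpy as np

def pcm_features(compound_desc, target_desc):
    """Standard PCM input (assumed form): compound and target descriptors concatenated."""
    return np.concatenate([compound_desc, target_desc])

# Hypothetical toy descriptors for one compound and one protein target.
compound = np.array([1.0, 0.0, 1.0])
target = np.array([0.2, 0.7])

# The same (protein, ligand) pair measured in two different assays.
records = [
    {"assay": "assay_A", "x": pcm_features(compound, target), "y": 6.1},
    {"assay": "assay_B", "x": pcm_features(compound, target), "y": 7.4},
]

# Both rows have identical inputs but different measured values, so a model
# that sees only x cannot tell the two assays apart.
same_input = np.array_equal(records[0]["x"], records[1]["x"])
print(same_input, records[0]["y"], records[1]["y"])
```

With pooled data of this kind, the model is asked to fit two different responses to one input, which is the mixing of functional dependencies the abstract refers to.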