Federated multi-partner machine learning can be an appealing and efficient method to increase the effective training data volume and thereby the predictivity of models, particularly when the generation of training data is resource intensive. In the landmark MELLODDY project, each of ten pharmaceutical companies realized aggregated improvements on its own classification and/or regression models through federated learning. To this end, they leveraged a novel implementation extending multi-task learning across partners, on a platform audited for privacy and security. The experiments involved an unprecedented cross-pharma dataset of 2.6+ billion confidential experimental activity data points, documenting 21+ million physical small molecules and 40+ thousand assays in on-target and secondary pharmacodynamics and pharmacokinetics. Appropriate complementary metrics were developed to evaluate predictive performance in the federated setting. In addition to predictive performance increases in labeled space, the results point towards an extended applicability domain in federated learning. Increases in collective training data volume, including by means of auxiliary data resulting from single concentration high-throughput and imaging assays, continued to boost predictive performances, albeit with saturating return. Markedly higher improvements were observed for pharmacokinetics and safety panel assay-based task subsets.
Machine learning models that predict the bioactivity of chemical compounds are now standard tools for cheminformaticians and computational medicinal chemists. Multi-task and federated learning are promising machine learning approaches that allow privacy-preserving use of large amounts of data from diverse sources, which is crucial for achieving good generalization and high-performance results. Using large, real-world data sets from six pharmaceutical companies, we investigate different strategies for averaging weighted task loss functions to train multi-task bioactivity classification models. The weighting strategies should be suitable for federated learning and should ensure that learning effort is well distributed even if the data are diverse. Comparing several approaches that use weights depending on the number of sub-tasks per assay, task size, and class balance, respectively, we find that a simple sub-task weighting approach leads to robust model performance for all investigated data sets and is especially suited for federated learning.
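As an illustration of averaging weighted task losses, the sketch below gives each task a weight equal to one over the number of sub-tasks that share its assay, then takes a weighted mean of per-task binary cross-entropy losses over the measured labels only. It is a minimal sketch under stated assumptions, not the scheme used in the study: the function names, the `1 / n_subtasks` form of the weight, and the masking convention are all hypothetical.

```python
import numpy as np


def subtask_weights(assay_ids):
    """Weight each task by 1 / (number of tasks that share its assay).

    Assumption: this is one plausible reading of "sub-task weighting";
    the abstract does not give the exact formula.
    """
    assay_ids = np.asarray(assay_ids)
    _, inverse, counts = np.unique(
        assay_ids, return_inverse=True, return_counts=True
    )
    return 1.0 / counts[inverse]


def weighted_multitask_bce(probs, labels, mask, weights):
    """Weighted average of per-task binary cross-entropy losses.

    probs, labels, mask have shape (n_compounds, n_tasks).
    mask is 1 where a label was measured and 0 where it is missing.
    """
    eps = 1e-7
    p = np.clip(probs, eps, 1.0 - eps)
    bce = -(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p))

    # Mean loss per task over the measured compounds only.
    n_obs = mask.sum(axis=0)
    per_task = (bce * mask).sum(axis=0) / np.maximum(n_obs, 1)

    # Tasks without any measured label get zero weight.
    w = weights * (n_obs > 0)
    return float((w * per_task).sum() / w.sum())


# Hypothetical example: assay "A" has two sub-tasks, assay "B" has one.
weights = subtask_weights(["A", "A", "B"])
probs = np.full((4, 3), 0.5)
labels = np.array([[1, 0, 1], [0, 0, 1], [1, 1, 0], [0, 1, 0]], dtype=float)
mask = np.ones_like(labels)
loss = weighted_multitask_bce(probs, labels, mask, weights)
```

With these weights, the two sub-tasks of assay "A" together contribute as much to the loss as the single task of assay "B", so an assay with many thresholds does not dominate training.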
Which variables determine the rate of gene and protein sequence evolution is a central question in evolutionary genomics. In the model organism fission yeast (Schizosaccharomyces pombe), the determinants of the rate of sequence evolution have yet to be established. Previous studies in other organisms have typically found gene expression level to be the most significant variable, with numerous other variables identified as having a smaller impact. Here, partial least squares (PLS) regression and partial correlation analysis are used to model sequence evolution rates in the fission yeast genome as a function of a range of variables. Variable importance in projection (VIP) scores and partial correlation coefficients are calculated for each variable and used as estimates of the influence of each independent variable on sequence evolution rate. Unlike in many previous studies in other organisms, centrality in the protein-protein interaction (PPI) network is shown to be the most important variable, and gene expression is found to be less influential. Considerable heterogeneity is found in the influence of different gene ontology terms as well as amino acid composition. However, the majority of the variance in constraint in fission yeast remains unexplained by this study, indicating that variables not yet considered, together with stochastic effects, probably have a considerable impact on the rate of molecular evolution.
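To make the partial correlation step concrete, the following sketch computes the correlation between two variables after regressing both on a set of control variables and correlating the residuals. It is a minimal sketch, assuming a Pearson-type partial correlation with linear controls; the abstract does not say which variant was used, and the function name and the synthetic data are hypothetical.

```python
import numpy as np


def partial_correlation(x, y, controls):
    """Pearson correlation of x and y after removing linear effects of controls.

    x, y: 1-D arrays of length n.
    controls: array of shape (n,) or (n, k) holding the variables to hold fixed.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix: intercept plus the control variables.
    Z = np.column_stack([np.ones(len(x)), np.asarray(controls, dtype=float)])

    # Residuals of x and y after least-squares regression on Z.
    res_x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    res_y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return float(np.corrcoef(res_x, res_y)[0, 1])


# Synthetic illustration: x and y are related only through a shared driver z.
rng = np.random.default_rng(0)
z = rng.normal(size=2000)
x = z + 0.5 * rng.normal(size=2000)
y = z + 0.5 * rng.normal(size=2000)

raw_r = float(np.corrcoef(x, y)[0, 1])
partial_r = partial_correlation(x, y, z)
```

In this synthetic case the raw correlation between `x` and `y` is strong, while the partial correlation controlling for `z` is close to zero, which is the kind of confounding that such an analysis is meant to separate out.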