Machine learning algorithms have attained widespread use in assessing the potential toxicities of pharmaceuticals and industrial chemicals because they are faster and cheaper than experimental bioassays. Gradient boosting is an effective algorithm that often achieves high predictivity, but historically its relatively long computational time limited its application in predicting large compound libraries or developing in silico predictive models that require frequent retraining. LightGBM, a recent improvement of the gradient boosting algorithm, inherited its high predictivity but resolved its scalability and computational-time limitations by adopting a leaf-wise tree growth strategy and introducing novel techniques. In this study, we compared the predictive performance and the computational time of LightGBM to those of deep neural networks, random forests, support vector machines, and XGBoost. All algorithms were rigorously evaluated on publicly available Tox21 and mutagenicity datasets using a nested 10-fold cross-validation scheme with integrated Bayesian optimization, which performs hyperparameter optimization while examining model generalizability and transferability to new data. The evaluation results demonstrated that LightGBM is an effective and highly scalable algorithm, offering the best predictive performance while requiring significantly less computational time than the other investigated algorithms across all Tox21 and mutagenicity datasets. We recommend LightGBM for applications in in silico safety assessment and in other areas of cheminformatics to fulfill the ever-growing demand for accurate and rapid prediction of various toxicity- or activity-related endpoints for the large compound libraries present in the pharmaceutical and chemical industries.
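The nested cross-validation idea mentioned above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names (`k_folds`, `split`, `nested_cv`) are hypothetical, the "model" is a toy threshold classifier, and a plain search over a small candidate list stands in for Bayesian optimization. The point is only the structure: the inner loop picks a hyperparameter using the outer training data alone, and the outer loop scores that choice on data it never saw.

```python
from statistics import mean

def k_folds(n, k):
    """Split positions 0..n-1 into k contiguous, near-equal folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def split(indices, k):
    """Yield (train, test) index lists for k-fold CV over `indices`."""
    for fold in k_folds(len(indices), k):
        test = [indices[i] for i in fold]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

def nested_cv(xs, ys, candidates, fit, score, k_outer=10, k_inner=10):
    """Return one held-out score per outer fold."""
    def take(seq, idx):
        return [seq[i] for i in idx]

    outer_scores = []
    for train, test in split(list(range(len(xs))), k_outer):
        # Inner loop: rate each candidate using the outer training part only.
        def inner_score(h):
            return mean(
                score(fit(take(xs, tr), take(ys, tr), h), take(xs, va), take(ys, va))
                for tr, va in split(train, k_inner)
            )
        best = max(candidates, key=inner_score)  # stand-in for Bayesian optimization
        model = fit(take(xs, train), take(ys, train), best)
        outer_scores.append(score(model, take(xs, test), take(ys, test)))
    return outer_scores

# Toy data (assumption): the label is 1 when x >= 5.
xs = [i % 10 for i in range(40)]
ys = [1 if x >= 5 else 0 for x in xs]

# Toy "model": the threshold itself; a real fit() would train on the data.
fit = lambda X, Y, h: h
score = lambda h, X, Y: sum((x > h) == bool(y) for x, y in zip(X, Y)) / len(X)

outer_scores = nested_cv(xs, ys, [2.5, 4.5, 7.5], fit, score, k_outer=4, k_inner=3)
```

Because the outer test fold never takes part in hyperparameter selection, the outer scores estimate performance on new data rather than on data the tuning has already seen.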
Virtual screening is widely applied in drug discovery, and significant effort has been put into improving current methods. In this study, we evaluated the performance of compound ranking in virtual screening using five different data fusion algorithms on a total of 16 data sets. The data were generated by docking, pharmacophore search, shape similarity, and electrostatic similarity, spanning both structure- and ligand-based methods. The algorithms used for data fusion were sum rank, rank vote, sum score, Pareto ranking, and parallel selection. None of the fusion methods requires any prior knowledge or input other than the results from the single methods; thus, they are readily applicable. The results show that compound ranking using data fusion improves the performance and consistency of virtual screening compared to the single methods alone. The best-performing data fusion algorithm was parallel selection, but both rank vote and Pareto ranking also performed well.
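As a rough illustration of what two of the named fusion algorithms might look like, the sketch below implements sum rank and parallel selection. The abstract does not define these algorithms, so the definitions here are assumptions based on common usage: sum rank orders compounds by the sum of their per-method ranks, and parallel selection lets each method in turn contribute its best not-yet-chosen compound. The function names, compound labels, and scores are hypothetical, and higher scores are assumed to be better.

```python
def ranks(scores):
    """Map compound -> rank (1 = best), assuming a higher score is better."""
    ordered = sorted(scores, key=lambda c: -scores[c])
    return {c: i + 1 for i, c in enumerate(ordered)}

def sum_rank(methods):
    """Order compounds by the sum of their per-method ranks (lower is better)."""
    per_method = [ranks(m) for m in methods]
    compounds = list(methods[0])  # assumes every method scored the same compounds
    totals = {c: sum(r[c] for r in per_method) for c in compounds}
    return sorted(compounds, key=lambda c: totals[c])

def parallel_selection(methods, n):
    """Each method in turn adds its best not-yet-chosen compound until n are picked."""
    orderings = [sorted(m, key=lambda c: -m[c]) for m in methods]
    n = min(n, len(set().union(*orderings)))  # cannot pick more than exist
    pointers = [0] * len(orderings)
    chosen = []
    while len(chosen) < n:
        for k, order in enumerate(orderings):
            # Skip compounds that another method has already contributed.
            while pointers[k] < len(order) and order[pointers[k]] in chosen:
                pointers[k] += 1
            if pointers[k] < len(order) and len(chosen) < n:
                chosen.append(order[pointers[k]])
    return chosen

# Hypothetical scores from two single methods (illustrative values only).
docking = {"A": 9.0, "B": 7.0, "C": 5.0, "D": 1.0}
shape = {"A": 0.2, "B": 0.9, "C": 0.8, "D": 0.1}

fused_order = sum_rank([docking, shape])
top_three = parallel_selection([docking, shape], 3)
```

Both functions need only the single-method results as input, which matches the abstract's remark that the fusion methods require no prior knowledge.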
Multiple sclerosis (MS) is a T-cell-mediated disease of the central nervous system, characterized by damage to myelin and axons, resulting in progressive neurological disability. Genes may influence susceptibility to MS, but results of association studies are inconsistent, aside from the identification of HLA class II haplotypes. Whole-genome linkage screens in MS have both confirmed the importance of the HLA region and uncovered non-HLA loci that may harbor susceptibility genes. In this two-stage analysis, we determined genotypes, in up to 672 MS patients and 672 controls, for 123 single-nucleotide polymorphisms (SNPs) in 66 genes. Genes were chosen based on their chromosomal positions or biological functions. In stage one, 22 genes contained at least one SNP for which the carriage rate for one allele differed significantly (P < 0.08) between patients and controls. After additional genotyping in stage two, two genes, each containing at least three significantly (P < 0.05) associated SNPs, conferred susceptibility to MS: LAG3 on chromosome 12p13, and IL7R on 5p13. LAG3 inhibits activated T cells, while IL7R is necessary for the maturation of T and B cells. These results imply that germline allelic variation in genes involved in immune homeostasis (and, by extension, derangement of immune homeostasis) influences the risk of MS.