Computational approaches for synthesizing new chemical compounds have led to an explosion of chemical data in the field of drug discovery. Quantitative structure-activity relationship (QSAR) modeling is a widely used classification and regression method that represents the relationship between a chemical structure and its activities. This research focuses on the effect of dimensionality-reduction techniques on a high-dimensional QSAR dataset. Because of the high-dimensional nature of QSAR data, dimensionality-reduction techniques have become an integral part of the modeling process. Principal component analysis (PCA) is a feature-extraction technique with several applications in exploratory data analysis, visualization, and dimensionality reduction. However, linear PCA is inadequate to handle the complex structure of QSAR data. In light of the wide array of current feature-extraction techniques, we perform a comparative empirical study of five of them: PCA, kernel PCA, deep generalized autoencoder (dGAE), Gaussian random projection (GRP), and sparse random projection (SRP). The experiments are performed on a high-dimensional QSAR dataset comprising 6394 features. The transformed low-dimensional dataset is fed into a deep learning classification model to predict a QSAR biological activity. Three approaches are adopted to validate and measure the proposed techniques: (i) comparing the performance of the classification models, (ii) visualizing the relationship (correlation) between features in the low-dimensional Euclidean space, and (iii) validating the proposed techniques using an external dataset. To the best of our knowledge, this study is the first to investigate and compare the aforementioned feature-extraction techniques in a QSAR modeling context. The results provide valuable insights into the behavior of the different techniques on both the negative and positive classes.
With linear PCA as a baseline, we show that the investigated techniques substantially outperform the baseline in multiple accuracy measures and demonstrate useful ways of extracting significant features.
INDEX TERMS Autoencoder, blood-brain barrier (BBB) permeability, deep generalized autoencoder (dGAE), dimensionality reduction, feature extraction, Gaussian random projection, principal component analysis, quantitative structure-activity relationship (QSAR), sparse random projection.
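As a rough illustration of one of the five techniques named above, the following is a minimal sketch of Gaussian random projection using only the Python standard library. It is not the authors' implementation: the function name, the toy data, the seed, and the choice of 20 output dimensions are all assumptions made for illustration, and the abstract does not state which target dimensionality was used. The sketch draws each entry of the projection matrix from a zero-mean normal distribution with variance 1/k, where k is the number of output components, which is a common convention for this technique.

```python
import math
import random


def gaussian_random_projection(rows, n_components, seed=0):
    """Project each row (a list of floats) onto n_components random directions.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    rng = random.Random(seed)
    n_features = len(rows[0])
    # Standard deviation 1/sqrt(k) gives matrix entries with variance 1/k.
    scale = 1.0 / math.sqrt(n_components)
    matrix = [
        [rng.gauss(0.0, scale) for _ in range(n_components)]
        for _ in range(n_features)
    ]
    # Multiply every row by the (n_features x n_components) random matrix.
    return [
        [sum(x * matrix[j][c] for j, x in enumerate(row)) for c in range(n_components)]
        for row in rows
    ]


# Toy stand-in for a descriptor matrix: 5 compounds, 200 features (assumed sizes).
data_rng = random.Random(1)
toy = [[data_rng.random() for _ in range(200)] for _ in range(5)]
projected = gaussian_random_projection(toy, n_components=20)
print(len(projected), len(projected[0]))
```

Unlike PCA or kernel PCA, the projection matrix here does not depend on the data at all, which is what makes random projection cheap on datasets with thousands of features.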
The blood–brain barrier plays a crucial role in regulating the passage of 98% of the compounds that enter the central nervous system (CNS). Compounds with high permeability must be identified to enable the synthesis of brain medications for the treatment of various brain diseases, such as Parkinson’s, Alzheimer’s, and brain tumors. Over the years, several models have been developed to solve this problem and have achieved acceptable accuracy scores in predicting compounds that penetrate the blood–brain barrier. However, predicting compounds with “low” permeability has remained a challenging task. In this study, we present a deep learning (DL) classification model to predict blood–brain barrier permeability. The proposed model addresses the fundamental issues present in former models: high dimensionality, class imbalance, and low specificity scores. We address these issues in the dataset before developing the classification model: class imbalance is handled using oversampling techniques, and high dimensionality using a non-linear dimensionality-reduction technique known as kernel principal component analysis (KPCA). This technique transforms the high-dimensional dataset into a low-dimensional Euclidean space while retaining important information. For the classification task, we developed an enhanced feed-forward deep learning model and a convolutional neural network model. In terms of specificity scores (i.e., predicting compounds with low permeability), the results obtained by the enhanced feed-forward deep learning model outperformed those obtained by other models in the literature that were developed using the same technique. In addition, the proposed convolutional neural network model surpassed models used in other studies in multiple accuracy measures, including overall accuracy and specificity.
The proposed approach addresses the low specificity, and the resulting high false-positive rate, that earlier models commonly suffer from.
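To make the link between class imbalance, specificity, and false-positive rate concrete, here is a minimal standard-library sketch. It assumes plain random oversampling (duplicating minority-class samples until the classes are balanced); the abstract says only "oversampling techniques", so the specific method, the function names, and the toy labels below are illustrative assumptions rather than the authors' procedure. Specificity is computed as TN / (TN + FP), with the low-permeability class treated as the negative class (label 0).

```python
import random


def random_oversample(features, labels, seed=0):
    """Duplicate minority-class samples at random until all classes are equal in size.

    Hypothetical helper; the paper's actual oversampling method is not specified here.
    """
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(features, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(samples) for samples in by_class.values())
    out_x, out_y = [], []
    for y, samples in by_class.items():
        extra = [rng.choice(samples) for _ in range(target - len(samples))]
        for x in samples + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


def specificity(y_true, y_pred, negative=0):
    """True-negative rate: TN / (TN + FP). Returns NaN if there are no negatives."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == negative and p == negative)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == negative and p != negative)
    return tn / (tn + fp) if (tn + fp) else float("nan")


# Toy imbalanced set: four permeable compounds (1), one low-permeability compound (0).
xs, ys = random_oversample([[0.1], [0.2], [0.3], [0.4], [0.5]], [1, 1, 1, 1, 0])
print(ys.count(0), ys.count(1))

# Toy predictions: one of three true negatives is misclassified as positive.
spec = specificity([0, 0, 0, 1, 1], [0, 1, 0, 1, 1])
false_positive_rate = 1.0 - spec
print(round(spec, 3), round(false_positive_rate, 3))
```

In practice, oversampling of this kind is usually applied to the training split only, so that duplicated samples do not leak into the data used to measure specificity.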