Application of k-means clustering, linear discriminant analysis and multivariate linear regression for the development of a predictive QSAR model on 5-lipoxygenase inhibitors

Andrada, Matías F.; Vega-Hissi, Esteban G.; Estrada, Mario R.; Martínez, Juan C. Garro

doi:10.1016/j.chemolab.2015.03.001

Cited by 23 publications

(16 citation statements)

References 39 publications


“…Matías used K-means clustering and Linear Discriminant analysis for the selection of training and testing sets, and divided 58 derivatives into two clusters. The results show that a molecular descriptor correctly discriminates 100% of the compounds of each cluster [37]. So K-means clustering analysis is a good method for classifying multiple study objects into featured groups.…”

Section: Discussion (mentioning)

confidence: 99%

Identifying and Classifying Pollution Hotspots to Guide Watershed Management in a Large Multiuse Watershed

Kaplan et al. 2017, IJERPH

In many locations around the globe, large reservoir sustainability is threatened by land use change and direct pollution loading from the upstream watershed. However, the size and complexity of upstream basins makes the planning and implementation of watershed-scale pollution management a challenge. In this study, we established an evaluation system based on 17 factors, representing the potential point and non-point source pollutants and the environmental carrying capacity which are likely to affect the water quality in the Dahuofang Reservoir and watershed in northeastern China. We used entropy methods to rank 118 subwatersheds by their potential pollution threat and clustered subwatersheds according to the potential pollution type. Combining ranking and clustering analyses allowed us to suggest specific areas for prioritized watershed management (in particular, two subwatersheds with the greatest pollution potential) and to recommend the conservation of current practices in other less vulnerable locations (91 small watersheds with low pollution potential). Finally, we identified the factors most likely to influence the water quality of each of the 118 subwatersheds and suggested adaptive control measures for each location. These results provide a scientific basis for improving the watershed management and sustainability of the Dahuofang reservoir and a framework for identifying threats and prioritizing the management of watersheds of large reservoirs around the world.
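The citation statement above describes using k-means clustering to choose training and test sets so that each cluster is represented in both. The following is a minimal sketch of that idea, not the cited paper's exact protocol: the descriptor values, the function names (`kmeans`, `cluster_split`), the farthest-first initialization and the 25% test fraction are all assumptions made for illustration.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    # Deterministic farthest-first initialization (an illustrative choice).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def cluster_split(labels, test_fraction=0.25, seed=0):
    """Draw the test set from every cluster so no cluster is left out."""
    rng = random.Random(seed)
    train, test = [], []
    for c in sorted(set(labels)):
        idx = [i for i, l in enumerate(labels) if l == c]
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_fraction)) if len(idx) > 1 else 0
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)

# Hypothetical two-descriptor data forming two well-separated groups.
descriptors = [(0.1, 0.2), (0.0, 0.3), (0.2, 0.1), (0.1, 0.0),
               (5.0, 5.1), (5.2, 4.9), (4.9, 5.0), (5.1, 5.2)]
labels = kmeans(descriptors, k=2)
train_idx, test_idx = cluster_split(labels)
```

Sampling within each cluster, rather than from the pooled data, is what keeps the test set from falling entirely inside one group of similar compounds.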


“…In this work we use Kmeans [14]. K-means is one of the algorithms that solve clustering problem [14]. This technique can be used for two cases:…”

Section: Quantization (mentioning)

confidence: 99%

A new representation for 3D objects: Binary matrix

Aznag, Kane, Oirrak et al. 2015, 2015 Third World Conference on Complex Systems (WCCS)

In this work, a new method is presented for representing 3D objects as a binary matrix. The method has two stages: normalization and quantization. This representation allows us to compare 3D objects by computing the similarity between them. Our algorithm computes a binary matrix, a frequency matrix and cluster coordinates, so an object can be identified by comparing those representations.


“…However, both studies still carried out a random selection method of molecules in the data partitioning stage. According to [6], a random selection of molecules can lead to a mismatch because all members of the validation data may be members of the same group, thereby resulting in a molecular set that is not representative of the real data. Thus, a method is needed that can produce a representative data set in the data partition stage [2], [6], [7].…”

Section: Introduction (mentioning)

confidence: 99%

Artificial Intelligence Paradigm for Ligand-Based Virtual Screening on the Drug Discovery of Type 2 Diabetes Mellitus

Bustamam, Hamzah, Husna et al. 2021, Preprint

Background: New dipeptidyl peptidase-4 (DPP-4) inhibitors need to be developed to be used as agents with low adverse effects for the treatment of type 2 diabetes mellitus. This study aims to build quantitative structure-activity relationship (QSAR) models using the artificial intelligence paradigm. Random Forest and Deep Neural Network are used to predict QSAR models. We compared principal component analysis (PCA) with sparse PCA as methods for transforming Rotation Forest. K-modes clustering with Levenshtein distance was used for the selection method of molecules, and CatBoost was used for the feature selection method. Results: The amount of the DPP-4 inhibitor molecules resulting from the selection process of molecules using K-Modes clustering algorithm is 1020 with logP range value of -1.6693 to 4.99044. Several fingerprint methods such as extended connectivity fingerprint and functional class fingerprint with diameters of 4 and 6 were used to construct four fingerprint datasets, ECFP_4, ECFP_6, FCFP_4, and FCFP_6. There are 1024 features from the four fingerprint datasets that are then selected using the CatBoost method. CatBoost can represent QSAR models with good performance for machine learning and deep learning methods respectively with evaluation metrics, such as Sensitivity, Specificity, Accuracy, and Matthews correlation coefficient, all valued above 70% with a feature importance level of 60%, 70%, 80%, and 90%. Conclusions: The K-Modes clustering algorithm can produce a representative subset of DPP-4 inhibitor molecules. Feature selection in the fingerprint dataset using CatBoost is best used before making QSAR Classification and QSAR Regression models. QSAR Classification using Machine Learning and QSAR Classification using Deep Learning, each of which has an accuracy of above 70%. The Rotation Forest (PCA) model performed better than the Rotation Forest (Sparse PCA) model, both in the QSAR Classification and QSAR Regression model because the Rotation Forest (PCA) has a more effective time than the Rotation Forest (Sparse PCA).
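The abstract above mentions K-modes clustering with Levenshtein distance for molecule selection. As a rough illustration of the distance itself, here is a standard dynamic-programming sketch. The abstract does not say which molecular strings were compared; the SMILES strings below (`"CCO"`, `"CCN"`) are an assumption chosen only as example input, and `levenshtein` is an illustrative name.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]

# Hypothetical example: two short SMILES strings differing in one atom.
d = levenshtein("CCO", "CCN")
```

A clustering step would then evaluate this function on pairs of strings; only two rows of the distance table are kept at a time, so memory grows with the shorter string rather than with the product of both lengths.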

