Using Graph Indices for the Analysis and Comparison of Chemical Datasets

Fourches, Denis; Tropsha, Alexander

doi:10.1002/minf.201300076

Cited by 25 publications

(18 citation statements)

References 29 publications


“…Compared to other chemical space representations, networks have the additional advantage that they can be characterized and compared in detail using a variety of statistical approaches from general network science [10,11]. However, only very few network-like representations of chemical space have been reported thus far [12][13][14][15].…”

Section: Introduction (mentioning)

confidence: 99%

Design of chemical space networks using a Tanimoto similarity variant based upon maximum common substructures

Zhang, Vogt, Maggiora et al. 2015, J Comput Aided Mol Des


Chemical space networks (CSNs) have recently been introduced as an alternative to other coordinate-free and coordinate-based chemical space representations. In CSNs, nodes represent compounds and edges represent pairwise similarity relationships. In addition, nodes are annotated with compound property information such as biological activity. CSNs have been applied to view biologically relevant chemical space in comparison to random chemical space samples and found to display well-resolved topologies at low edge density levels. The way in which molecular similarity relationships are assessed is an important determinant of CSN topology. Previous CSN versions were based on numerical similarity functions or the assessment of substructure-based similarity. Herein, we report a new CSN design that is based upon combined numerical and substructure similarity evaluation. This has been facilitated by calculating numerical similarity values on the basis of maximum common substructures (MCSs) of compounds, leading to the introduction of MCS-based CSNs (MCS-CSNs). This CSN design combines advantages of continuous numerical similarity functions with a robust and chemically intuitive substructure-based assessment. Compared to earlier versions of CSNs, MCS-CSNs are characterized by a further improved organization of local compound communities as exemplified by the delineation of drug-like subspaces in regions of biologically relevant chemical space.
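The construction described in this abstract (compounds as nodes, edges drawn where an MCS-based Tanimoto value is high enough) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the toy substructure sizes, the 0.5 threshold, and the choice to pass precomputed sizes (for example bond counts) instead of computing the MCS itself are all assumptions made for illustration.

```python
from itertools import combinations


def mcs_tanimoto(size_a, size_b, size_mcs):
    """Tanimoto-style similarity from the sizes of two compounds and their MCS.

    Sizes are assumed to be counts in the same unit (e.g. bonds).
    """
    union = size_a + size_b - size_mcs
    return size_mcs / union if union else 0.0


def build_csn(sizes, mcs_sizes, threshold):
    """Return CSN edges (a, b, similarity) for pairs at or above the threshold.

    sizes: hypothetical mapping compound id -> size
    mcs_sizes: hypothetical mapping (id_a, id_b) with id_a < id_b -> MCS size
    """
    edges = []
    for a, b in combinations(sorted(sizes), 2):
        sim = mcs_tanimoto(sizes[a], sizes[b], mcs_sizes.get((a, b), 0))
        if sim >= threshold:
            edges.append((a, b, sim))
    return edges


def edge_density(n_nodes, n_edges):
    """Fraction of all possible undirected edges that are present."""
    if n_nodes < 2:
        return 0.0
    return 2 * n_edges / (n_nodes * (n_nodes - 1))


# Toy data (invented): three compounds and the sizes of their pairwise MCSs.
sizes = {"A": 10, "B": 12, "C": 8}
mcs_sizes = {("A", "B"): 9, ("A", "C"): 4, ("B", "C"): 3}

edges = build_csn(sizes, mcs_sizes, threshold=0.5)
density = edge_density(len(sizes), len(edges))
```

Raising the threshold removes edges and lowers the edge density, which is the parameter the abstract associates with well-resolved topologies.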


“…Often neglected, these curation steps are critical to detect mis‐annotated chemicals, structural errors, activity cliffs, and inter/intra‐lab experimental variability. Critical when using data extracted from the literature, chemical curation helps maximizing the prediction performances of QSPR models. This is particularly true with the presence of structural duplicates (i.e., identical compounds present several times in the same dataset) that is known to lead to over‐optimistic estimations of the predictivity for developed QSAR models.…”

Section: Methods (mentioning)

confidence: 99%

Cheminformatics Modeling of Amine Solutions for Assessing their CO₂ Absorption Properties

Kuenemann, Fourches 2017, Mol. Inf.

Self Cite


As stricter regulations on CO₂ emissions are adopted worldwide, identifying efficient chemical processes to capture and recycle CO₂ is of critical importance for industry. The most common process known as amine scrubbing suffers from the lack of available amine solutions capable of capturing CO₂ efficiently. Tertiary amines characterized by low heats of reaction are considered good candidates but their absorption properties can significantly differ from one analogue to another despite high structural similarity. Herein, after collecting and curating experimental data from the literature, we have built a modeling set of 41 amine structures with their absorption properties. Then we analyzed their chemical composition using molecular descriptors and non-supervised clustering. Furthermore, we developed a series of quantitative structure-property relationships (QSPR) to assess amines' CO₂ absorption properties from their structural characteristics. These models afforded reasonable prediction performances (e.g., Q² = 0.63 for CO₂ absorption amount) even though they are solely based on 2D chemical descriptors and individual machine learning techniques (random forest and neural network). Overall, we believe the chemical analysis and the series of QSPR models presented in this proof-of-concept study represent new knowledge and innovative tools that could be very useful for screening and prioritizing hypothetical amines to be synthesized and tested experimentally for their CO₂ absorption properties.
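The performance figure quoted in this abstract is a cross-validated coefficient, commonly written Q². The abstract does not say how it was computed, so the sketch below only shows one widely used definition, Q² = 1 − PRESS/TSS, where PRESS is the sum of squared errors of cross-validated predictions and TSS is the total sum of squares around the observed mean. The function name and the toy numbers are invented for illustration and are not data from the study.

```python
def q_squared(observed, cv_predicted):
    """Cross-validated Q^2 = 1 - PRESS / TSS.

    observed: measured property values
    cv_predicted: predictions made for each sample while it was held out
    Assumes the observed values are not all identical (TSS > 0).
    """
    if len(observed) != len(cv_predicted):
        raise ValueError("observed and cv_predicted must have equal length")
    mean_obs = sum(observed) / len(observed)
    # Predictive residual sum of squares over held-out predictions.
    press = sum((y - p) ** 2 for y, p in zip(observed, cv_predicted))
    # Total sum of squares around the mean of the observations.
    tss = sum((y - mean_obs) ** 2 for y in observed)
    return 1 - press / tss


# Toy values (invented): four measurements and their held-out predictions.
observed = [1.0, 2.0, 3.0, 4.0]
cv_predicted = [1.5, 1.5, 3.5, 3.5]
q2 = q_squared(observed, cv_predicted)
```

Some variants replace the global mean with the mean of each training fold, which can change the value slightly on small datasets such as a 41-compound modeling set.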


“…Predictive performance of QSAR models highly depends upon different characteristics (e.g., size, chemical diversity, activity distribution or presence of activity cliffs) of various data sets [49–51]. It may not be always possible to build reliable QSAR models for certain data sets.…”

Section: Automated Model Building (mentioning)

confidence: 99%

An automated framework for QSAR model building

2018


Background: In-silico quantitative structure–activity relationship (QSAR) models based tools are widely used to screen huge databases of compounds in order to determine the biological properties of chemical molecules based on their chemical structure. With the passage of time, the exponentially growing amount of synthesized and known chemicals data demands computationally efficient automated QSAR modeling tools, available to researchers that may lack extensive knowledge of machine learning modeling. Thus, a fully automated and advanced modeling platform can be an important addition to the QSAR community.

Results: In the presented workflow the process from data preparation to model building and validation has been completely automated. The most critical modeling tasks (data curation, data set characteristics evaluation, variable selection and validation) that largely influence the performance of QSAR models were focused. It is also included the ability to quickly evaluate the feasibility of a given data set to be modeled. The developed framework is tested on data sets of thirty different problems. The best-optimized feature selection methodology in the developed workflow is able to remove 62–99% of all redundant data. On average, about 19% of the prediction error was reduced by using feature selection producing an increase of 49% in the percentage of variance explained (PVE) compared to models without feature selection. Selecting only the models with a modelability score above 0.6, average PVE scores were 0.71. A strong correlation was verified between the modelability scores and the PVE of the models produced with variable selection.

Conclusions: We developed an extendable and highly customizable fully automated QSAR modeling framework. This designed workflow does not require any advanced parameterization nor depends on users decisions or expertise in machine learning/programming. With just a given target or problem, the workflow follows an unbiased standard protocol to develop reliable QSAR models by directly accessing online manually curated databases or by using private data sets. The other distinctive features of the workflow include prior estimation of data modelability to avoid time-consuming modeling trials for non modelable data sets, an efficient variable selection procedure and the facility of output availability at each modeling task for the diverse application and reproduction of historical predictions. The results reached on a selection of thirty QSAR problems suggest that the approach is capable of building reliable models even for challenging problems.

Electronic supplementary material: The online version of this article (10.1186/s13321-017-0256-5) contains supplementary material, which is available to authorized users.
