Quantitative structure-property relationships (QSPR) are developed to correlate glass transition temperatures with chemical structure. Monomer structures and repeat unit structures are used to build several QSPR models in Parts 1 and 2 of this study, respectively. Models are developed using numerical descriptors, which encode important topological, electronic, and geometric information about chemical structure. After descriptor generation, multiple linear regression analysis (MLRA) and computational neural networks (CNNs) are used to generate the models. Optimization routines (simulated annealing and genetic algorithm) are utilized to find information-rich subsets of descriptors for prediction. A 10-descriptor CNN model was found to be optimal in predicting Tg values using the monomer structure (Part 1) for 165 polymers. A committee of 10 CNNs produced a training set rms error of 10.1 K (r² = 0.98) and a prediction set rms error of 21.7 K (r² = 0.92). An 11-descriptor CNN model was developed for 251 polymers using the repeat unit structure (Part 2). A committee of CNNs produced a training set rms error of 21.1 K (r² = 0.96) and a prediction set rms error of 21.9 K (r² = 0.96).
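The committee figures above combine two simple operations: averaging the outputs of several networks for each polymer, and scoring the averaged values with a root-mean-square (rms) error. The following is a minimal sketch of those two steps only, not the authors' implementation; the function names and the temperature values are hypothetical and serve purely as illustration.

```python
import math


def committee_predict(member_predictions):
    """Average predictions across committee members, one value per compound.

    member_predictions is a list of lists: one inner list per network,
    each holding that network's predicted value for every compound.
    """
    n_members = len(member_predictions)
    return [sum(column) / n_members for column in zip(*member_predictions)]


def rms_error(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # Hypothetical Tg predictions (in K) from two networks for two polymers.
    network_outputs = [[300.0, 350.0], [310.0, 340.0]]
    averaged = committee_predict(network_outputs)
    # Hypothetical experimental values for the same two polymers.
    print(averaged, rms_error([305.0, 349.0], averaged))
```

Averaging over independently trained networks is a common way to damp the run-to-run variation of any single network, which is presumably why committee results are the ones reported.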
Binary quantitative structure-activity relationship (QSAR) models are developed to classify a data set of 334 aromatic and secondary amine compounds as genotoxic or nongenotoxic based on information calculated solely from chemical structure. Genotoxic endpoints for each compound were determined using the SOS Chromotest in both the presence and absence of an S9 rat liver homogenate. Compounds were considered genotoxic if assay results indicated a positive genotoxicity hit for either the S9-inactivated or S9-activated assay. Each compound in the data set was encoded through the calculation of numerical descriptors that describe various aspects of chemical structure (e.g., topological, geometric, electronic, polar surface area). Furthermore, five additional descriptors that focused on the secondary and aromatic nitrogen atoms in each molecule were calculated specifically for this study. Descriptor subsets were examined using a genetic algorithm search engine interfaced with a k-nearest neighbor fitness evaluator to find the most information-rich subsets, which ultimately served as the final predictive models. Models were chosen for their ability to minimize the total number of misclassifications, with special attention given to models that misclassified fewer positive toxicity hits as nontoxic (false negatives). In addition, a subsetting procedure was used to form an ensemble of models using different combinations of compounds in the training and prediction sets. This was done to ensure that consistent results could be obtained regardless of training set composition. The procedure also allowed each compound to be externally validated three times by different training set data, with the resultant predictions being used in a "majority rules" voting scheme to produce a consensus prediction for each member of the data set.
The individual models produced an average training set classification rate of 71.6% and an average prediction set classification rate of 67.7%. However, the model ensemble was able to correctly classify the genotoxicity of 72.2% of all prediction set compounds.
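The "majority rules" consensus described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' code: the compound names, labels, and function names are hypothetical, and each compound is assumed to carry exactly three external predictions, so a two-class vote can never tie.

```python
from collections import Counter


def consensus_predictions(votes_by_compound):
    """Return the most frequent predicted label for each compound.

    votes_by_compound maps a compound identifier to the labels predicted
    for it by the individual models (three per compound is assumed here).
    """
    consensus = {}
    for compound, votes in votes_by_compound.items():
        label, _count = Counter(votes).most_common(1)[0]
        consensus[compound] = label
    return consensus


def classification_rate(predicted, actual):
    """Percentage of compounds whose predicted label matches the actual one."""
    correct = sum(1 for compound in actual if predicted[compound] == actual[compound])
    return 100.0 * correct / len(actual)


if __name__ == "__main__":
    # Hypothetical votes from three models for two hypothetical compounds.
    votes = {
        "amine_a": ["genotoxic", "nongenotoxic", "genotoxic"],
        "amine_b": ["nongenotoxic", "nongenotoxic", "nongenotoxic"],
    }
    print(consensus_predictions(votes))
```

Because one dissenting model is outvoted by the other two, the consensus can exceed the average accuracy of the individual models, which is consistent with the ensemble rate reported above being higher than the average prediction set rate.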
Mathematical models are developed to find quantitative structure-activity relationships that correlate chemical structure and inhibition toward three carbonic anhydrase (CA) isozymes: CA I, II, and IV. Numerical descriptors are generated to encode important topological, geometric, and electronic features of molecular structure. After descriptor generation, multiple linear regression and computational neural network (CNN) analyses are performed on various descriptor subsets to find superior models for prediction. Committees of five CNNs were utilized to average final predicted values for the 142-compound data set. For inhibitors of CA I, an 8-5-1 CNN committee produced a training set rms error of 0.105 log Ki (r² = 0.994) and a prediction set rms error of 0.208 log Ki (r² = 0.980). Training and prediction set rms errors of 0.140 log Ki (r² = 0.992) and 0.231 log Ki (r² = 0.971), respectively, were produced by a 9-5-1 CNN committee for inhibitors of CA II. For prediction of CA IV inhibitors, an 8-5-1 CNN committee produced training and prediction set rms errors of 0.147 log Ki (r² = 0.992) and 0.211 log Ki (r² = 0.991), respectively. In addition, classification models were built using k-nearest neighbor (kNN) analysis to solve two- and three-class problems for inhibitors of CA IV. A three-descriptor classification model proved superior in labeling compounds as active or inactive inhibitors for the two-class problem. Training and prediction set percent classification rates of 100% and 87.1%, respectively, were obtained. For the three-class (active/moderate/inactive) problem, a five-descriptor model was deemed optimal, producing a training set percent classification rate of 98.8% and a prediction set rate of 79.0%.
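To make the kNN classification step concrete, the sketch below labels a query compound by the majority class among its nearest training compounds in descriptor space. It is only an illustration: the Euclidean distance, the choice of k = 3, the function name, and the three-descriptor values are all assumptions, since the abstract does not state the distance metric, the value of k, or the descriptors used.

```python
import math
from collections import Counter


def knn_classify(train_descriptors, train_labels, query, k=3):
    """Label a query compound by majority vote of its k nearest neighbors.

    Distances are Euclidean in descriptor space (an assumption). Each
    training compound is a tuple of descriptor values of the same length
    as the query.
    """
    distances = sorted(
        (math.dist(descriptors, query), label)
        for descriptors, label in zip(train_descriptors, train_labels)
    )
    nearest_labels = [label for _distance, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical three-descriptor training set for a two-class problem.
    train_x = [
        (0.0, 0.0, 0.0),
        (0.1, 0.0, 0.1),
        (1.0, 1.0, 1.0),
        (0.9, 1.0, 1.1),
        (1.1, 0.9, 1.0),
    ]
    train_y = ["inactive", "inactive", "active", "active", "active"]
    print(knn_classify(train_x, train_y, (1.0, 1.0, 0.9)))
```

In practice descriptors are usually scaled to comparable ranges before distances are computed, so that no single descriptor dominates the neighbor search.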
High-throughput screening (HTS) has enabled millions of compounds to be assessed for biological activity, but challenges remain in the prioritization of hit series. While biological data; absorption, distribution, metabolism, excretion, and toxicity (ADMET) data; purity data; and structural data are routinely used to select chemical matter for further follow-up, the scarcity of historical ADMET data for screening hits limits our understanding of early hit compounds. Herein, we describe a process that utilizes a battery of in-house quantitative structure-activity relationship (QSAR) models to generate in silico ADMET profiles for hit series to enable more complete characterizations of HTS chemical matter. These profiles allow teams to quickly assess hit series for desirable ADMET properties or suspected liabilities that may require significant optimization. Accordingly, these in silico data can direct ADMET experimentation and profoundly impact the progression of hit series. Several prospective examples are presented to substantiate the value of this approach.