A new classification and regression tool, Random Forest, is introduced and investigated for predicting a compound's quantitative or categorical biological activity from a quantitative description of the compound's molecular structure. Random Forest is an ensemble of unpruned classification or regression trees, each grown on a bootstrap sample of the training data with random feature selection during tree induction. A prediction is made by aggregating the predictions of the ensemble: majority vote for classification, averaging for regression. We built predictive models for six cheminformatics data sets. Our analysis demonstrates that Random Forest is a powerful tool whose prediction accuracy is among the best of the methods available to date. We also present three additional features of Random Forest: built-in performance assessment, a measure of the relative importance of descriptors, and a measure of compound similarity that is weighted by the relative importance of descriptors. It is the combination of relatively high prediction accuracy and this collection of desirable features that makes Random Forest uniquely suited for modeling in cheminformatics.
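The ensemble procedure described in this abstract (bootstrap samples, random feature selection at each split, unpruned trees, majority vote) can be sketched in a few lines of plain Python. This is a minimal illustration for classification only, not the authors' implementation: the function names, the Gini split criterion, the square-root default for the number of features tried per split, and the toy descriptor matrix are all assumptions made for the example. The out-of-bag error at the end illustrates the kind of built-in performance assessment the abstract mentions, since each compound is scored only by trees whose bootstrap sample did not contain it.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(X, y, rows, n_try, rng):
    """Best split (lowest weighted Gini) over a random subset of n_try features."""
    best = None
    for f in rng.sample(range(len(X[0])), n_try):
        values = sorted({X[i][f] for i in rows})
        for t in values[:-1]:  # every observed value except the largest
            left = [i for i in rows if X[i][f] <= t]
            right = [i for i in rows if X[i][f] > t]
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    return best


def grow(X, y, rows, n_try, rng):
    """Grow an unpruned tree; internal nodes are dicts, leaves are class labels."""
    labels = [y[i] for i in rows]
    if len(set(labels)) == 1:
        return labels[0]
    split = best_split(X, y, rows, n_try, rng)
    if split is None:  # the sampled features are constant on these rows
        return Counter(labels).most_common(1)[0][0]
    _, f, t, left, right = split
    return {"f": f, "t": t,
            "left": grow(X, y, left, n_try, rng),
            "right": grow(X, y, right, n_try, rng)}


def predict_tree(node, x):
    while isinstance(node, dict):
        node = node["left"] if x[node["f"]] <= node["t"] else node["right"]
    return node


def fit_forest(X, y, n_trees=50, n_try=None, seed=0):
    """Each tree sees a bootstrap sample; the in-bag row set is kept for OOB scoring."""
    rng = random.Random(seed)
    n = len(X)
    n_try = n_try or max(1, int(len(X[0]) ** 0.5))  # assumed default
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        forest.append((grow(X, y, rows, n_try, rng), set(rows)))
    return forest


def predict(forest, x):
    """Majority vote over all trees."""
    votes = Counter(predict_tree(tree, x) for tree, _ in forest)
    return votes.most_common(1)[0][0]


def oob_error(forest, X, y):
    """Error rate when each row is voted on only by trees that did not train on it."""
    wrong = counted = 0
    for i, x in enumerate(X):
        votes = Counter(predict_tree(tree, x)
                        for tree, in_bag in forest if i not in in_bag)
        if votes:
            counted += 1
            wrong += votes.most_common(1)[0][0] != y[i]
    return wrong / counted if counted else float("nan")


# Toy data (made up): 20 "compounds", 3 descriptors, binary activity label.
X = [[i, 2 * i + 1, (3 * i) % 10 + (0 if i < 10 else 10)] for i in range(20)]
y = [0] * 10 + [1] * 10
forest = fit_forest(X, y)
print(predict(forest, [0, 1, 0]), predict(forest, [19, 39, 19]), oob_error(forest, X, y))
```

For regression, the same structure would apply with a variance-based split criterion and averaging in place of the vote.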
How well can a QSAR model predict the activity of a molecule not in the training set used to create the model? A set of retrospective cross-validation experiments using 20 diverse in-house activity sets was carried out to find a good discriminator of prediction accuracy, as measured by the root-mean-square difference between observed and predicted activity. Among the measures we tested, two seem useful: the similarity of the molecule to be predicted to the nearest molecule in the training set, and/or the number of neighbors in the training set, where neighbors are those training-set molecules more similar than a user-chosen cutoff. The molecules with the highest similarity and/or the most neighbors are the best predicted. This trend holds true for narrow training sets and, to a lesser degree, for many diverse training sets, and it does not depend on which QSAR method or descriptor is used. One may define the similarity using a different descriptor than that used for the QSAR model. The similarity dependence for diverse training sets is somewhat unexpected. It appears to be greater for those data sets where the association of similar activities with similar structures (as encoded in the Patterson plot) is stronger. We propose a way to estimate the reliability of the prediction of an arbitrary chemical structure on a given QSAR model, given the training set from which the model was derived.
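The two measures singled out in this abstract, similarity to the nearest training-set molecule and the number of training-set neighbors above a cutoff, can be computed as in the sketch below. The abstract does not say which similarity measure or descriptor was used, so the Tanimoto coefficient on sets of integer feature keys, the function names, and the cutoff value are assumptions chosen only for illustration.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of feature keys."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_measures(query, training_set, cutoff=0.7):
    """Return (similarity to nearest training molecule, number of neighbors).

    A neighbor is a training molecule whose similarity to the query is
    strictly greater than the user-chosen cutoff.
    """
    sims = [tanimoto(query, mol) for mol in training_set]
    nearest = max(sims, default=0.0)
    n_neighbors = sum(s > cutoff for s in sims)
    return nearest, n_neighbors


# Hypothetical fingerprints: each molecule is a set of substructure keys.
training = [{1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8, 9}]
print(similarity_measures({1, 2, 3, 4}, training, cutoff=0.5))
print(similarity_measures({7, 8, 10}, training, cutoff=0.5))
```

Under the trend reported above, the first query (identical to a training molecule, with two neighbors) would be expected to be predicted more reliably than the second, which has no neighbor above the cutoff.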
A three-body potential suitable for molecular dynamics (MD) simulations has been developed for vitreous silica by adding three-body interactions to the Born–Mayer–Huggins (BMH) pair potential. Previous MD simulations with the BMH potential have formed glassy SiO2 through the melt-quench method with some success. Although bond lengths were found to be in fair agreement with experiment, the distribution of tetrahedral angles was too broad and the model glass contained 6%–8% bond defects. This indicates a lack of the local order that is present in the laboratory glass. The nature of the short-range order is expected to play an important role when investigating defect formation, surface reconstruction, or surface reactivities. An attempt has been made to increase the local order in the simulated glass by including a direction-dependent term in the effective potential to model the partial covalency of the Si–O bond. The vitreous state obtained through MD simulation with this modified BMH potential shows an increase in the short-range order, with a narrow O–Si–O angle distribution peaked about the tetrahedral angle and a low concentration of bond defects, typically ∼1%–2%. The static structure factor S(q) is calculated and found to be in good agreement with neutron scattering results. Intermediate-range order is also discussed with reference to the distribution of ring sizes.
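To make the idea of a direction-dependent term concrete, the sketch below computes an O–Si–O angle from atomic coordinates and evaluates a simple angular penalty that vanishes at the tetrahedral angle, arccos(−1/3) ≈ 109.47°. The abstract does not give the functional form of the three-body interaction, so the quadratic-in-cosine penalty, the `strength` parameter, and the function names are hypothetical stand-ins and not the published potential; a real three-body term would also include radial cutoff factors.

```python
import math

COS_TETRAHEDRAL = -1.0 / 3.0  # cosine of the ideal tetrahedral angle


def cos_angle(center, a, b):
    """Cosine of the angle a-center-b for 3D points given as (x, y, z)."""
    u = [p - c for p, c in zip(a, center)]
    v = [p - c for p, c in zip(b, center)]
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)


def angle_degrees(center, a, b):
    """Angle a-center-b in degrees, clamped against rounding error."""
    c = max(-1.0, min(1.0, cos_angle(center, a, b)))
    return math.degrees(math.acos(c))


def angular_penalty(center, a, b, strength=1.0):
    """Assumed illustrative form: strength * (cos(theta) - cos(theta_0))**2."""
    return strength * (cos_angle(center, a, b) - COS_TETRAHEDRAL) ** 2


# Si at the origin with two O atoms on vertices of an ideal tetrahedron.
si = (0.0, 0.0, 0.0)
o1, o2 = (1.0, 1.0, 1.0), (1.0, -1.0, -1.0)
print(angle_degrees(si, o1, o2), angular_penalty(si, o1, o2))
```

Collecting `angle_degrees` over all O–Si–O triplets in a simulated configuration would give the kind of angle distribution the abstract compares before and after the three-body term is added.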