There is a great need to assess the harmful effects, or toxicities, of chemicals to which humans are exposed. In the present paper, a string kernel based on the simplified molecular input line entry specification (SMILES) representation, together with the state-of-the-art support vector machine (SVM) algorithm, was used to classify the toxicity of chemicals from the US Environmental Protection Agency Distributed Structure-Searchable Toxicity (DSSTox) database network. In this method, the molecular structure is encoded directly by a series of SMILES substrings that represent the presence of certain chemical elements and bond features (double bonds, triple bonds and stereochemistry) in the molecules. Thus, the SMILES string kernel can measure the similarity of molecules directly from local structural information contained in their SMILES strings. Two model validation approaches, five-fold cross-validation and an independent validation set, were used to assess the predictive capability of the developed models. The results indicate that an SVM based on the SMILES string kernel is a promising alternative modelling approach for predicting the potential toxicity of chemicals.
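The idea of comparing molecules through shared SMILES substrings can be illustrated with a minimal sketch. The abstract does not specify the exact kernel, so the code below assumes a simple spectrum-style kernel: it counts every contiguous substring up to a fixed length and takes a cosine-normalized dot product of the counts. The function names and the `max_len` parameter are hypothetical, and the sketch works on raw characters, so two-letter element symbols such as `Cl` are not treated as single tokens.

```python
from collections import Counter
from math import sqrt


def substring_counts(smiles, max_len=3):
    """Count every contiguous substring of length 1..max_len in a SMILES string."""
    counts = Counter()
    for k in range(1, max_len + 1):
        for start in range(len(smiles) - k + 1):
            counts[smiles[start:start + k]] += 1
    return counts


def string_kernel(a, b, max_len=3):
    """Cosine-normalized dot product of substring counts (assumed kernel form)."""
    ca = substring_counts(a, max_len)
    cb = substring_counts(b, max_len)
    # Only substrings present in both strings contribute to the dot product.
    dot = sum(n * cb[s] for s, n in ca.items())
    norm = sqrt(sum(n * n for n in ca.values()) * sum(n * n for n in cb.values()))
    return dot / norm if norm else 0.0


def gram_matrix(smiles_list, max_len=3):
    """Pairwise kernel values for a list of SMILES strings."""
    return [[string_kernel(a, b, max_len) for b in smiles_list] for a in smiles_list]


if __name__ == "__main__":
    molecules = ["CCO", "CCCO", "c1ccccc1"]  # ethanol, 1-propanol, benzene
    for row in gram_matrix(molecules):
        print([round(v, 3) for v in row])
```

A Gram matrix of this kind could then be handed to any SVM implementation that accepts a precomputed kernel; which implementation the authors used is not stated in the abstract.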
Toxicity of chemicals induced by different factors is an important consideration, especially during the drug research and development process. Thus, there is an urgent need to develop computationally effective models that can predict the toxicity or adverse effects of a specific class of chemicals. In this study, random forest (RF) was used to classify five toxicity data sets from the Distributed Structure-Searchable Toxicity database network, using substructure fingerprints calculated directly from the molecular structure. Three model validation approaches, out-of-bag validation incorporated in RF, five-fold cross-validation, and an independent validation set, were used to assess the predictive capability of the models. The chemical space of the data sets was explored with multidimensional scaling plots, and outlying molecules were detected using the proximity measure in RF. In addition, the important substructure fingerprints identified by RF gave some insight into the structural features related to the toxicity of chemicals. The results showed that these in silico classification models, built with substructure patterns and RF, are applicable to predicting the potential toxicity of chemical compounds.
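Out-of-bag validation, mentioned above as built into RF, can be sketched in a few lines. Each tree is trained on a bootstrap sample, and the samples it never saw serve as its validation set. The code below is a hedged illustration of that bookkeeping only, not the authors' implementation: the function names are hypothetical, and each "tree" is represented by an arbitrary `predict` callable rather than a real decision tree.

```python
import random
from collections import Counter


def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in drawn]
    return in_bag, out_of_bag


def oob_error(features, labels, trees):
    """Majority-vote error where each sample is scored only by trees that did not train on it.

    `trees` is a list of (predict, out_of_bag) pairs; `predict` maps one
    feature vector to a label. Samples that are never out of bag are skipped.
    """
    votes = [Counter() for _ in labels]
    for predict, out_of_bag in trees:
        for i in out_of_bag:
            votes[i][predict(features[i])] += 1
    scored = [(i, v.most_common(1)[0][0]) for i, v in enumerate(votes) if v]
    if not scored:
        return None
    return sum(pred != labels[i] for i, pred in scored) / len(scored)


if __name__ == "__main__":
    rng = random.Random(0)
    in_bag, out_of_bag = bootstrap_split(10, rng)
    print("in bag:", sorted(in_bag))
    print("out of bag:", out_of_bag)
```

Because every sample is left out of some bootstrap draws, this estimate needs no separate hold-out set, which is why it can complement five-fold cross-validation and an independent validation set.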