There is a pressing need for accurate in silico methods to predict the toxicity of molecules that are being introduced into the environment or are being developed into new pharmaceuticals. Predictive toxicology is in the realm of structure activity relationships (SAR), and many approaches have been used to derive such SAR. Previous work has shown that inductive logic programming (ILP) is a powerful approach that circumvents several major difficulties, such as molecular superposition, faced by some other SAR methods. The ILP approach reasons with chemical substructures within a relational framework and yields chemically understandable rules. Here, we report a general new approach, support vector inductive logic programming (SVILP), which extends the essentially qualitatiVe ILP-based SAR to quantitatiVe modeling. First, ILP is used to learn rules, the predictions of which are then used within a novel kernel to derive a support-vector generalization model. For a highly heterogeneous dataset of 576 molecules with known fathead minnow fish toxicity, the cross-validated correlation coefficients (R 2 CV ) from a chemical descriptor method (CHEM) and SVILP are 0.52 and 0.66, respectively. The ILP, CHEM, and SVILP approaches correctly predict 55, 58, and 73%, respectively, of toxic molecules. In a set of 165 unseen molecules, the R 2 values from the commercial software TOPKAT and SVILP are 0.26 and 0.57, respectively. In all calculations, SVILP showed significant improvements in comparison with the other methods. The SVILP approach has a major advantage in that it uses ILP automatically and consistently to derive rules, mostly novel, describing fragments that are toxicity alerts. The SVILP is a general machine-learning approach and has the potential of tackling many problems relevant to chemoinformatics including in silico drug design.
In this paper, we describe our approach for addressing Task 1 in the KDD CUP 2002 competition. The approach is based on developing and using an improved automatic feature selection method in conjunction with traditional classifiers. The feature selection method used is based on capturing frequently occurring keyword combinations (or motifs) within short segments of the text of a document and has proved to produce more accurate classification results than approaches relying solely on using keyword-based features.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.