ABSTRACT: Numerous experimental and computational approaches have been developed to predict human drug metabolism. Since databases of human drug metabolism information are widely available, these can be used to train computational algorithms and generate predictive approaches. In turn, they may be used to assist in the identification of possible metabolites from a large number of molecules in drug discovery based on molecular structure alone. In the current study we have used a commercially available database (MetaDrug) and extracted a fraction of the human drug metabolism data. These data were used along with augmented atom descriptors in a predictive machine learning model, kernel-partial least squares (K-PLS). A total of 317 molecules, including parent drugs and their primary and secondary (sequential) metabolites, were used to build these models corresponding to individual metabolism rules, representing the formation of discrete metabolites, e.g., N-dealkylation. Each model was internally validated to assess the capability to classify other molecules that were left out. Using receiver operator curve statistics, models for N-dealkylation, O-dealkylation, aromatic hydroxylation, aliphatic hydroxylation, O-glucuronidation, and O-sulfation gave area under the curve values from 0.75 to 0.84 and were able to predict between 61 and 79% of active molecules upon leave-one-out testing. This preliminary study indicates that K-PLS and possibly other similar machine learning methods (such as support vector machines) can be applied to predicting human drug metabolite formation in a classification manner. Improvements can be achieved using considerably larger datasets that contain more positive examples for the less frequently occurring metabolite rules, as well as the external evaluation of novel molecules.

With the emphasis now on increasing the efficiency of drug discovery, there is interest in using predictive computational approaches to complement in vitro and in vivo studies.
In the area of metabolism prediction, these techniques encompass pharmacophores (Ekins et al., 2001), quantitative structure-activity relationships (QSARs) (Shen et al., 2003; Balakin et al., 2004), electronic models (Korzekwa et al., 2004), and commercial drug metabolism databases (Borodina et al., 2004), as well as other methods that have been comprehensively reviewed elsewhere (de Graaf et al., 2005; Ekins et al., 2005a; de Groot, 2006). Some approaches have combined metabolite data and rules for suggesting metabolic pathways across multiple species (Erhardt, 2003). Such databases may also be useful for calculating the probability for a given metabolic reaction (Boyer and Zamora, 2002) to then indicate potential metabolites and the sites of metabolism using statistical or algorithmic approaches (Borodina et al., 2004). Although these types of comprehensive databases generally enable numerous search options to retrieve molecule structures and published information, the predictive capabilities seem limited at present (Wishart et al., 2006). A major limita...