SUMMARY
Following the recent availability of high-throughput data for drug discovery, computational methods, especially machine learning based approaches, have gained remarkable attention. A number of studies use chemical, target and side effect similarity between drugs to build knowledge-based models that predict drug indications and drug-drug interactions. In light of previous works demonstrating the perils of cross-validation using paired data, in this study we employ a disjoint cross-validation approach for similarity-based drug-drug interaction (DDI) prediction and investigate the prediction accuracy of the classifier under various settings. Our results show that the measured prediction accuracy of drug similarity-based classifiers operating on paired data, such as pharmacokinetic interactions between drugs, depends on the cross-validation strategy used to evaluate it.
KEYWORDS
Rational drug design; Drug-drug interaction; Paired data; Disjoint cross-validation; K-nearest-neighbor; Logistic regression.
AVAILABILITY AND REQUIREMENTS
The Jupyter Notebook containing the code used in this analysis, named interaction.ipynb, is available in the Repurpose framework (github.com/emreg00/repurpose).
BODY
Following the recent availability of high-throughput data for drug discovery, computational methods, especially machine learning based approaches, have gained remarkable attention. A number of studies use chemical, target and side effect similarity between drugs to build knowledge-based models that predict drug indications and drug-drug interactions [1][2][3]. The proposed models are typically benchmarked using cross-validation, in which the known drug-disease or drug-drug associations are split into training and test sets. Though these methods report areas under the receiver operating characteristic (ROC) curve of around 90% under cross-validation, their applicability in translational medicine and, thus, their ability to reduce drug development costs has been controversial [2,4,5].

In light of previous works highlighting the perils of cross-validation using paired data [6,7], we recently investigated the effect of using drug-wise disjoint cross-validation in predicting drug-disease pairs, where none of the drugs in the training set appeared in the test set [8]. We showed that the prediction accuracy of the classifier drops dramatically under such a cross-validation setting, suggesting that the existing approaches are prone to over-fitting due to the inherent relationships in the data.

Here, we turn our attention to disjoint cross-validation of similarity-based drug-drug interaction (DDI) prediction (Figure 1). Owing to the larger number of known drug-drug interactions, compared to the number of known drug-disease associations used in our previous study, we explore the effect of sample size in the data set. We use the code and data provided within the Repurpose framework [8] and train a logistic regression classifier to predict DDIs using drug chemical, target and side effect similarity calculated via a k-nearest-neighbor approach (k = 20, see [8] for details...
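
To make the idea of drug-wise disjoint cross-validation on paired data concrete, the following is a minimal sketch and not the Repurpose implementation. It assumes one possible way of enforcing disjointness: drugs are partitioned into folds, test pairs consist of two held-out drugs, training pairs consist of two non-held-out drugs, and pairs that straddle the two groups are discarded. The function name `disjoint_pair_folds` and the drug identifiers are hypothetical.

```python
import random
from itertools import combinations


def disjoint_pair_folds(pairs, n_folds=5, seed=0):
    """Yield (train_pairs, test_pairs) such that no drug occurs in both.

    Assumption for illustration: pairs with one held-out drug and one
    training drug are dropped, since keeping them on either side would
    leak a drug across the split.
    """
    # Collect every drug that appears in at least one pair.
    drugs = sorted({drug for pair in pairs for drug in pair})
    rng = random.Random(seed)
    rng.shuffle(drugs)
    # Assign each drug (not each pair) to exactly one fold.
    fold_of = {drug: i % n_folds for i, drug in enumerate(drugs)}
    for fold in range(n_folds):
        test = [(a, b) for a, b in pairs
                if fold_of[a] == fold and fold_of[b] == fold]
        train = [(a, b) for a, b in pairs
                 if fold_of[a] != fold and fold_of[b] != fold]
        yield train, test


# Hypothetical toy data: all unordered pairs among 20 placeholder drugs.
drugs = [f"drug{i}" for i in range(20)]
pairs = list(combinations(drugs, 2))
folds = list(disjoint_pair_folds(pairs, n_folds=5))

for train, test in folds:
    train_drugs = {d for pair in train for d in pair}
    test_drugs = {d for pair in test for d in pair}
    # The defining property of the disjoint setting.
    assert train_drugs.isdisjoint(test_drugs)
```

In a conventional (non-disjoint) split the pairs themselves would be shuffled into folds, so the same drug would usually appear on both sides. Note that the disjoint variant sketched here uses fewer pairs per fold, which is one reason the sample size matters when comparing the two settings.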