Given activity training data from Hight-Throughput Screening (HTS) experiments, virtual HighThroughput Screening (vHTS) methods aim to predict in silico the activity of untested chemicals. We present a novel method, the Influence Relevance Voter (IRV), specifically tailored for the vHTS task. The IRV is a low-parameter neural network which refines a k-nearest neighbor classifier by non-linearly combining the influences of a chemical's neighbors in the training set. Influences are decomposed, also non-linearly, into a relevance component and a vote component.The IRV is benchmarked using the data and rules of two large, open, competitions, and its performance compared to the performance of other participating methods, as well as of an in-house Support Vector Machine (SVM) method. On these benchmark datasets, IRV achieves state-of-theart results, comparable to the SVM in one case, and significantly better than the SVM in the other, retrieving three times as many actives in the top 1% of its prediction-sorted list.The IRV presents several other important advantages over SVMs and other methods: (1) the output predictions have a probabilistic semantic; (2) the underlying inferences are interpretable; (3) the training time is very short, on the order of minutes even for very large data sets; (4) the risk of overfitting is minimal, due to the small number of free parameters; and (5) additional information can easily be incorporated into the IRV architecture. Combined with its performance, these qualities make the IRV particularly well suited for vHTS.Virtual High-Throughput Screening (vHTS) is the cost-effective, in silico complement of experimental HTS. A vHTS algorithm uses data from HTS experiments to predict the activity of new sets of compounds in silico. Although vHTS is sometimes cast as a classification task, it is more appropriately described as a ranking task, where the goal is to rank additional compounds, such that active compounds are close to the the top of the prediction-sorted list as possible. The experiments required to verify a hit are expensive, so it is critical that true actives be recognized as early as possible. Accurately ordering actives by their degree of activity, however, is not critical. The vHTS task, therefore, differs from the `ranking' task of the machine learning literature, in that the goal is not to precisely order the chemicals in relation to each other, but rather to globally rank as many actives as possible above the bulk of the inactives. Furthermore, proper vHTS training data for the ranking task, is often unavailable. E-mail: sswamida@ics.uci
NIH-PA Author ManuscriptNIH-PA Author Manuscript
NIH-PA Author ManuscriptAn important algorithm proposed for vHTS is the k-Nearest Neighbor (kNN) classifier, a nonparametric method which has been shown effective in a number of other problems. 1,2 In the kNN approach, each new data point is classified by integrating information from its neighborhood in the training set in a very simple way. Specifically, a new data point is assigne...