The analyst first selects the k variables with the largest absolute correlations with the label. However, he or she verifies the correlations (with the label) on the holdout set and uses only those variables whose correlation agrees in sign with the correlation on the training set and for which both correlations are larger than some threshold in absolute value. The analyst then creates a simple linear threshold classifier on the selected variables, using only the signs of the correlations of the selected variables. A final test evaluates the classification accuracy of the classifier on the holdout set. Full details of the analyst's algorithm can be found in section 3 of (17).

In our first experiment, each attribute is drawn independently from the normal distribution N(0,1), and we choose the class label y ∈ {−1, 1} uniformly at random, so that there is no correlation between the data point and its label. We chose n = 10,000 and d = 10,000 and varied the number of selected variables k. In this scenario no classifier can achieve true accuracy better than 50%. Nevertheless, reusing a standard holdout results in reported accuracy above 63 ± 0.4% for k = 500 on both the training set and the holdout set. The average and standard deviation of results obtained from 100 independent executions of the experiment are plotted in Fig. 1A, which also includes the accuracy of the classifier on another fresh data set of size n drawn from the same distribution. We then executed the same algorithm with our reusable holdout. The algorithm Thresholdout was invoked with T = 0.04 and t = 0.01, which explains why the accuracy of the classifier reported by Thresholdout is off by up to 0.04 whenever the accuracy on the holdout set is within 0.04 of the accuracy on the training set. Thresholdout prevents the algorithm from overfitting to the holdout set and gives a valid estimate of classifier accuracy. In Fig. 1B, we plot the accuracy of the classifier as reported by Thresholdout. In addition, in fig. S2 we include a plot of the actual accuracy of the produced classifier on the holdout set.

In our second experiment, the class labels are correlated with some of the variables. As before, the label is chosen uniformly at random from {−1, 1}, and each attribute is drawn from N(0,1), aside from 20 attributes drawn from N(y·0.06, 1), where y is the class label. We execute the same algorithm on these data with both the standard holdout and Thresholdout and plot the results in Fig. 2. Our experiment shows that when using the reusable holdout, the algorithm still finds a good classifier while preventing overfitting.

Overfitting to the standard holdout set arises in our experiment because the analyst reuses the holdout after using it to measure the correlation of single attributes. We first note that neither cross-validation nor the bootstrap resolves this issue. If we used either of these methods to validate the correlations, overfitting would still arise as a result of using the same data for training and validation (...
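To make the analyst's procedure concrete, the following is a minimal Python sketch assuming numpy; the function names (correlations, analyst, accuracy) and the selection threshold are illustrative choices of ours, not taken verbatim from (17).

```python
import numpy as np

def correlations(X, y):
    # Empirical correlation of each attribute with the label; for N(0,1)
    # attributes and +/-1 labels this is the mean of x_i * y per column.
    return X.T @ y / X.shape[0]

def analyst(X_train, y_train, X_hold, y_hold, k, threshold):
    # Select the k variables with the largest absolute training correlation.
    c_train = correlations(X_train, y_train)
    top_k = np.argsort(-np.abs(c_train))[:k]
    # Keep only variables whose holdout correlation agrees in sign and
    # whose magnitude exceeds the threshold on both sets.
    c_hold = correlations(X_hold, y_hold)
    keep = np.array([i for i in top_k
                     if np.sign(c_train[i]) == np.sign(c_hold[i])
                     and min(abs(c_train[i]), abs(c_hold[i])) > threshold],
                    dtype=int)
    # Linear threshold classifier using only the signs of the correlations.
    w = np.zeros(X_train.shape[1])
    w[keep] = np.sign(c_train[keep])
    return w

def accuracy(w, X, y):
    return np.mean(np.sign(X @ w) == y)

# First experiment: no true signal, so no classifier can beat 50%.
# n = d = 10,000 as in the paper; shrink both for a quick run.
rng = np.random.default_rng(0)
n, d, k = 10_000, 10_000, 500
X_train, X_hold = rng.standard_normal((n, d)), rng.standard_normal((n, d))
y_train, y_hold = rng.choice([-1, 1], n), rng.choice([-1, 1], n)
w = analyst(X_train, y_train, X_hold, y_hold, k, threshold=1.0 / np.sqrt(n))
print(accuracy(w, X_train, y_train), accuracy(w, X_hold, y_hold))
```

Because the holdout is consulted during variable selection and again for the final accuracy estimate, this sketch should reproduce the qualitative effect above: on pure noise, both reported accuracies drift well above 50%.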
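The reusable holdout itself can be sketched as follows. The Laplace noise scales, the noisy threshold, and the budget handling reflect our reading of the Thresholdout description in (17); treat the constants and the class interface as assumptions rather than the paper's exact algorithm.

```python
import numpy as np

class Thresholdout:
    """Reusable holdout: answers mean queries with the training set unless
    the holdout disagrees by more than a noisy threshold."""

    def __init__(self, train, holdout, T=0.04, sigma=0.01, budget=1000, seed=0):
        self.train, self.holdout = train, holdout
        self.T, self.sigma, self.budget = T, sigma, budget
        self.rng = np.random.default_rng(seed)
        # Noisy copy of the threshold, refreshed after each budget charge.
        self.T_hat = T + self.rng.laplace(scale=2 * sigma)

    def query(self, phi):
        """phi maps a labeled example to [0, 1]; returns a mean estimate."""
        if self.budget < 1:
            return None  # budget exhausted; mechanism stops answering
        mean_t = np.mean([phi(z) for z in self.train])
        mean_h = np.mean([phi(z) for z in self.holdout])
        eta = self.rng.laplace(scale=4 * self.sigma)
        if abs(mean_h - mean_t) > self.T_hat + eta:
            # Train/holdout gap detected: charge the budget, refresh the
            # noisy threshold, and answer with a noisy holdout value.
            self.budget -= 1
            self.T_hat = self.T + self.rng.laplace(scale=2 * self.sigma)
            return mean_h + self.rng.laplace(scale=self.sigma)
        # Otherwise the training value is already a valid answer.
        return mean_t
```

In the experiments above, each correlation check and the final accuracy estimate would be issued as such queries with T = 0.04 and noise rate 0.01, which is why reported values can deviate from the holdout value by up to T.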
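Only the data-generating step changes in the second experiment. A sketch of that change, where the choice of which 20 columns carry the signal is arbitrary and the variable names are again ours:

```python
# Second experiment: 20 of the d attributes carry signal.
# Each informative attribute is N(0.06 * y, 1); the rest stay N(0, 1).
y = rng.choice([-1, 1], n)
X = rng.standard_normal((n, d))
X[:, :20] += 0.06 * y[:, None]  # shift the first 20 columns by 0.06 * label
```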