SUMMARY
Computational algorithms for selecting subsets of regression variables are discussed. Only linear models and the least‐squares criterion are considered. The use of planar‐rotation algorithms, instead of Gauss–Jordan methods, is advocated. The advantages and disadvantages of a number of “cheap” search methods are described for use when it is not feasible to carry out an exhaustive search for the best‐fitting subsets.
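As an illustration of what an exhaustive search involves (and not of the planar-rotation algorithms the paper advocates, which update an orthogonal factorization rather than refitting), the following Python sketch finds the best-fitting subset of a given size by naive refitting of every candidate subset; all names are illustrative, not from the paper.

```python
# Naive exhaustive best-subsets search by residual sum of squares (RSS).
# Each subset is refitted from scratch; planar-rotation (Givens) methods
# would instead update a factorization as variables enter and leave.
from itertools import combinations
import numpy as np

def best_subset(X, y, size):
    """Return the columns of X of the given size with the smallest RSS."""
    best_rss, best_cols = np.inf, None
    for cols in combinations(range(X.shape[1]), size):
        idx = list(cols)
        beta, *_ = np.linalg.lstsq(X[:, idx], y, rcond=None)
        rss = float(np.sum((y - X[:, idx] @ beta) ** 2))
        if rss < best_rss:
            best_rss, best_cols = rss, cols
    return best_cols, best_rss

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))
y = X[:, 1] + 0.5 * X[:, 4] + rng.standard_normal(50)
print(best_subset(X, y, size=2))   # typically recovers columns (1, 4)
```

The number of candidate subsets grows combinatorially with the number of variables, which is why the "cheap" search methods mentioned above become necessary.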
Hypothesis testing for three purposes is considered, namely (i) testing whether the regression coefficients of the remaining (unselected) variables are zero, (ii) comparing subsets, and (iii) testing for any predictive value in a selected subset. Three small data sets are used to illustrate these tests. Spjøtvoll's (1972a) test is discussed in detail, though an extension to this test appears desirable.
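For purpose (i), one classical procedure, offered here as a hedged sketch rather than as the paper's own method, is the partial F-test comparing the selected subset against the full set of variables. Its nominal distribution assumes the subset was fixed in advance; selection by searching the same data distorts it, which is part of what motivates tests such as Spjøtvoll's.

```python
# Sketch of the classical partial F-test for purpose (i): testing
# whether the coefficients of the variables NOT in the selected subset
# are all zero.  The nominal F distribution assumes the subset was
# chosen without reference to y.  Illustrative, not from the paper.
import numpy as np
from scipy.stats import f as f_dist

def rss(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

def partial_f_test(X_full, X_sub, y):
    n, k = X_full.shape
    p = X_sub.shape[1]
    F = ((rss(X_sub, y) - rss(X_full, y)) / (k - p)) / (rss(X_full, y) / (n - k))
    return F, f_dist.sf(F, k - p, n - k)   # statistic and p-value

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 6))
y = X[:, 0] + rng.standard_normal(40)
print(partial_f_test(X, X[:, :2], y))   # large p-value: columns 2..5 add nothing
```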
Estimation problems have largely been overlooked in the past. Three types of bias are identified, namely that due to the omission of variables, that due to competition for selection and that due to the stopping rule. The emphasis here is on competition bias, which can be of the order of two or more standard errors when coefficients are estimated from the same data as were used to select the subset. Five possible ways of handling this bias are listed. This is the area most urgently requiring further research.
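The following Monte-Carlo sketch, an assumption-laden illustration rather than anything from the paper, shows how competition for selection alone can shift an estimated coefficient by roughly two standard errors when selection and estimation use the same data.

```python
# Competition bias: among many candidate predictors whose true
# coefficients are all zero, pick the one most correlated with y and
# estimate its coefficient from the SAME data.  The selected estimate
# sits roughly two standard errors away from its true value of zero.
import numpy as np

rng = np.random.default_rng(1)
n, k, reps = 50, 20, 2000
ratios = []
for _ in range(reps):
    X = rng.standard_normal((n, k))
    y = rng.standard_normal(n)            # y is unrelated to every column of X
    j = np.argmax(np.abs(X.T @ y))        # the variable winning the competition
    x = X[:, j]
    beta = x @ y / (x @ x)                # least-squares slope (no intercept)
    resid = y - beta * x
    se = np.sqrt(resid @ resid / (n - 1)) / np.sqrt(x @ x)
    ratios.append(abs(beta) / se)
print(np.mean(ratios))   # around 2: a bias of about two standard errors
```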
Mean squared errors of prediction and stopping rules are briefly discussed. Competition bias invalidates existing stopping rules as they are commonly applied in attempts to produce optimal prediction equations.
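As one concrete example of such a stopping rule (chosen here purely for illustration; the paper does not prescribe it), Mallows' Cp compares the RSS of a subset against an estimate of the error variance from the full model, and the code comment restates why its naive use after selection is suspect.

```python
# Sketch of Mallows' Cp, a common stopping rule: Cp = RSS_p / s^2 - n + 2p,
# with s^2 estimated from the full model.  Subsets with Cp near p are
# conventionally judged adequate, but applying such rules to subsets
# found by searching the same data inherits competition bias, so the
# nominal calibration no longer holds.  Illustrative only.
import numpy as np

def mallows_cp(X_full, X_sub, y):
    n, k = X_full.shape
    p = X_sub.shape[1]
    def rss(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return float(np.sum((y - X @ beta) ** 2))
    s2 = rss(X_full) / (n - k)          # sigma^2 estimated from the full model
    return rss(X_sub) / s2 - n + 2 * p
```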