The Standards for Educational and Psychological Testing (1985) recommended that test publishers provide multiple estimates of the standard error of measurement—one estimate for each of a number of widely spaced score levels. The presumption is that the standard error varies across score levels, and that the interpretation of test scores should take into account the estimate applicable to the specific level of the examinee. This study compared five methods of estimating conditional standard errors. All five methods yielded a maximum value close to the middle of the score scale, with a sharp decline occurring near the extremes of the scale. These trends probably characterize the raw score standard error of most standardized achievement and ability tests. Other types of tests, constructed using alternative principles, might well exhibit different trends, however. Two methods of estimation were recommended: an approach suggested by Thorndike (1951), based on polynomial smoothing of point estimates at specific score levels, and a modification proposed by Keats (1957) of the error variance derived under the binomial error model of Lord (1955).
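The binomial error model and Keats' adjustment mentioned above lend themselves to a brief worked illustration. The Python sketch below uses hypothetical test length and reliability values; the formulas follow the usual textbook statements of Lord's conditional error variance, x(n − x)/(n − 1), and Keats' reliability-based rescaling, not anything quoted from the report itself. It shows why both estimates peak near the middle of the raw score scale and fall off toward the extremes.

```python
import numpy as np

def lord_binomial_csem(x, n):
    """Conditional SEM under Lord's (1955) binomial error model:
    error variance at raw score x on an n-item test is x(n - x)/(n - 1)."""
    return np.sqrt(x * (n - x) / (n - 1))

def keats_csem(x, n, reliability, kr21):
    """Keats' (1957) modification: rescale the binomial error variance by
    (1 - reliability)/(1 - KR21) so that the average conditional error
    variance is consistent with the test's overall reliability."""
    scale = (1.0 - reliability) / (1.0 - kr21)
    return np.sqrt(x * (n - x) / (n - 1) * scale)

if __name__ == "__main__":
    n_items = 40                              # hypothetical test length
    scores = np.arange(0, n_items + 1)
    csem_lord = lord_binomial_csem(scores, n_items)
    csem_keats = keats_csem(scores, n_items, reliability=0.90, kr21=0.88)
    # Both curves peak near mid-scale and drop sharply at the extremes,
    # matching the trend described in the abstract.
    for x, a, b in zip(scores[::10], csem_lord[::10], csem_keats[::10]):
        print(f"score={x:3d}  Lord CSEM={a:5.2f}  Keats CSEM={b:5.2f}")
```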
This report describes the extensive computer simulation work done in developing the computer adaptive versions of the Graduate Record Examinations (GRE) Board General Test and the College Board Admissions Testing Program (ATP) SAT. Both the GRE General and SAT computer adaptive tests (CATs) are fixed-length tests developed from pools of items calibrated with the three-parameter logistic IRT model. Item selection was based on the recently developed weighted deviations algorithm (see Swanson and Stocking, 1992), which simultaneously handles content, statistical, and other constraints in the item selection process. For the GRE General CATs (Verbal, Quantitative, and Analytical), item exposure was controlled using an extension of an approach originally developed by Sympson and Hetter (1988). For the SAT CATs (Verbal and Mathematical), item exposure was controlled using a less complex randomization approach. The lengths of the CATs were set so that CAT reliabilities matched or exceeded those of the comparable full-length paper-and-pencil tests.
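In its basic form, the Sympson and Hetter approach referenced above is a probabilistic exposure filter: each item carries an exposure control parameter k, and when the selection algorithm nominates the item it is actually administered only with probability k, with the k values tuned through simulation so that administration rates stay below a target ceiling. The Python sketch below illustrates only this basic idea, not the ETS extension used for the GRE CATs; the item names, exposure parameters, and pool are invented for illustration.

```python
import random

def administer_with_exposure_control(ranked_items, k):
    """Basic Sympson-Hetter-style filter: walk down the candidate list
    (best-fitting first, e.g. as ordered by a weighted-deviations selection)
    and administer item i with probability k[i]; otherwise try the next
    candidate. k[i] = 1.0 means the item is never suppressed."""
    for item in ranked_items:
        if random.random() <= k.get(item, 1.0):
            return item
    return ranked_items[-1]          # fall back to the last candidate

# Hypothetical exposure parameters, tuned offline by simulation so that
# frequently selected items are administered less often than selected.
k = {"item_A": 0.35, "item_B": 0.80, "item_C": 1.00}
ranked = ["item_A", "item_B", "item_C"]   # order produced by item selection

counts = {i: 0 for i in ranked}
for _ in range(10_000):                   # simulate many examinees
    counts[administer_with_exposure_control(ranked, k)] += 1
print({i: c / 10_000 for i, c in counts.items()})
# item_A's administration rate is held near 0.35 even though it is
# always the first choice of the selection algorithm.
```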
This report summarizes the results of two studies. The first study assessed the comparability of scores derived from linear computer-based (CBT) and computer adaptive (CAT) versions of the three GRE General Test measures. The verbal and quantitative CATs were found to produce scores comparable to their CBT counterparts; however, the analytical CAT produced scores judged not to be comparable to the analytical CBT scores. As a result, a second study was performed to examine the analytical measure further, to ascertain the extent of the lack of comparability, and to obtain statistics that would permit adjustments to restore comparability. Results of this additional study indicated that the differences between analytical CAT and CBT scores due to the testing paradigm were large enough to require an adjustment. Therefore, to enhance the comparability of analytical CAT and CBT scores, the analytical CAT was equated to the analytical CBT. This equating provided new analytical CAT conversions that resulted in comparable analytical CAT and CBT scores.
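The abstract does not spell out the equating method used. Purely as an illustration of what "equating the analytical CAT to the analytical CBT" involves, the Python sketch below applies a simple linear (mean-sigma) equating to hypothetical score samples to produce an adjusted conversion; any real adjustment would use the operational data and the equating design chosen in the study.

```python
import statistics

def linear_equating(cat_scores, cbt_scores):
    """Mean-sigma linear equating: map a CAT score y onto the CBT scale via
    x = A*y + B, where A matches the standard deviations and B the means."""
    a = statistics.stdev(cbt_scores) / statistics.stdev(cat_scores)
    b = statistics.mean(cbt_scores) - a * statistics.mean(cat_scores)
    return lambda y: a * y + b

# Hypothetical score samples from equivalent groups of examinees.
cat_sample = [46, 51, 53, 57, 60, 62, 66, 70]
cbt_sample = [48, 52, 55, 58, 61, 64, 67, 73]

convert = linear_equating(cat_sample, cbt_sample)
print(round(convert(60), 1))   # equated CBT-scale value for a CAT score of 60
```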