Score equating is essential for any testing program that continually produces new editions of a test and for which the expectation is that scores from these editions have the same meaning over time. Particularly in testing programs that help make high‐stakes decisions, it is extremely important that test equating be done carefully and accurately. An error in the equating function or score conversion can affect the scores for all examinees, which is both a fairness and a validity concern. Because the reported score is so visible, the credibility of a testing organization hinges on activities associated with producing, equating, and reporting scores. This paper addresses the practical implications of score equating by describing aspects of equating and best practices associated with the equating process.
The planned introduction of a computer-based Test of English as a Foreign Language (TOEFL) raises the concern that language proficiency will be confounded with computer proficiency, introducing construct-irrelevant variance into the measurement of examinees' English-language abilities. We administered a questionnaire focusing on examinees' computer familiarity to 90,000 TOEFL test takers. A group of 1,200 "low-computer-familiar" and "high-computer-familiar" examinees from 12 international sites then worked through a computer tutorial and a set of 60 computer-based TOEFL test tasks.
This report describes the extensive computer simulation work done in developing the computer adaptive versions of the Graduate Record Examinations (GRE) Board General Test and the College Board Admissions Testing Program (ATP) SAT. Both the GRE General and SAT computer adaptive tests (CATs) are fixed length and were developed from pools of items calibrated with the three-parameter logistic IRT model. Item selection was based on the recently developed weighted deviations algorithm (see Swanson and Stocking, 1992), which deals simultaneously with content, statistical, and other constraints in the item selection process. For the GRE General CATs (Verbal, Quantitative, and Analytical), item exposure was controlled with an extension of the approach originally developed by Sympson and Hetter (1988). For the SAT CATs (Verbal and Mathematical), item exposure was controlled with a less complex randomization approach. The lengths of the CATs were chosen so that CAT reliabilities matched or exceeded the reliabilities of the comparable full-length paper-and-pencil tests.
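The Sympson-Hetter approach referenced above is, at its core, a probabilistic filter applied after item selection. The following is a minimal sketch of that filter and of one simplified parameter-adjustment rule; the function names and adjustment details are illustrative assumptions, not the specific extension used for the GRE CATs.

```python
import random

def administer_with_exposure_control(ranked_items, k, rng=random):
    """Sympson-Hetter-style filter: walk the candidate items in order of
    desirability (e.g., a weighted-deviations ranking) and administer the
    first one that passes its exposure-control experiment."""
    for item in ranked_items:
        if rng.random() <= k.get(item, 1.0):   # k[item] lies in (0, 1]
            return item
    return ranked_items[-1]                    # fall back if every candidate fails

def adjust_exposure_parameters(selection_rate, target_rate, k):
    """One simplified adjustment pass: damp the administration probability of
    items whose selection rate in a simulation exceeds the target maximum
    exposure rate; leave the rest unconstrained."""
    return {item: (1.0 if p <= target_rate else target_rate / p)
            for item, p in selection_rate.items()}
```

In the operational procedure, the adjustment is iterated over repeated simulations until the maximum observed exposure rate falls below the target.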
The purpose of this instructional module is to provide the basis for understanding the process of score equating through the use of item response theory (IRT). A context is provided for addressing the merits of IRT equating methods. The mechanics of IRT equating and the need to place parameter estimates from separate calibration runs on the same scale are discussed. Some procedures for placing parameter estimates on a common scale are presented. In addition, IRT true‐score equating is discussed in some detail. A discussion of the practical advantages derived from IRT equating is offered at the end of the module.
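As a concrete illustration of the two steps the module discusses, the sketch below applies a mean/sigma transformation to place difficulty estimates from a new calibration on the base scale and then carries out 3PL true-score equating by root finding. The function names and the use of SciPy are assumptions made for illustration, not the module's own notation.

```python
import numpy as np
from scipy.optimize import brentq

def p3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic item characteristic curve."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def mean_sigma_link(b_common_new, b_common_old):
    """Mean/sigma linking coefficients estimated from the common items, so
    that theta_old = A * theta_new + B puts both calibrations on one scale."""
    A = np.std(b_common_old) / np.std(b_common_new)
    B = np.mean(b_common_old) - A * np.mean(b_common_new)
    return A, B

def true_score(theta, items):
    """Test characteristic curve: expected number-correct score at theta."""
    return sum(p3pl(theta, a, b, c) for a, b, c in items)

def irt_true_score_equate(tau_x, items_x, items_y):
    """Find the theta whose form-X true score equals tau_x, then return the
    form-Y true score at that theta. tau_x must exceed the sum of the form-X
    guessing parameters and fall below the number of items."""
    theta = brentq(lambda t: true_score(t, items_x) - tau_x, -8.0, 8.0)
    return true_score(theta, items_y)
```

With A and B in hand, new-form parameter estimates are rescaled in the usual way (a/A, A*b + B, c unchanged) before the true-score equating step is applied.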