Financial support from the Graduate Record Examinations Board and Educational Testing Service is gratefully acknowledged. Our thanks go to the many colleagues who made this research possible. The original plan for this research was designed by E. Elizabeth Stewart with the assistance of Madeline Wallmark and several other consultants. Many of the analyses were supervised or performed by Madeline Wallmark, Dorothy Thayer, and Craig Mills. We are especially grateful for the organizational and programming assistance of Louann Benton. We also thank Frederic Lord, Martha Stocking, and Marilyn Wingersky, with whom we consulted many times, as well as several colleagues who reviewed an earlier draft of this paper. We especially wish to thank E. Elizabeth Stewart for her insightful comments. Nonetheless, the opinions expressed herein are solely those of the authors.
ABSTRACT

The original purpose of this study was to address the test-disclosure-related need to introduce more Graduate Record Examinations (GRE) General Test editions each year than formerly, in a context of stable, or possibly declining, examinee volume. The legislative conditions that created this initial concern regarding test equating have abated. However, several of the test equating models considered in this research might provide other advantages to the GRE Program. These potential advantages are listed in the body of the report.

Equating can be considered to consist of three parts: (1) a data collection design, (2) an operational definition of the equating transformation, and (3) the specific statistical estimation techniques used to obtain the equating transformation. Currently, the GRE General Test collects data using an equivalent groups design. Typically, a linear equating method is used, and the specific estimation technique is setting means and standard deviations equal.

For this research, two other data collection designs were studied: nonrandom group with an external anchor test, and random group with a preoperational section. Both item response theory (IRT) and linear equating definitions were used. IRT true score equating was based on item statistics for the three-parameter logistic model as estimated using LOGIST. Linear models included section pre-equating using the EM algorithm, Tucker's observed score model, and several true score models developed by Tucker and Levine. For each of the three GRE measures (verbal, quantitative, and analytical), all equating methods were assessed for bias and root mean squared error by equating a test edition to itself through a chain with six equating links.

Bias and root mean squared error were extremely large for equating the verbal and analytical measures using section pre-equating or IRT equating with data based on the random group, preoperational section data collection design. For the quantitative measure, this data collection design produced a small amount of bias but a moderate amount of root mean squared error.

Using the nonrandom group, external anchor test data collection design, quantitative equatings had moderate amounts of bo...
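For the current operational method noted above (linear equating estimated by setting means and standard deviations equal), the transformation can be written in the usual mean-sigma form. The notation below is a standard textbook formulation supplied for illustration, not necessarily the exact expression used in this study:

$$
l_Y(x) \;=\; \mu_Y + \frac{\sigma_Y}{\sigma_X}\,\bigl(x - \mu_X\bigr),
$$

where $x$ is a raw score on the new form $X$; $\mu_X$, $\sigma_X$ and $\mu_Y$, $\sigma_Y$ are the score means and standard deviations of the two forms in the equivalent groups; and $l_Y(x)$ is the equated score on the scale of the old form $Y$.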
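The IRT equatings rest on the three-parameter logistic (3PL) model mentioned above. The standard form of that model, with the scaling constant 1.7 conventionally used with LOGIST, is shown here for reference; the parameter symbols are the customary ones and are not drawn from the report itself:

$$
P_i(\theta) \;=\; c_i + \frac{1 - c_i}{1 + \exp\!\bigl[-1.7\,a_i(\theta - b_i)\bigr]},
$$

where $P_i(\theta)$ is the probability that an examinee of ability $\theta$ answers item $i$ correctly, and $a_i$, $b_i$, and $c_i$ are the item's discrimination, difficulty, and lower-asymptote (guessing) parameters.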
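The self-equating criterion used to evaluate the methods (carrying a test edition through a chain of equating links back to itself and summarizing the discrepancy from the identity transformation) can be sketched as follows. This is a minimal illustration assuming linear links and an unweighted summary over score points; the study's analyses also involved IRT true score and other linear models and may have weighted errors differently. The function names are ours, not the report's.

import numpy as np

def linear_equate(scores, mu_x, sd_x, mu_y, sd_y):
    # Map scores from form X to the scale of form Y by matching
    # means and standard deviations (mean-sigma linear equating).
    return mu_y + (sd_y / sd_x) * (scores - mu_x)

def chain_self_equating_error(raw_scores, links):
    # Carry the raw scores of a form through a chain of equating links
    # that returns to the starting form, then summarize the discrepancy.
    # `links` is a list of (mu_x, sd_x, mu_y, sd_y) tuples, one per link;
    # with six links ending on the original form, a perfect set of
    # equatings would reproduce the identity transformation.
    converted = np.asarray(raw_scores, dtype=float)
    for mu_x, sd_x, mu_y, sd_y in links:
        converted = linear_equate(converted, mu_x, sd_x, mu_y, sd_y)
    errors = converted - raw_scores
    bias = errors.mean()                      # average signed error
    rmse = np.sqrt((errors ** 2).mean())      # root mean squared error
    return bias, rmse

A chain with little bias but substantial root mean squared error indicates equating error that averages out across score points but is large at individual scores, which is the pattern reported for the quantitative measure under the random group, preoperational section design.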