Financial support from the Graduate Record Examinations Board and Educational Testing Service is gratefully acknowledged. Our thanks go to the many colleagues who made this research possible. The original plan for this research was designed by E. Elizabeth Stewart with the assistance of Madeline Wallmark and several other consultants. Many of the analyses were supervised or performed by Madeline Wallmark, Dorothy Thayer, and Craig Mills. We are especially grateful for the organizational and programming assistance of Louann Benton. We also thank Frederic Lord, Martha Stocking, and Marilyn Wingersky, with whom we consulted many times, as well as several colleagues who reviewed an earlier draft of this paper. We especially wish to thank E. Elizabeth Stewart for her insightful comments. Nonetheless, the opinions expressed herein are solely those of the authors.
ABSTRACT

The original purpose of this study was to address the test-disclosure-related need to introduce more Graduate Record Examinations (GRE) General Test editions each year than formerly, in a context of stable, or possibly declining, examinee volume. The legislative conditions that created this initial concern regarding test equating have abated. However, several of the test equating models considered in this research might provide other advantages to the GRE Program. These potential advantages are listed in the body of the report.

Equating can be considered to consist of three parts: (1) a data collection design, (2) an operational definition of the equating transformation, and (3) the specific statistical estimation techniques used to obtain the equating transformation. Currently, the GRE General Test collects data using an equivalent groups design. Typically, a linear equating method is used, and the specific estimation technique is setting means and standard deviations equal.

For this research, two other data collection designs were studied: nonrandom group with an external anchor test, and random group with a preoperational section. Both item response theory (IRT) and linear equating definitions were used. IRT true score equating was based on item statistics for the three-parameter logistic model as estimated using LOGIST. Linear models included section pre-equating using the EM algorithm, Tucker's observed score model, and several true score models developed by Tucker and Levine. For each of the three GRE measures (verbal, quantitative, and analytical), all equating methods were assessed for bias and root mean squared error by equating a test edition to itself through a chain with six equating links.

Bias and root mean squared error were extremely large for equating the verbal and analytical measures using section pre-equating or IRT equating with data based on the random group, preoperational section data collection design. For the quantitative measure, this data collection design produced a small amount of bias but a moderate amount of root mean squared error.

Using the nonrandom group, external anchor test data collection design, quantitative equatings had moderate amounts of bo...
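For the current operational method noted above (linear equating estimated by setting means and standard deviations equal), the transformation can be written in the usual mean-sigma form. The notation below is a standard textbook formulation supplied for illustration, not necessarily the exact expression used in this study:

$$
l_Y(x) \;=\; \mu_Y + \frac{\sigma_Y}{\sigma_X}\,\bigl(x - \mu_X\bigr),
$$

where $x$ is a raw score on the new form $X$; $\mu_X$, $\sigma_X$ and $\mu_Y$, $\sigma_Y$ are the score means and standard deviations of the two forms in the equivalent groups; and $l_Y(x)$ is the equated score on the scale of the old form $Y$.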
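The IRT equatings rest on the three-parameter logistic (3PL) model mentioned above. The standard form of that model, with the scaling constant 1.7 conventionally used with LOGIST, is shown here for reference; the parameter symbols are the customary ones and are not drawn from the report itself:

$$
P_i(\theta) \;=\; c_i + \frac{1 - c_i}{1 + \exp\!\bigl[-1.7\,a_i(\theta - b_i)\bigr]},
$$

where $P_i(\theta)$ is the probability that an examinee of ability $\theta$ answers item $i$ correctly, and $a_i$, $b_i$, and $c_i$ are the item's discrimination, difficulty, and lower-asymptote (guessing) parameters.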
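The self-equating criterion used to evaluate the methods (carrying a test edition through a chain of equating links back to itself and summarizing the discrepancy from the identity transformation) can be sketched as follows. This is a minimal illustration assuming linear links and an unweighted summary over score points; the study's analyses also involved IRT true score and other linear models and may have weighted errors differently. The function names are ours, not the report's.

import numpy as np

def linear_equate(scores, mu_x, sd_x, mu_y, sd_y):
    # Map scores from form X to the scale of form Y by matching
    # means and standard deviations (mean-sigma linear equating).
    return mu_y + (sd_y / sd_x) * (scores - mu_x)

def chain_self_equating_error(raw_scores, links):
    # Carry the raw scores of a form through a chain of equating links
    # that returns to the starting form, then summarize the discrepancy.
    # `links` is a list of (mu_x, sd_x, mu_y, sd_y) tuples, one per link;
    # with six links ending on the original form, a perfect set of
    # equatings would reproduce the identity transformation.
    converted = np.asarray(raw_scores, dtype=float)
    for mu_x, sd_x, mu_y, sd_y in links:
        converted = linear_equate(converted, mu_x, sd_x, mu_y, sd_y)
    errors = converted - raw_scores
    bias = errors.mean()                      # average signed error
    rmse = np.sqrt((errors ** 2).mean())      # root mean squared error
    return bias, rmse

A chain with little bias but substantial root mean squared error indicates equating error that averages out across score points but is large at individual scores, which is the pattern reported for the quantitative measure under the random group, preoperational section design.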