The purpose of this study was to compare and evaluate five on-line pretest item-calibration/scaling methods in computerized adaptive testing (CAT): marginal maximum likelihood estimation with one EM cycle (OEM), marginal maximum likelihood estimation with multiple EM cycles (MEM), Stocking's Method A, Stocking's Method B, and BILOG/Prior. The five methods were evaluated in terms of item-parameter recovery, using three different sample sizes (300, 1,000, and 3,000). The MEM method appeared to be the best choice among these because it produced the smallest parameter-estimation errors for all sample-size conditions. Because the MEM and OEM methods are mathematically similar and OEM produced larger errors, MEM was preferable to OEM unless the amount of time required for iterative computation is a concern. Stocking's Method B also worked very well, but it required anchor items that would either increase test lengths or require larger sample sizes, depending on the test administration design. Until more appropriate ways of handling sparse data are devised, the BILOG/Prior method may not be a reasonable choice for small sample sizes. Stocking's Method A had the largest weighted total error, as well as a theoretical weakness (i.e., treating estimated ability as true ability); thus, there appeared to be little reason to use it.
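As a rough illustration of how item-parameter recovery might be summarized in a study like this, the sketch below computes the root mean squared error between generating and estimated item parameters. The function names and the use of plain, unweighted RMSE are assumptions for illustration only; they do not reproduce the paper's weighted total error criterion.

```python
import numpy as np

def parameter_recovery_rmse(true_params, est_params):
    """RMSE between generating and estimated item parameters.

    true_params, est_params: arrays of shape (n_items, 3) holding the
    a (discrimination), b (difficulty), and c (guessing) parameters.
    Returns one RMSE per parameter type.
    """
    diff = np.asarray(est_params) - np.asarray(true_params)
    return np.sqrt(np.mean(diff ** 2, axis=0))  # RMSE for a, b, c

# Hypothetical usage: compare recovery under one calibration sample-size condition.
true_items = np.column_stack([np.ones(20), np.linspace(-2, 2, 20), np.full(20, 0.2)])
noisy_estimates = true_items + np.random.default_rng(0).normal(0, 0.1, true_items.shape)
print(parameter_recovery_rmse(true_items, noisy_estimates))
```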
This module discusses the 1-, 2-, and 3-parameter logistic item response theory models. Mathematical formulas are given for each model, and comparisons among the three models are made. Figures are included to illustrate the effects of changing the a, b, or c parameter, and a single data set is used to illustrate the effects of using estimated parameter values (as opposed to the true parameter values) and to compare parameter estimates obtained through applying the different models. The estimation procedure itself is discussed briefly. Discussions of model assumptions, such as dimensionality and local independence, can be found in many of the annotated references (e.g., Hambleton, 1988).
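For reference, the three models nest within the three-parameter logistic form, P(theta) = c + (1 - c) / (1 + exp(-a(theta - b))); setting c = 0 gives the 2PL, and additionally fixing a gives the 1PL. The sketch below evaluates these item characteristic curves (the optional D = 1.7 scaling constant is omitted, and fixing a = 1 for the 1PL is one common convention).

```python
import numpy as np

def irt_prob(theta, a=1.0, b=0.0, c=0.0):
    """Probability of a correct response under the 3PL model:
    P(theta) = c + (1 - c) / (1 + exp(-a * (theta - b))).
    c = 0 gives the 2PL; c = 0 with a fixed (e.g., a = 1) gives the 1PL.
    """
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)
print(irt_prob(theta))                        # 1PL-style item (a fixed at 1)
print(irt_prob(theta, a=1.5, b=0.5))          # 2PL item with higher discrimination
print(irt_prob(theta, a=1.5, b=0.5, c=0.2))   # 3PL item with a guessing floor
```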
This paper examines the applicability of traditional, bootstrap, and jackknife methodologies for estimating standard errors and obtaining confidence intervals for the variance components for persons, items, and residuals in a random-effects G study p × i design. Principal consideration is given to simulation results with binary data, although some simulation results for normally distributed data are also reported. The simulations suggest that the traditional approach produces accurate results with normally distributed data but poor results with binary data, at least for the residual variance component. The jackknife provides quite accurate results for both types of data and for all three variance components. The bootstrap can be "made to work" reasonably well, but doing so seems to require several ad hoc procedures for defining bootstrap samples, which renders the bootstrap somewhat less satisfactory than the jackknife for the application considered here.
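A minimal sketch of the underlying estimation for a random-effects p × i design is given below: variance components from the usual ANOVA expected mean squares, plus a simple delete-one-person jackknife for the standard error of the person component. The delete-one-person simplification is an assumption for illustration; the jackknife and bootstrap procedures studied in the paper for two-facet designs are more involved.

```python
import numpy as np

def variance_components(scores):
    """ANOVA-based variance component estimates for a p x i random-effects design.

    scores: (n_persons, n_items) array, e.g., 0/1 item responses.
    Returns (sigma2_p, sigma2_i, sigma2_res) from the expected mean squares.
    """
    n_p, n_i = scores.shape
    grand = scores.mean()
    person_means = scores.mean(axis=1)
    item_means = scores.mean(axis=0)
    ms_p = n_i * np.sum((person_means - grand) ** 2) / (n_p - 1)
    ms_i = n_p * np.sum((item_means - grand) ** 2) / (n_i - 1)
    resid = scores - person_means[:, None] - item_means[None, :] + grand
    ms_res = np.sum(resid ** 2) / ((n_p - 1) * (n_i - 1))
    return (ms_p - ms_res) / n_i, (ms_i - ms_res) / n_p, ms_res

def jackknife_se_person(scores):
    """Delete-one-person jackknife standard error for the person variance component."""
    n_p = scores.shape[0]
    estimates = np.array([variance_components(np.delete(scores, j, axis=0))[0]
                          for j in range(n_p)])
    return np.sqrt((n_p - 1) / n_p * np.sum((estimates - estimates.mean()) ** 2))

# Hypothetical usage with simulated binary data that has person and item effects.
rng = np.random.default_rng(1)
person_effect = rng.normal(0, 1, size=(50, 1))
item_effect = rng.normal(0, 1, size=(1, 20))
data = (rng.random((50, 20)) < 1 / (1 + np.exp(-(person_effect - item_effect)))).astype(float)
print(variance_components(data))
print(jackknife_se_person(data))
```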