Preprint (2021)
DOI: 10.31234/osf.io/pkjth
Reflections on Analytical Choices in the Scaling Model for Test Scores in International Large-Scale Assessment Studies

Abstract: International large-scale assessments (LSAs) such as the Programme for International Student Assessment (PISA) provide important information about the distribution of student proficiencies across a wide range of countries. The repeated assessments of these content domains offer policymakers important information for evaluating educational reforms and receive considerable attention from the media. Furthermore, the analytical strategies employed in LSAs often define methodological standards for applied research…

Cited by 12 publications (36 citation statements); references 94 publications.
“…We believe that the call for controlling for test-taking behavior in the reporting in large-scale assessment studies such as response propensity [3] using models that also include response times [87,88] poses a threat to validity because results can be simply manipulated by instructing students to omit items they do not know [20]. Notably, missing item responses are mostly omissions for CR items.…”
Section: Discussion
confidence: 99%
“…In the literature, it is frequently argued that missing item responses should never be scored as incorrect [3,7,11,27]. However, we think that the arguments against the incorrect scoring are flawed, and simulation studies cannot show the inadequacy of the UW model (see [19][20][21]).…”
Section: Scoring Missing Item Responses as Wrong
confidence: 99%
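To make the contrast in this quotation concrete, here is a minimal sketch (not the authors' code; the toy response matrix is invented for illustration) comparing the two treatments of omitted responses at the level of proportion-correct scores rather than a full IRT scaling: scoring omissions as incorrect versus ignoring them among the observed responses.

```python
# Minimal sketch: two treatments of omitted item responses.
# Illustrative data only; not taken from the preprint.
import numpy as np

rng = np.random.default_rng(0)
n_persons, n_items = 5, 8

# 1 = correct, 0 = incorrect, np.nan = omitted response
responses = rng.integers(0, 2, size=(n_persons, n_items)).astype(float)
responses[rng.random((n_persons, n_items)) < 0.2] = np.nan  # inject omissions

# Treatment A: score missing responses as incorrect.
scored_wrong = np.nan_to_num(responses, nan=0.0)
p_correct_wrong = scored_wrong.mean(axis=1)

# Treatment B: ignore missing responses (proportion correct among observed).
p_correct_ignored = np.nanmean(responses, axis=1)

for person in range(n_persons):
    print(f"person {person}: scored-as-wrong = {p_correct_wrong[person]:.2f}, "
          f"ignored = {p_correct_ignored[person]:.2f}")
```

Under Treatment B, omitting an item the student would have answered incorrectly raises the score, while under Treatment A it leaves the score unchanged; this asymmetry is the manipulation concern raised in the quotation above.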
“…Probably in the largest part of the literature, DIF effects are considered as fixed (e.g., Kopf et al 2015b). In this case, the condition for balanced DIF replaces the expected value by the mean associated with the fixed item parameters (Robitzsch and Lüdtke 2021a). There is no additional uncertainty introduced in the estimation of group differences with fixed DIF effects because the item parameters are held fixed in repeated sampling.…”
Section: Differential Item Functioning
confidence: 99%
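A short formal sketch of the distinction drawn in this quotation, in notation of my own choosing (the symbols e_i and I are not taken from the preprint): with random DIF effects the balanced-DIF condition is stated as an expectation, whereas with fixed item parameters it is replaced by the mean over the I items.

```latex
% Random-DIF formulation: item-specific DIF effects e_i are random draws,
% and balanced DIF is a zero-expectation condition.
\[ \operatorname{E}(e_i) = 0 \quad \text{(random DIF effects)} \]
% Fixed-DIF formulation: the e_i are fixed parameters of the I items,
% and the expectation is replaced by their mean.
\[ \frac{1}{I} \sum_{i=1}^{I} e_i = 0 \quad \text{(fixed DIF effects)} \]
```

Because fixed DIF effects do not vary over repeated sampling, they contribute no additional uncertainty to the estimated group difference, which is the point made in the quotation.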