About Still Nonignorable Consequences of (Partially) Ignoring Missing Item Responses in Large-scale Assessment

Robitzsch, Alexander

doi:10.31219/osf.io/hmy45

Cited by 10 publications

(24 citation statements)

References 38 publications

(61 reference statements)

“…In our opinion, the possibility to influence students' test-taking behavior poses severe threats to the validity and fairness of country comparisons. Furthermore, in our research with LSA data, we found that the conditional independence assumptions of item responses and response indicators in the SA+O model are strongly violated, resulting in a worse model fit of the SA+O model (see Robitzsch, 2020). There is empirical evidence that students who do not know the answer to an item have a high probability of omitting this item even after controlling for latent variables.…”

Section: The Role Of Test-taking Behavior In the Scaling Model (mentioning)

confidence: 77%

Reflections on Analytical Choices in the Scaling Model for Test Scores in International Large-Scale Assessment Studies

Robitzsch, Lüdtke

2021

Preprint

Self Cite

International large-scale assessments (LSAs) such as the Programme for International Student Assessment (PISA) provide important information about the distribution of student proficiencies across a wide range of countries. The repeated assessments of these content domains offer policymakers important information for evaluating educational reforms and receive considerable attention from the media. Furthermore, the analytical strategies employed in LSAs often define methodological standards for applied researchers in the field. Hence, it is vital to reflect critically on the conceptual foundations of analytical choices in LSA studies. This article discusses methodological challenges in selecting and specifying the scaling model used to obtain proficiency estimates from the individual student responses in LSA studies. We distinguish design-based inference from model-based inference. It is argued that for the official reporting of LSA results, design-based inference should be preferred because it allows for a clear definition of the target of inference (e.g., country mean achievement) and is less sensitive to specific modeling assumptions. More specifically, we discuss five analytical choices in the specification of the scaling model: (1) the specification of the functional form of item response functions, (2) the treatment of local dependencies and multidimensionality, (3) the consideration of test-taking behavior for estimating student ability, and the role of country differential item functioning (DIF) for (4) cross-country comparisons and (5) trend estimation. This article's primary goal is to stimulate discussion about recently implemented changes and suggested refinements of the scaling models in LSA studies.

“…In the literature, it is frequently argued that missing item responses should never be scored as incorrect [3,7,11,27]. However, we think that the arguments against the incorrect scoring are flawed, and simulation studies cannot show the inadequacy of the UW model (see [19][20][21]).…”

Section: Scoring Missing Item Responses As Wrong (mentioning)

confidence: 99%

“…In this model, the probability of responding to an item depends on the latent response propensity ξ_p and the item response X_pi itself (see [18,19,30,56,57,58]). Model MM1 is defined by assuming a common δ_i parameter for all items.…”

Section: Mislevy-Wu Model For Nonignorable Item Responses (mentioning)

confidence: 99%

“…As an alternative, multiple imputation at the level of items can be employed to handle missing item responses properly [16,17]. However, the scoring of missing item responses as wrong has been defended for validity reasons [18][19][20]. Moreover, simulation studies cannot inform about the proper treatment of missing item responses [19,21].…”

Section: Introduction (mentioning)

confidence: 99%

“…However, the scoring of missing item responses as wrong has been defended for validity reasons [18][19][20]. Moreover, simulation studies cannot inform about the proper treatment of missing item responses [19,21].…”

Section: Introduction (mentioning)

confidence: 99%

On the Treatment of Missing Item Responses in Educational Large-scale Assessment Data: The Case of PISA 2018 Mathematics

Robitzsch

2021

Preprint

Self Cite

Missing item responses are prevalent in educational large-scale assessment studies like the Programme for International Student Assessment (PISA). The current operational practice scores missing item responses as wrong, but several psychometricians have advocated a model-based treatment based on a latent ignorability assumption. In this approach, item responses and response indicators are jointly modeled conditional on a latent ability and a latent response propensity variable. Alternatively, imputation-based approaches can be used. The latent ignorability assumption is weakened in the Mislevy-Wu model, which characterizes a nonignorable missingness mechanism and allows the missingness of an item to depend on the item itself. The scoring of missing item responses as wrong and the latent ignorable model are submodels of the Mislevy-Wu model. This article uses the PISA 2018 mathematics dataset to investigate the consequences of different missing data treatments on country means. The obtained country means can differ substantially across the scaling models. In contrast to previous statements in the literature, scoring missing item responses as incorrect provided a better model fit than a latent ignorable model for most countries. Furthermore, the dependence of the missingness of an item on the item itself, after conditioning on the latent response propensity, was much more pronounced for constructed-response items than for multiple-choice items. As a consequence, scaling models that presuppose latent ignorability should be rejected from two perspectives. First, the Mislevy-Wu model is preferred over the latent ignorable model for reasons of model fit. Second, we argue that model fit should only play a minor role in choosing psychometric models in large-scale assessment studies because validity aspects are most relevant. Missing data treatments that countries (and, hence, their students) can simply manipulate result in unfair country comparisons.
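The missingness mechanism described in this abstract can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the model as specified or estimated in the paper: the logistic form, the sign convention, the function name `simulate_item`, the values of `b` and `beta`, and the independence of ability and response propensity are all hypothetical choices made for illustration. With `delta = 0`, missingness depends only on the response propensity (a latent ignorable mechanism); with a strongly negative `delta`, essentially only incorrect responses go missing, which mimics the situation in which scoring missing responses as wrong is appropriate.

```python
import math
import random


def logistic(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_item(n_persons, delta, seed=0):
    """Simulate one item and return the missing rates among persons whose
    (possibly unobserved) response is correct and among those whose
    response is incorrect, as a tuple (miss_correct, miss_incorrect).

    All parameter values below are hypothetical illustration values.
    """
    rng = random.Random(seed)
    b = 0.0      # assumed item difficulty
    beta = -1.0  # assumed item-specific missingness threshold
    counts = {0: [0, 0], 1: [0, 0]}  # response x -> [n_missing, n_total]
    for _ in range(n_persons):
        theta = rng.gauss(0.0, 1.0)  # latent ability
        xi = rng.gauss(0.0, 1.0)     # latent response propensity (drawn independently here)
        x = 1 if rng.random() < logistic(theta - b) else 0
        # The probability of responding depends on xi and on the response x itself.
        p_respond = logistic(xi - beta - delta * x)
        counts[x][1] += 1
        if rng.random() >= p_respond:
            counts[x][0] += 1
    return counts[1][0] / counts[1][1], counts[0][0] / counts[0][1]


if __name__ == "__main__":
    for delta in (0.0, -10.0):
        miss_correct, miss_incorrect = simulate_item(20000, delta)
        print(f"delta={delta}: missing | correct = {miss_correct:.3f}, "
              f"missing | incorrect = {miss_incorrect:.3f}")
```

In this sketch, the two missing rates should be close to each other for `delta = 0` and diverge sharply for `delta = -10`, which is the sense in which the missingness of an item can depend on the item itself.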

What modulates the acquisition of difficult structures in a heritage language? A study on Portuguese in contact with French, German and Italian

2022

Several studies on heritage language (HL) acquisition investigate a single linguistic structure, showing how language exposure or cross-linguistic effects affect its acquisition. Here, we consider HL speaking children's mastery of several linguistic structures using a cloze-test. We examine how their language competence is affected by language exposure variables and age. We tested 180 children between the ages of 8 and 16, living in Switzerland and speaking European Portuguese as HL and French, German or Italian as their societal language. The items of the cloze-test cluster around two levels of difficulty, with the items at the second level corresponding to structures that are acquired late in Portuguese monolingual acquisition. Older age and a greater amount of formal instruction in the HL lead to better performance. The role of the amount of formal instruction varies based on the level of difficulty of the target structures. Cross-linguistic influence does not affect the results.
