Impact of IRT item misfit on score estimates and severity classifications: an examination of PROMIS depression and pain interference item banks

Zhao, Yue

doi:10.1007/s11136-016-1467-3

Cited by 12 publications

(13 citation statements)

References 28 publications


“…We further examined the consequence of item misfit on the item and person parameter estimates and found that either including or excluding the nine items yielded nearly identical results. Therefore, as we considered the consequence minor and the misfit tolerable [50], we included all items in the outcome score linking.…”

Section: Results (mentioning)

confidence: 99%

“…Such a practice deserves more attention, and it is strongly encouraged in studies on linking PRO measures to ensure the validity of the inferences drawn from the score concordances. Finally, instead of relying solely on chi-square-like IRT fit statistics, which can be sensitive to sample size, we evaluated IRT item misfit by focusing on the consequences of using misfitting items and item statistics associated with them, a strategy strongly recommended by Hambleton and Han [65] and Zhao [50]. We hope that future studies adopting a rigorous approach to addressing methodological issues are encouraged in order to promote the quality of PRO research and to ensure the appropriate application of IRT models.…”

Section: Discussion (mentioning)

confidence: 99%


Comparing five depression measures in depressed Chinese patients using item response theory: an examination of item properties, measurement precision and score comparability

Zhao, Chan (2017). Health Qual Life Outcomes. Self Cite


Background: Item response theory (IRT) has been increasingly applied to patient-reported outcome (PRO) measures. The purpose of this study is to apply IRT to examine item properties (discrimination and severity of depressive symptoms), measurement precision and score comparability across five depression measures, which is the first study of its kind in the Chinese context.

Methods: A clinical sample of 207 Hong Kong Chinese outpatients was recruited. Data analyses were performed including classical item analysis, IRT concurrent calibration and IRT true score equating. The IRT assumptions of unidimensionality and local independence were tested respectively using confirmatory factor analysis and chi-square statistics. The IRT linking assumptions of construct similarity, equity and subgroup invariance were also tested. The graded response model was applied to concurrently calibrate all five depression measures in a single IRT run, resulting in the item parameter estimates of these measures being placed onto a single common metric. IRT true score equating was implemented to perform the outcome score linking and construct score concordances so as to link scores from one measure to corresponding scores on another measure for direct comparability.

Results: Findings suggested that (a) symptoms on depressed mood, suicidality and feeling of worthlessness served as the strongest discriminating indicators, and symptoms concerning suicidality, changes in appetite, depressed mood, feeling of worthlessness and psychomotor agitation or retardation reflected high levels of severity in the clinical sample. (b) The five depression measures contributed to various degrees of measurement precision at varied levels of depression. (c) After outcome score linking was performed across the five measures, the cut-off scores led to either consistent or discrepant diagnoses for depression.

Conclusions: The study provides additional evidence regarding the psychometric properties and clinical utility of the five depression measures, offers methodological contributions to the appropriate use of IRT in PRO measures, and helps elucidate cultural variation in depressive symptomatology. The approach of concurrently calibrating and linking multiple PRO measures can be applied to the assessment of PROs other than the depression context.
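As an illustration of the graded response model and IRT true score equating named in the abstract above, the following is a minimal sketch and not the authors' implementation. It assumes two measures whose items are already on a common metric; the item parameters, the names `measure_a` and `measure_b`, and the bisection search range are all hypothetical.

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Category probabilities for one item under Samejima's graded response model.

    `a` is the discrimination and `thresholds` the ordered severity parameters.
    """
    # Cumulative probabilities P(X >= k), bracketed by 1 (lowest) and 0 (beyond highest).
    cum = [1.0]
    cum += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cum += [0.0]
    # Each category probability is the difference of adjacent cumulative curves.
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]

def true_score(theta, items):
    """Expected summed score at theta (the test characteristic curve)."""
    total = 0.0
    for a, thresholds in items:
        probs = grm_category_probs(theta, a, thresholds)
        total += sum(k * p for k, p in enumerate(probs))
    return total

def theta_for_true_score(target, items, lo=-6.0, hi=6.0, tol=1e-8):
    """Find theta whose true score equals `target` by bisection.

    Works because the true score increases with theta when all a > 0.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if true_score(mid, items) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical item parameters (discrimination, thresholds); not from the study.
measure_a = [(1.8, [-1.0, 0.0, 1.2]), (1.2, [-0.5, 0.8, 2.0])]
measure_b = [(2.1, [-0.8, 0.4, 1.5]), (0.9, [-1.5, 0.2, 1.9]), (1.5, [0.0, 1.0, 2.2])]

def link_score(score_a):
    """Map a summed score on measure A to the equivalent true score on measure B."""
    theta = theta_for_true_score(score_a, measure_a)
    return true_score(theta, measure_b)

if __name__ == "__main__":
    for s in (1, 2, 3, 4, 5):
        print(s, round(link_score(s), 2))
```

True score equating is only defined for scores strictly between the minimum and maximum possible true scores, so a concordance table built this way would need separate handling for the extreme summed scores.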


“…The authors reported that violating this assumption had little effect on the calculation of these estimates, but the presence of multidimensionality in the data affected the precision of the estimates. In a clinical setting, Zhao (2017) evaluated the impact of item-level misfit on estimates of the severity of respondents’ depression and estimates of the intensity of respondents’ pain levels, as well as respondents’ classifications within clinical categories derived from these estimates. Zhao observed that item misfit did not have substantial practical consequences that affected estimates of respondents’ locations on the latent variable and classification within clinical categories.…”

Section: Evaluating the Practical Consequences Of The Violation Of It (mentioning)

confidence: 99%

Examining the Impacts of Rater Effects in Performance Assessments

Wind (2018). Applied Psychological Measurement


Rater effects such as severity, centrality, and misfit are recurrent concerns in performance assessments. Despite their persistence in operational assessment settings and frequent discussion in research, researchers have not fully explored the impacts of rater effects as they relate to estimates of student achievement. The purpose of this study is to explore the impacts of rater severity, centrality, and misfit on student achievement estimates and on classification decisions. The results suggest that these three types of rater effects have substantial impacts on estimates of student achievement and on classification decisions that impact the fairness of rater-mediated assessments. Accordingly, it is essential that researchers and practitioners evaluate ratings across all stages of rater-mediated assessment procedures, including rater training and operational scoring.


“…Sinharay and Haberman (2014) studied practical significance of model misfit with various empirical data sets and concluded that the misfit was not always practically significant though evidence of misfit for a substantial number of items was demonstrated. Zhao (2017) investigated the practical impact of item misfit with Patient-Reported Outcome Measurement Information System (PROMIS) depression and pain interference item banks, and suggested that item misfit had a negligible impact on score estimates and severity classifications with the studied sample. Meijer and Tendeiro (2015) analyzed two empirical data sets and examined the effect of removing misfitting items and misfitting item score patterns on the rank order of test takers according to their proficiency level score, and found that the impact of removing misfitting items and item score patterns varied depending on the IRT model applied.…”

Section: Introduction (mentioning)

confidence: 99%

Practical Consequences of Item Response Theory Model Misfit in the Context of Test Equating with Mixed-Format Test Data

Zhao, Hambleton (2017). Front. Psychol. Self Cite


In item response theory (IRT) models, assessing model-data fit is an essential step in IRT calibration. While no general agreement has ever been reached on the best methods or approaches to use for detecting misfit, perhaps the more important comment based upon the research findings is that rarely does the research evaluate IRT misfit by focusing on the practical consequences of misfit. The study investigated the practical consequences of IRT model misfit in examining the equating performance and the classification of examinees into performance categories in a simulation study that mimics a typical large-scale statewide assessment program with mixed-format test data. The simulation study was implemented by varying three factors, including choice of IRT model, amount of growth/change of examinees’ abilities between two adjacent administration years, and choice of IRT scaling methods. Findings indicated that the extent of significant consequences of model misfit varied over the choice of model and IRT scaling methods. In comparison with mean/sigma (MS) and Stocking and Lord characteristic curve (SL) methods, separate calibration with linking and fixed common item parameter (FCIP) procedure was more sensitive to model misfit and more robust against various amounts of ability shifts between two adjacent administrations regardless of model fit. SL was generally the least sensitive to model misfit in recovering equating conversion and MS was the least robust against ability shifts in recovering the equating conversion when a substantial degree of misfit was present. The key messages from the study are that practical ways are available to study model fit, and, model fit or misfit can have consequences that should be considered when choosing an IRT model. 
Not only does the study address the consequences of IRT model misfit, but also it is our hope to help researchers and practitioners find practical ways to study model fit and to investigate the validity of particular IRT models for achieving a specified purpose, to assure that the successful use of the IRT models are realized, and to improve the applications of IRT models with educational and psychological test data.
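For readers unfamiliar with the mean/sigma (MS) scaling method mentioned in the abstract above, here is a minimal sketch of the textbook procedure, not the code used in the study. It assumes a set of common items with difficulty estimates on both a new and a base scale; the function names and example values are hypothetical.

```python
import statistics

def mean_sigma_constants(b_new, b_base):
    """Slope A and intercept B that place the new scale onto the base scale.

    Both lists hold difficulty estimates for the same common items, in the same order.
    """
    A = statistics.pstdev(b_base) / statistics.pstdev(b_new)
    B = statistics.mean(b_base) - A * statistics.mean(b_new)
    return A, B

def rescale_item(a, b, A, B):
    """Transform one item's discrimination and difficulty to the base scale."""
    return a / A, A * b + B

def rescale_theta(theta, A, B):
    """Transform an ability estimate to the base scale."""
    return A * theta + B

if __name__ == "__main__":
    # Hypothetical common-item difficulties from two separate calibrations.
    b_new = [-1.2, -0.3, 0.4, 1.1]
    b_base = [-0.9, 0.1, 0.8, 1.7]
    A, B = mean_sigma_constants(b_new, b_base)
    print(A, B, rescale_item(1.4, 0.4, A, B))
```

Because MS uses only the means and standard deviations of the difficulty estimates, it ignores the discrimination parameters when deriving A and B, whereas characteristic curve methods such as Stocking and Lord use all item parameters at once.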
