Screening measures are used in psychology and medicine to identify respondents who are high or low on a construct. Based on the screening, the evaluator assigns respondents to classes corresponding to different courses of action: Make a diagnosis versus reject a diagnosis; provide services versus withhold services; or conduct further assessment versus conclude the assessment process. When measures are used to classify individuals, it is important that the decisions be consistent and equitable across groups. Ideally, if respondents completed the screening measure repeatedly in quick succession, they would be consistently assigned into the same class each time. In addition, the consistency of the classification should be unrelated to the respondents' background characteristics, such as sex, race, or ethnicity (i.e., the measure is free of measurement bias). Reporting estimates of classification consistency is a common practice in educational testing, but there has been limited application of these estimates to screening in psychology and medicine. In this article, we present two procedures based on item response theory that are used (a) to estimate the classification consistency of a screening measure and (b) to evaluate how classification consistency is impacted by measurement bias across respondent groups. We provide R functions to conduct the procedures, illustrate the procedures with real data, and use Monte Carlo simulations to guide their appropriate use. Finally, we discuss how estimates of classification consistency can help assessment specialists make more informed decisions on the use of a screening measure with protected groups (e.g., groups defined by gender, race, or ethnicity).
Public Significance StatementIdeally, measures used to classify individuals in psychology and medicine (e.g., as depressed vs. not depressed) produce consistent decisions that do not depend on extraneous features of the individuals (e.g., sex, race, or ethnicity). We provide methods to estimate how consistent the decisions based on a measure are and study whether the consistency varies across groups (e.g., sex, race, or ethnicity).