2001
DOI: 10.3102/10769986026003283

The Rater Bundle Model

Abstract: In this article an item response model is introduced for repeated ratings of student work, which we have called the Rater Bundle Model (RBM). Development of this model was motivated by the observation that when repeated ratings occur, the assumption of conditional independence is violated, and hence current state-of-the-art item response models, such as the rater facets model, that ignore this violation underestimate measurement error and overestimate reliability. In the rater bundle model these dependencies …
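The dependence problem the abstract describes can be made concrete with a small simulation. The sketch below is not the authors' RBM; it is a minimal Python illustration with invented variance components, showing why two ratings that share a response-specific error remain correlated even after conditioning on ability, so that a Spearman-Brown calculation assuming independent ratings overstates score reliability.

import numpy as np

# Minimal sketch (hypothetical variances): two raters score the same response.
rng = np.random.default_rng(0)
n = 100_000

theta = rng.normal(0.0, 1.0, n)   # latent examinee ability
u = rng.normal(0.0, 0.7, n)       # response-specific error, shared by both raters
e1 = rng.normal(0.0, 0.8, n)      # rater 1's idiosyncratic error
e2 = rng.normal(0.0, 0.8, n)      # rater 2's idiosyncratic error

x1 = theta + u + e1               # rater 1's score
x2 = theta + u + e2               # rater 2's score

r12 = np.corrcoef(x1, x2)[0, 1]                        # inter-rater correlation
naive = 2 * r12 / (1 + r12)                            # Spearman-Brown, independence assumed
actual = np.corrcoef(theta, (x1 + x2) / 2)[0, 1] ** 2  # reliability of the averaged score

print(f"inter-rater r      : {r12:.3f}")
print(f"naive reliability  : {naive:.3f}")
print(f"actual reliability : {actual:.3f}")

With these invented variances the naive Spearman-Brown figure comes out near 0.82, while the reliability of the averaged score is roughly 0.55: exactly the pattern of overestimated reliability (and hence underestimated measurement error) that motivates the RBM.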

Cited by 51 publications (59 citation statements)
References 20 publications

Citation statements, ordered by relevance:
“…Rater severity effects have been included in some measurement models and studied by DeCarlo, Kim, and Johnson (2011); Donoghue, McClellan, and Gladkova (2006); Engelhard (2002); Longford (1995); Patz, Junker, Johnson, and Mariano (2002); Wilson and Hoskens (2001); and Wolfe and Myford (1997). The results from their studies indicate that the bias or systematic error may be caused by varying degrees of rater leniency or strictness.…”
mentioning, confidence: 99%
“…There are a few research studies incorporating rater severity effects into item response theory (IRT) models (Donoghue et al., 2006; Engelhard, 1996; Patz, 1997; Patz & Junker, 1999; Wilson & Hoskens, 2001). The FACETS model (Linacre, 1991) is an IRT model that allows for the estimation of differences in severity between raters, and thus eliminates rater bias from the estimates of item parameters and examinees' ability.…”
mentioning, confidence: 99%
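For context on the excerpt above: the many-facet Rasch (FACETS) model is commonly written in adjacent-categories logit form. The notation below follows standard presentations of the model rather than any one of the citing papers:

\log \frac{P_{nijk}}{P_{nij(k-1)}} = \theta_n - \delta_i - \lambda_j - \tau_k

where \theta_n is the ability of examinee n, \delta_i the difficulty of item i, \lambda_j the severity of rater j, and \tau_k the threshold for rating category k. The rater enters only as a fixed severity shift, so the model adjusts for lenient or strict raters but still treats repeated ratings of the same response as conditionally independent, which is the assumption the Rater Bundle Model relaxes.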
“…Viewed more generally, methods developed in this article extend existing statistical methodology for the analysis of multirater ordinal data (4–7) and item response data (8–12) to provide a framework for the analysis of panel rating data collected by using the Delphi method and related interactive rating schemes (13).…”
mentioning, confidence: 99%
“…For example, the test specifications might call for a task of moderate difficulty at site S, with the constraint that the task was not previously exposed at site S. The parameters from the across-site, task-only model might be interpreted as preliminary estimates, to be used for initial task selection. These initial parameter estimates would then be updated using the within-site, task-rater model after the task was used for a period of time at site S. Parameter estimates for task-raters can be periodically evaluated to examine rater leniency and discrimination parameters over time, for example, to delineate rater drift (Wilson & Hoskens, 2001; Harik et al., 2009), to evaluate the accuracy of equating processes, and to identify gaps in the task pool. Availability of rater-generic and rater-specific task and test statistics would help monitor rater performance and spot problematic raters.…”
Section: Discussion
mentioning, confidence: 99%