A meta-analysis of 111 interrater reliability coefficients and 49 coefficient alphas from selection interviews was conducted. Moderators of interrater reliability included study design, interviewer training, and 3 dimensions of interview structure (standardization of questions, of response evaluation, and of combining multiple ratings). Interactions showed that standardizing questions had a stronger moderating effect on reliability when coefficients were from separate (rather than panel) interviews, and multiple ratings were useful when combined mechanically (there was no evidence of usefulness when combined subjectively). Average correlations (derived from alphas) between ratings were moderated by standardization of questions and number of ratings made. Upper limits of validity were estimated to be .67 for highly structured interviews and .34 for unstructured interviews.

Researchers have long been interested in determining how the reliability and the validity of selection-interview ratings can be improved (e.g., Wagner, 1949). Several recent meta-analyses have examined moderators of interview validity (e.g., Huffcutt & Arthur, 1994; McDaniel, Whetzel, Schmidt, & Maurer, 1994; Wiesner & Cronshaw, 1988). These studies have provided useful information, such as the general finding that validity is greater for structured interviews than for unstructured interviews. However, Hakel (1989) proposed another approach to improving interview ratings: studying interview reliability. Hakel suggested a meta-analysis covering interrater coefficients, stabilities over time, and internal consistencies. Note that a reliability coefficient indicating stability over time is an interrater coefficient in which the ratings are based on separate interviews (rather than a panel interview) with the same applicant. Therefore, reliabilities can be divided into three groups: two
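To make the quantitative claims in the abstract above concrete, the sketch below shows two standard classical-test-theory relationships that such figures are typically based on: backing out an average interrater correlation from coefficient alpha (the standardized-alpha / Spearman-Brown relation) and treating the square root of reliability as the ceiling on validity. This is a minimal illustration under those assumptions, not a reconstruction of the authors' exact procedure, and the numeric inputs are hypothetical.

```python
import math

def mean_r_from_alpha(alpha: float, k: int) -> float:
    """Average correlation among k ratings implied by coefficient alpha,
    using the standardized-alpha (Spearman-Brown) relation:
        alpha = k * r_bar / (1 + (k - 1) * r_bar)
    """
    return alpha / (k - (k - 1) * alpha)

def validity_ceiling(reliability: float) -> float:
    """Classical-test-theory bound: a predictor's correlation with any
    criterion cannot exceed the square root of its own reliability."""
    return math.sqrt(reliability)

# Hypothetical inputs, not figures taken from the article:
print(round(mean_r_from_alpha(alpha=0.80, k=3), 2))  # ~0.57 average correlation among ratings

# Reliabilities chosen purely to illustrate the .67 and .34 ceilings quoted above:
print(round(validity_ceiling(0.4489), 2))  # 0.67 upper limit on validity
print(round(validity_ceiling(0.1156), 2))  # 0.34 upper limit on validity
```

The alpha-based relation assumes the k ratings are roughly parallel; it is shown here only to make the reported quantities interpretable.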
The definition of halo error that dominated researchers' thinking for most of this century implied that (a) halo error was common; (b) it was a rater error, with true and illusory components; (c) it led to inflated correlations among rating dimensions and was due to the influence of a general evaluation on specific judgments; and (d) it had negative consequences and should be avoided or removed. We review research showing that all of the major elements of this conception of halo are either wrong or problematic. Because of unresolved confounds of true and illusory halo and the often unclear consequences of halo errors, we suggest a moratorium on the use of halo indices as dependent measures in applied research. We suggest specific directions for future research on halo that take into account the context in which judgments are formed and ratings are obtained and that more clearly distinguish between actual halo errors and the apparent halo effect.

When an individual is rated on multiple performance dimensions or attributes, the rater's overall impression or evaluation is thought to strongly influence ratings of specific attributes (Cooper, 1981b), a phenomenon that is referred to as halo error (Thorndike, 1920). Discussions of halo error are most frequently encountered in the context of evaluative judgment (e.g., in interviews and performance appraisals), but similar phenomena have been noted in research on illusory correlation (Chapman & Chapman, 1969), implicit personality theory (Lay & Jackson, 1969), and interpersonal judgments (Nisbett & Wilson, 1977). Research on halo errors in rating can be traced back to the early part of this century (Thorndike, 1920; Wells, 1907). Although there are a number of different conceptual and operational definitions of halo (Balzer & Sulsky, 1992; Saal, Downey, & Lahey, 1980), throughout most of the history of research on halo error, there has been some consensus regarding the nature and consequences of halo error. First, halo error is thought to be common (
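Because operational definitions of halo vary, a concrete example may help. The sketch below computes one common correlational halo index of the kind reviewed by Saal, Downey, and Lahey (1980): the average intercorrelation among rating dimensions for a single rater, with higher values read as more (apparent) halo. The data and dimensions are hypothetical, and, as the abstract notes, such an index cannot by itself separate true from illusory halo.

```python
import numpy as np

def correlational_halo_index(ratings: np.ndarray) -> float:
    """ratings: a ratee x dimension matrix of scores from one rater.
    Returns the mean off-diagonal correlation among rating dimensions,
    a common apparent-halo index; it does not distinguish true halo
    (dimensions that really covary) from illusory halo (rater error)."""
    corr = np.corrcoef(ratings, rowvar=False)               # dimension x dimension
    off_diagonal = corr[~np.eye(corr.shape[0], dtype=bool)]  # drop the diagonal
    return float(off_diagonal.mean())

# Five hypothetical ratees rated on three dimensions (1-5 scale):
ratings = np.array([
    [4, 4, 5],
    [2, 3, 2],
    [5, 4, 4],
    [3, 3, 3],
    [1, 2, 2],
])
print(round(correlational_halo_index(ratings), 2))  # high value = strong apparent halo
```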
One strategy suggested for improving the accuracy of the complex evaluative judgments involved in performance evaluation is to decompose them into a series of simpler judgments. Another is to collect observations in a distributional rating scheme, in which raters estimate the frequencies of different classes of behavior and performance is assessed in terms of the relative frequencies of effective and ineffective behaviors. In the present study, we compared distributional ratings to Likert-type ratings of videotaped lectures at 3 levels of dimensional decomposition; ratings were evaluated in terms of interrater agreement and rating accuracy. Decomposition led to increased agreement and accuracy, but the use of distributional ratings did not. The practical implications of the results are discussed.

This study was conducted in fulfillment of the requirements for Robert A. Jako's master of science degree from Colorado State University. We thank Douglas Reynolds for comments and suggestions on earlier drafts of this article.
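As a concrete illustration of the two rating formats compared in this study, the sketch below contrasts a single Likert-type judgment with a distributional rating in which the rater estimates how often behavior on a dimension fell into each effectiveness category. The scoring rule (a frequency-weighted mean category) and all numbers are hypothetical; the study's actual materials and scoring may have differed.

```python
# Hypothetical contrast between the two formats (not the study's exact scales).

# Likert-type rating: one global judgment per dimension on a 1-5 scale.
likert_rating = 4

# Distributional rating: the rater estimates the proportion of observed
# behavior falling into each effectiveness category (1 = very ineffective
# ... 5 = very effective); the score reflects the relative frequency of
# effective versus ineffective behavior rather than a single global judgment.
def distributional_score(proportions: list[float]) -> float:
    total = sum(proportions)
    return sum((category + 1) * p for category, p in enumerate(proportions)) / total

estimated_proportions = [0.05, 0.10, 0.15, 0.40, 0.30]   # mostly effective behavior
print(round(distributional_score(estimated_proportions), 2))  # 3.8 on the 1-5 scale
```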