Monitoring Sources of Variability Within the Test of Spoken English Assessment System

Myford, Carol M.; Wolfe, Edward W.

doi:10.1002/j.2333-8504.2000.tb01829.x

Cited by 35 publications

(38 citation statements)

References 16 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…In conclusion, the findings of the present study concur with the previous studies by confirming that raters may be affected by factors other than the actual performance of the test-takers (e.g., Chalhoub-Deville, 1995;Chalhoub-Deville & Wigglesworth, 2005;Lumley & McNamara, 1995;Myford & Wolfe, 2000;Winke & Gass, 2012). Whether random or systematic, similar to the other studies, measurement error was observed in this study underlining the influential factors that may cause disagreement within and/or among the raters' judgments in oral performance assessments.…”

Section: Discussionsupporting

confidence: 91%

“…In other words, 75 % of the Total Scores assigned by these 15 raters ranked lower or higher in the post-test, ranging from one point difference to more than 10 points. As discussed by Myford and Wolfe (2000), one point may not seem like or be considered as a large difference, but it can have an important effect for the test takers whose scores are around borderline/pass score. Figure 2 below presents the results about the raters' behavior in terms of (a) whether there was a statistically significant difference between their pre-and post-test scores, and (b) whether they referred to the proficiency levels of the students in their think aloud protocols.…”

Section: Resultsmentioning

confidence: 99%

“…While four main types of rater effects (i.e., halo effects, central tendency, restriction of range, and leniency/severity) are discussed in much detail in the literature (e.g., Lumley & McNamara, 1995;Myford & Wolfe, 2000), the construct-irrelevant factors, the factors other than the actual performances of test-takers that affect raters' behaviors, scoring process, and final scorings, have not been completely explored (Kang, 2012;Stoynoff, 2012). The differences in rater behaviors in terms of leniency/severity toward a particular performance have led researchers to look at another aspect of fair scoring: bias which is an important concept in language testing since test results should be "free from bias" (Weir, 2005, p. 23).…”

Section: Introductionmentioning

confidence: 99%

“…Previous studies have investigated rater effects on oral test scores from different perspectives such as the raters' educational and professional experience (e.g., Chalhoub-Deville, 1995), raters' nationality and native language (e.g., Chalhoub-Deville & Wigglesworth, 2005;Winke & Gass, 2012;Winke et al, 2011), rater training (e.g., Lumley & McNamara, 1995;Myford & Wolfe, 2000), and the gender of candidates and/or interviewers (e.g., O'Loughlin, 2002;O'Sullivan, 2000). For instance, Lumley and McNamara (1995) examined the effect of rater training on the stability of rater characteristics and rater bias whereas MacIntyre, Noels, and Clément (1997) examined bias in self-ratings in terms of participants' perceived competence in an L2 in relation to their actual competence and language anxiety.…”

Section: Introductionmentioning

confidence: 99%

See 3 more Smart Citations

Raters Knowledge of Students Proficiency Levels as a Source of Measurement Error in Oral Assessments

Tanriverdi-Köksal¹

2017

HUJE

View full text Add to dashboard Cite

There has been an ongoing debate on the reliability of oral exam scores due to the existence of human raters and the factors that might account for differences in their scorings. This quasi-experimental study investigated the possible effect(s) of the raters' prior knowledge of students' proficiency levels on rater scorings in oral interview assessments. The study was carried out in a pre-and post-test design with 15 EFL instructors who performed as raters in oral assessments at a Turkish state university. In both pre-and post-tests, the raters assigned scores to the same video-recorded oral interview performances of 12 students from three different proficiency levels. While rating the performances, the raters also provided verbal reports about their thought processes. The raters were not informed about the students' proficiency levels in the pre-test, while this information was provided in the posttest. According to the findings, majority of the Total Scores ranked lower or higher in the post-test. The thematic analysis of the raters' video recorded verbal reports revealed that most of the raters referred to the proficiency levels of the students while assigning scores in the post-test. The findings of the study suggest that besides factors such as accent, nationality, and gender of the test-takers and the assessors, raters' prior knowledge of students' proficiency levels could be a variable that needs to be controlled for more reliable test results.Keywords: rater effects, intra-rater reliability, paired oral exams, think-aloud protocols ÖZ: Yaygın olarak kullanılmakta olsa da notlandıran olarak insan faktörünün varlığı ve notlardaki farklılığa neden olan etmenler sebebiyle konuşma sınav notlarının güvenirliği konusunda süregelen bir tartışma vardır. Bu yarı deneysel çalışma, konuşma sınavlarının değerlendirilmesinde, not verenlerin öğrencilerin dil yeterlilik seviyelerini önceden biliyor olmasının verdikleri notları üzerindeki etkilerini araştırmayı amaçlamaktadır. Bu çalışma, Türkiye'deki bir devlet üniversitesinde yabancı dil olarak İngilizce öğreten ve aynı üniversitede konuşma sınavlarında notlandıran olarak görev alan 15 okutman ile ön ve son test olarak iki oturumda yürütülmüştür. Hem ön hem de son testte, notlandıranlar üç farklı seviyeden 12 öğrencinin video kaydına alınmış aynı konuşma sınavı performansları için not vermiştir. Aynı zamanda, performanslar için not verirken, notlandıranlar eş zamanlı olarak ne düşündükleri ile ilgili sözlü bildirimde bulunmuştur. Öğrencilerin dil yeterlilik seviyeleri ile ilgili ön testte herhangi bir bilgi verilmezken, notlandıranlar öğrencilerin seviyeleri konusunda son testte sözlü ve yazılı olarak bilgilendirilmiştir. Sonuçlara göre, önteste kıyasla son testte Toplam Notların büyük çoğunluğunun son testte düştüğü veya yükseldiği saptanmıştır. Tüm notlandıranların video kayıtlı sözlü bildirimleri tematik olarak incelendiğinde, notlandıranların çoğunun son testte not verirken öğrencilerin dil yeterlilik seviyelerine değindikleri gözlemlenmiştir. Çalışmanın ...

show abstract

Section: Discussionsupporting

confidence: 91%

Section: Resultsmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

See 2 more Smart Citations

Raters Knowledge of Students Proficiency Levels as a Source of Measurement Error in Oral Assessments

Tanriverdi-Köksal¹

2017

HUJE

View full text Add to dashboard Cite

show abstract

“…The extension of the Rasch model for analyses of assessor-mediated ratings is called the many-faceted Rasch model (FACETS model, Linacre, 1989). The FACETS model has been used to examine the psychometric quality of a variety of performance assessments based on assessor-mediated ratings (e.g., Engelhard, 1992Engelhard, , 1994Engelhard, , 1996Heller, Sheingold, & Myford, 1998;Linacre, Engelhard, Tatum, & Myford, 1994;Lunz & Stahl, 1990;Lunz, Wright, & Linacre, 1990;Myford, Marr, & Linacre, 1996;Myford & Mislevy, 1995;Myford & Wolfe, 2000a;Paulukonis, Myford, & Heller, 2000;Wolfe, Chiu, & Myford, 1999). It should be stressed that the FACETS model provides additional information that supplements, rather than supplants, the inferences provided by more traditional methods that have been used previously to analyze the NBPTS assessments.…”

mentioning

confidence: 99%

Investigating Assessor Effects in National Board for Professional Teaching Standards Assessments for Early Childhood/Generalist and Middle Childhood/Generalist Certification

Engelhard¹,

Myford²,

Cline³

2000

ETS Research Report Series

Self Cite

View full text Add to dashboard Cite

The purposes of this study were to (1) examine, describe and evaluate the rating behavior of assessors scoring National Board for Professional Teaching Standards Early Childhood/Generalist or Middle Childhood/Generalist candidates, and (2) explore the effects of employing scoring designs using one assessor per exercise rather than two. Data from the 1997-98 Early Childhood/Generalist and Middle Childhood/Generalist assessments were analyzed using FACETS (Linacre, 1989) and SAS. While assessors differed somewhat in severity, most used the 12-point rating scale consistently. Residual severity effects that persisted after rater training tended to cancel out when ratings were averaged across the ten exercises to produce a scaled score for a candidate. The results of the decision consistency analyses suggest that having one assessor score each exercise would not result in a large increase in candidates incorrectly denied or awarded certification. However, the decision consistency analyses by exercise reveal the low reliability of the exercise banking decisions. Variation in assessor severity can affect the fairness of these decisions.

show abstract

Elicited Speech From Graph Items on the Test of Spoken English™

Katz

Kim

et al. 2004

ETS Research Report Series

View full text Add to dashboard Cite

This research applied a cognitive model to identify item features that lead to irrelevant variance on the Test of Spoken English™ (TSE ® ). The TSE is an assessment of English oral proficiency and includes an item that elicits a description of a statistical graph. This item type sometimes appears to tap graph-reading skills-an irrelevant construct; TSE raters report that many examinees perform worse on this item type than they do on the other 11 items in the test. We adapted a cognitive theory of graph comprehension to predict the degree to which TSE graph items tap irrelevant skills such as graph reading. Through analyses of existing TSE data as well as an experiment, we show how the theory provides specific, empirically justified recommendations on the construction of graph items that minimize the influence of extraneous skills. ETS administers the TOEFL program under the general direction of a policy board that was established by, and is affiliated with, the sponsoring organizations. Members of the TOEFL Board (previously the Policy Council) represent the College Board, the GRE Board, and such institutions and agencies as graduate schools of business, junior and community colleges, nonprofit educational exchange agencies, and agencies of the United States government.A continuing program of research related to the TOEFL test is carried out under the direction of the TOEFL Committee of Examiners. Its 12 members include representatives of the TOEFL Board and distinguished English as a second language specialists from the academic community. The Committee meets twice yearly to review and approve proposals for test-related research and to set guidelines for the entire scope of the TOEFL research program. Members of the Committee of Examiners serve four-year terms at the invitation of the Board; the chair of the committee serves on the Board.Because the studies are specific to the TOEFL test and the testing program, most of the actual research is conducted by ETS staff rather than by outside researchers. Many projects require the cooperation of other institutions, however, particularly those with programs in the teaching of English as a foreign or second language and applied linguistics. Representatives of such programs who are interested in participating in or conducting TOEFL-related research are invited to contact the TOEFL program office. All TOEFL research projects must undergo appropriate ETS review to ascertain that data confidentiality will be protected. Current (2003Current ( -2004 members of the TOEFL Committee of Examiners are:

show abstract

Monitoring Sources of Variability Within the Test of Spoken English Assessment System

Cited by 35 publications

References 16 publications

Raters Knowledge of Students Proficiency Levels as a Source of Measurement Error in Oral Assessments

Raters Knowledge of Students Proficiency Levels as a Source of Measurement Error in Oral Assessments

Investigating Assessor Effects in National Board for Professional Teaching Standards Assessments for Early Childhood/Generalist and Middle Childhood/Generalist Certification

Elicited Speech From Graph Items on the Test of Spoken English™

Contact Info

Product

Resources

About