2005
DOI: 10.1002/j.2333-8504.2005.tb01982.x
An Examination of Rater Orientations and Test-Taker Performance on English-for-Academic-Purposes Speaking Tasks

Abstract: …test development efforts. As part of the foundation for the development of the next generation TOEFL test, papers and research reports were commissioned from experts within the fields of measurement, language teaching, and testing through the TOEFL 2000 project. The resulting critical reviews, expert opinions, and research results have helped to inform TOEFL program development efforts with respect to test construct, test user needs, and test delivery. Opinions expressed in these papers are those of the authors…



Cited by 122 publications (199 citation statements); references 43 publications.
“…Based on Brown, Iwashita, & McNamara (2005), the rubrics for the TOEFL iBT Speaking test were reflective of what teachers of English as a second language and applied linguists thought were important in evaluating candidates' speaking performance in an academic environment. However, the features used in the automated scoring model were only a subset of the criteria used by the human raters, reducing the model's power in explaining candidates' performance on real-world speaking tasks.…”
Section: Rebuttals / Counterevidence (mentioning)
confidence: 99%
“…In particular, discrepancy between multiple raters' judgements has always been an area of interest since the 1920s. Some interesting findings relevant to vocabulary assessment in oral examinations have been identified in past studies (Brown, Iwashita, & McNamara, 2005; Lorenzo-Dus & Meara, 2005; Malvern & Richards, 2002; Read, 2000, 2005). To date, however, few studies have focused solely on rater performance in assessing vocabulary.…”
Section: Introduction (mentioning)
confidence: 86%
“…Some findings include: (1) Raters' judgements on vocabulary do not correlate with lexical diversity (D) in oral proficiency interviews (Lorenzo-Dus & Meara, 2005; Malvern & Richards, 2002); (2) Raters exhibit idiosyncratic approaches regarding the saliency of lexical features in assessing vocabulary in oral interviews. They typically make more negative than positive comments on vocabulary (Brown, 2006); (3) Raters' judgements are sensitive to word types, tokens, and difficult words in OPIs (Brown et al., 2005; Lorenzo-Dus & Meara, 2005); (4) Raters have conflicting views on assessing linguistic aspects vis-à-vis pragmatic aspects of vocabulary in oral examinations (Brown et al., 2005); (5) It is difficult for raters to assess vocabulary at adjacent IELTS band levels (Read, 2005); (6) High correlations have been found between subcategories in oral examinations, such as vocabulary, grammar, and fluency (Brown & Taylor, 2006; Malvern & Richards, 2002; Taylor & Jones, 2001); and (7) Vocabulary and grammar were prone to be rated more harshly than other constructs in oral examinations (Galaczi, 2005).…”
Section: Literature Review (mentioning)
confidence: 99%
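To make finding (1) concrete: the "D" cited above is the vocd-D measure, which is estimated by fitting a curve to type-token ratios computed over repeated random samples of a transcript. As a minimal illustrative sketch only (not the vocd procedure itself), two simpler lexical-diversity proxies, plain TTR and Guiraud's root TTR, can be computed like this; the sample transcript is invented for illustration:

```python
def type_token_ratio(tokens):
    """Plain TTR: number of unique word types divided by total tokens.
    Known to shrink as text length grows, which motivated measures like D."""
    return len(set(tokens)) / len(tokens)

def root_ttr(tokens):
    """Guiraud's index: types / sqrt(tokens), less sensitive to text length."""
    return len(set(tokens)) / len(tokens) ** 0.5

# Hypothetical one-line "transcript" standing in for an oral-interview response.
transcript = "the cat sat on the mat and the dog sat on the rug".split()
print(f"TTR     = {type_token_ratio(transcript):.3f}")  # 8 types / 13 tokens
print(f"rootTTR = {root_ttr(transcript):.3f}")
```

The studies quoted above found that raters' vocabulary judgements did not track indices of this kind, suggesting raters attend to more than sheer lexical variety.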
“…has assumed that different ratings can be controlled by rater training with explicit assessment criteria and samples of performance at different levels [8].…”
Section: Brown (mentioning)
confidence: 99%
“…English language oral proficiency is usually evaluated by human raters, mostly native speakers [8]. Raters play a major role in the assessment process and influence the quality and meaning of scores obtained.…”
(mentioning)
confidence: 99%