Rating scale design and development for testing speaking is generally conducted using one of two approaches: the measurement-driven approach or the performance data-driven approach. The measurement-driven approach prioritizes the ordering of descriptors onto a single scale. Meaning is derived from the scaling methodology and the agreement of trained judges as to the place of any descriptor on the scale. The performance data-driven approach, on the other hand, places primary value upon observations of language performance, and attempts to describe performance in sufficient detail to generate descriptors that bear a direct relationship with the original observations of language use. Meaning is derived from the link between performance and description. We argue that measurement-driven approaches generate impoverished descriptions of communication, while performance data-driven approaches have the potential to provide richer descriptions that offer sounder inferences from score meaning to performance in specified domains. With reference to original data and the literature on travel service encounters, we devise a new scoring instrument, a Performance Decision Tree (PDT). This instrument prioritizes what we term 'performance effect' by explicitly valuing and incorporating performance data from a specific communicative context. We argue that this avoids the reification of ordered scale descriptors which we find in measurement-driven scale construction for speaking tests.
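To make the contrast concrete, a Performance Decision Tree routes the rater through a small set of binary, context-grounded observations rather than asking for a match against ordered band descriptors. The sketch below is purely illustrative: the questions, score bands, and names are invented for this example and are not the criteria of the instrument developed in the paper.

from dataclasses import dataclass
from typing import Union


@dataclass
class Node:
    """A binary decision point: a yes/no observation about the performance."""
    question: str
    yes: Union["Node", int]  # next decision, or a final score band, if 'yes'
    no: Union["Node", int]   # next decision, or a final score band, if 'no'


# Illustrative tree for a travel service encounter (invented criteria).
pdt = Node(
    question="Does the speaker open and close the encounter appropriately?",
    yes=Node(
        question="Is the transaction completed without communication breakdown?",
        yes=4,
        no=3,
    ),
    no=Node(
        question="Can the listener still recover the speaker's intended request?",
        yes=2,
        no=1,
    ),
)


def score(node: Union[Node, int], answers: dict) -> int:
    """Walk the tree using a rater's yes/no answers, keyed by question text."""
    while isinstance(node, Node):
        node = node.yes if answers[node.question] else node.no
    return node


# Example: one rater's observations of a single performance (hypothetical).
answers = {
    "Does the speaker open and close the encounter appropriately?": True,
    "Is the transaction completed without communication breakdown?": False,
}
print(score(pdt, answers))  # -> 3

The point of the sketch is only structural: each branch records an observable feature of the performance in its communicative context, so the resulting score band remains traceable to those observations rather than to the wording of a generic descriptor.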
Just like buildings, tests are designed and built for specific purposes, people, and uses. However, both buildings and tests grow and change over time as the needs of their users change. Sometimes, they are also both used for purposes other than those intended in the original designs. This paper explores architecture as a metaphor for language test development. Firstly, it describes test purpose and use, and how this affects test design. Secondly, it describes and illustrates the layers of test architecture and design. Thirdly, it discusses the concept of test retrofit, which is the process of altering the test after it has been put into operational use. We argue that there are two types of test retrofit: an upgrade and a change. Each type of retrofit implies changes to layers of the test architecture which must be articulated for a validity argument to be constructed and evaluated. As is true in architecture, we argue that a failure to be explicit about retrofit seriously limits validity claims and clouds issues surrounding the intended effect of the test upon users.
Content considerations are widely viewed as essential in the design of language tests, and evidence of content relevance and coverage provides an important component in the validation of score interpretations. Content analysis can be viewed as the application of a model of test design to a particular measurement instrument, using judgements of trained analysts. Following Bachman (1990), a content analysis of test method characteristics and components of communicative language ability was performed by five raters on six forms of an EFL test from the University of Cambridge Local Examinations Syndicate. To investigate rater agreement, generalizability analysis and a new agreement statistic (the rater agreement proportion or 'RAP') were used. Results indicate that the overall level of rater agreement was very high, and that raters were more consistent in rating method than ability. To examine interform comparability, method/ability content analysis characteristics (called 'facets') which differed by more than one standard deviation of either form were deemed to be salient. Results indicated that not all facets yielded substantive information about interform content comparability, although certain test characteristics could be targeted for further revision and development. The relationships between content analysis ratings and two-parameter IRT item parameter estimates (difficulty and discrimination) were also investigated. Neither test method nor ability ratings by themselves yielded consistent predictions of either item discrimination or difficulty across the six forms examined. Fairly high predictions were consistently obtained, however, when method and ability ratings were combined. The implications of these findings, as well as the utility of content analysis in operational test development, are discussed.
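For readers less familiar with the item statistics mentioned above: under the two-parameter logistic IRT model, the probability that an examinee of ability theta answers item i correctly is modelled as

\[
P_i(\theta) \;=\; \frac{1}{1 + \exp\!\bigl[-a_i(\theta - b_i)\bigr]},
\]

where b_i is the item's difficulty and a_i its discrimination (the slope of the item characteristic curve at b_i). This is the standard 2PL formulation; the abstract does not specify whether a scaling constant was used in estimation. These are the two parameters that the content ratings predicted well only when method and ability ratings were combined.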
This study has sought to demonstrate the utility of Rasch Model scalar analysis when applied to self-ratings of ability/difficulty associated with component skills of English as a second language. Eleven skill areas were rated for difficulty on a seven-point Likert-type scale by 228 ESL students at the University of California. Following appropriate tests of unidimensionality, both skill area items and rating categories were calibrated for difficulty, examined for fit to the Rasch Model, and plotted to provide visual representation of the nature of the item characteristic curves. Specific suggestions were made for the improvement of the rating categories of the self-rating scale, and skill areas most susceptible to self-rating error were identified. It was concluded that scalar analysis of the kind considered here is feasible with self-rating data, and that other rating scale procedures, such as those employed to rate proficiency in foreign language speaking or writing, would probably benefit from similar scalar analyses.

In the measurement of ability in the use of English as a second or foreign language, historically much use has been made of performance ratings, whether ratings consisted of self-estimates of ability or judgements by qualified others. In the measurement of ability in productive skills of language use it is likely that such rating scales and procedures are here to stay. This being the case, it is important that every precaution be taken and every tool be employed to ensure that rating scales be applied in the most accurate, meaningful and readily interpretable manner possible. Wright and Masters (1982) have noted that for any meaningful measurement to take place, four basic requirements must be met:
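The scalar analysis described above is conventionally carried out with the rating scale formulation of the Rasch model presented in Wright and Masters (1982); assuming that formulation (the excerpt does not state the exact variant used), the probability that person n responds in category k of item i with M + 1 ordered categories is

\[
P_{nik} \;=\; \frac{\exp\!\Bigl[\sum_{j=1}^{k} (\beta_n - \delta_i - \tau_j)\Bigr]}
{\sum_{m=0}^{M} \exp\!\Bigl[\sum_{j=1}^{m} (\beta_n - \delta_i - \tau_j)\Bigr]},
\qquad k = 0, 1, \ldots, M,
\]

where an empty sum is taken as zero, beta_n is person n's position on the latent variable, delta_i is the scale value (difficulty) of item i, and tau_j is the threshold between categories j - 1 and j. The item difficulties and category thresholds estimated under this model are the quantities the study calibrates and examines for fit.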