When using the Rasch model, equating with a nonequivalent groups anchor test design is commonly achieved by adjusting new form item difficulty with an additive equating constant. Using simulated 5‐year data, this report compares 4 approaches to calculating the equating constant and the subsequent impact on equating results. The 4 approaches are mean difference, mean difference with outlier removal using the 0.3 logit rule, mean difference with the robust z statistic, and the information‐weighted mean difference. Factors studied included sample size, anchor test length, percentage of anchor items displaying outlier behavior, and the distribution of test item difficulty relative to examinee ability. The results indicated that the mean difference and information‐weighted mean difference methods performed similarly across all conditions. In addition, with larger sample sizes, the mean difference with 0.3 logit method performed similarly to these 2 methods. The mean difference with robust z method performed most differently from the other 3 methods of calculating the equating constant. It removed a larger percentage of the anchor items than the mean difference with 0.3 logit method but seemed to produce the most stable trend in performance classification across the 5 years, particularly when sample sizes were large.
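To make the 4 approaches concrete, the minimal Python sketch below (not the report's code) computes each constant from anchor‐item difficulty estimates. The function name and inputs are hypothetical, the 2.7 robust‐z cutoff is an illustrative value, and using the combined error variance as the information weight is an assumption about the information‐weighted variant.

```python
import numpy as np

def equating_constants(b_ref, b_new, se_ref, se_new):
    """Four illustrative additive Rasch equating constants computed from
    anchor-item difficulties (numpy arrays) on the reference and new forms."""
    d = b_ref - b_new                        # per-anchor difficulty drift

    # 1. Mean difference: the plain average drift across anchor items.
    c_mean = d.mean()

    # 2. Mean difference with the 0.3 logit rule: drop anchors whose drift
    #    deviates from the mean by more than 0.3 logits (one pass here;
    #    in practice the rule is often applied iteratively).
    keep = np.abs(d - d.mean()) <= 0.3
    c_logit = d[keep].mean()

    # 3. Mean difference with robust z: standardize drift with the median
    #    and 0.74 * IQR (a robust SD estimate); 2.7 is an illustrative cutoff.
    q75, q25 = np.percentile(d, [75, 25])
    z = (d - np.median(d)) / (0.74 * (q75 - q25))
    c_robust = d[np.abs(z) <= 2.7].mean()

    # 4. Information-weighted mean: weight each drift by the reciprocal of
    #    its combined error variance, so well-estimated anchors count more.
    w = 1.0 / (se_ref**2 + se_new**2)
    c_info = np.average(d, weights=w)

    return c_mean, c_logit, c_robust, c_info
```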
This study assessed the factor structure of the Test of English for International Communication (TOEIC®) Listening and Reading test, and its invariance across subgroups of test-takers. The subgroups were defined by (a) gender, (b) age, (c) employment status, (d) time spent studying English, and (e) having lived in a country where English is the main language. The study results indicated that a correlated two-factor model corresponding to the two language abilities of listening and reading best accounted for the factor structure of the test. In addition, the underlying construct had the same structure across the test-taker subgroups studied. There were, however, significant differences in the means of the latent construct across the subgroups. This study provides empirical support for the current score reporting practice for the TOEIC test, suggests that the test scores have the same meaning across studied test-taker subgroups, and identifies possible test-taker background characteristics that affect English language abilities as measured by the TOEIC test.
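As an illustration of the retained model, the following sketch fits a correlated two‐factor CFA in Python with the semopy package; the part‐score indicator names are hypothetical stand‐ins for the TOEIC section parts, and this is not the study's actual code or software.

```python
import pandas as pd
from semopy import Model

# lavaan-style description: listening and reading as correlated factors.
# Indicator names (part1..part7) are hypothetical part-level scores.
desc = """
Listening =~ part1 + part2 + part3 + part4
Reading   =~ part5 + part6 + part7
Listening ~~ Reading
"""

data = pd.read_csv("toeic_parts.csv")  # placeholder data source
model = Model(desc)
model.fit(data)
print(model.inspect())  # loadings, factor covariance, residual variances
```

Invariance across subgroups would then be examined by refitting the model per group with progressively stronger equality constraints (configural, metric, scalar) and comparing model fit.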
This study examined the heterogeneity in the English‐as‐a‐second‐language (ESL) test population by modeling the relationship between test‐taker background characteristics and test performance as measured by the TOEFL iBT® using a confirmatory factor analysis (CFA) with covariates approach. The background characteristics studied included (a) main reason for taking the TOEFL iBT test; (b) time spent studying English; (c) time spent attending a school, college, or university in which content classes were taught in English; and (d) having lived in a country where English is the main language. The results indicated that at most levels of the background characteristics studied, there were statistically significant differences in the means of the four underlying latent factors (reading, listening, speaking, and writing) representing English‐language proficiency (ELP). Overall, the effect size differences on the reading, listening, speaking, and writing latent factors among the levels of each background variable studied ranged from small to medium. The results provide empirical evidence that test‐taker background characteristics are associated with, and possibly influence, the four underlying latent factors representing ELP and, thus, test performance.
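A CFA‐with‐covariates (MIMIC‐type) specification can be sketched the same way; the indicator and covariate names below are hypothetical, and the study's actual measurement model for the four TOEFL iBT sections may be specified differently.

```python
import pandas as pd
from semopy import Model

# Each latent factor is measured by hypothetical indicators and
# regressed on dummy-coded background characteristics (MIMIC model).
desc = """
Reading   =~ r1 + r2 + r3
Listening =~ l1 + l2 + l3
Speaking  =~ s1 + s2 + s3
Writing   =~ w1 + w2
Reading   ~ study_time + taught_in_english + lived_english_country
Listening ~ study_time + taught_in_english + lived_english_country
Speaking  ~ study_time + taught_in_english + lived_english_country
Writing   ~ study_time + taught_in_english + lived_english_country
"""

df = pd.read_csv("toefl_ibt.csv")  # placeholder: indicators plus covariates
model = Model(desc)
model.fit(df)
print(model.inspect())  # loadings and covariate effects on the latent factors
```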
This study used survival analysis to examine the patterns and factors associated with time to achieving designated score criteria on a test of English as a foreign language. Time to achievement was modeled using an extension of the Cox regression model, with two criterion score levels defined as achieving a TOEFL iBT® total scale score at or above the Common European Framework of Reference (CEFR) Level B2 and at or above Level C1, respectively. Factors included in the model were test-taker background characteristics, including age, gender, native language type, exposure to English, and reason for testing. Additionally, to account for those who tested more than once within the study period, and thus had multiple records, an indicator for the order of testing occasions was included in the model. Results indicate that approximately 82% of the test takers in the study sample tested only once in the study period (2014–2016) and that the number of repeaters decreased rapidly across occasions. For those who did not achieve the designated criterion scores at first testing, the likelihood of achievement increased with repeated testing, with a somewhat greater effect for the less stringent B2 criterion. Results also indicate that the association of gender with performance differed across the two criterion levels.
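As a minimal illustration of this kind of model, the sketch below fits a Cox regression with the lifelines package in Python; the column names are hypothetical placeholders, and the report's actual extension for repeated testing occasions may differ.

```python
import pandas as pd
from lifelines import CoxPHFitter

# One row per testing occasion: time to the occasion, an event flag for
# reaching the B2 cut score, dummy-coded background characteristics,
# and the occasion-order indicator. Column names are placeholders.
df = pd.read_csv("toefl_attempts.csv")

cph = CoxPHFitter()
cph.fit(
    df,
    duration_col="days_to_occasion",
    event_col="reached_b2",       # refit with a C1 flag for the stricter criterion
    cluster_col="test_taker_id",  # robust SEs for repeaters with multiple rows
)
cph.print_summary()
```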