In this research, we investigated the suitability of implementing e‐rater® automated essay scoring in a high‐stakes large‐scale English language testing program. We examined the effectiveness of generic scoring and 2 variants of prompt‐based scoring approaches. Effectiveness was evaluated on a number of dimensions, including agreement between the automated and the human scores and relations with criterion variables. Results showed that the sample size was generally not sufficient for prompt‐specific scoring. For the generic scoring model, automated scores agreed with human raters as strongly as, or more strongly than, human raters agreed with one another for more than 97% of the prompts. Substituting e‐rater for the second human rater had no practically important impact on test takers' scores at either the item or the total test score level. However, neither the automated scoring models nor the human raters performed invariantly across all prompts or across different test countries/territories. Further investigation indicated homogeneity in the examinee population, possibly nested within test countries/territories, as one potential cause of this lack of invariance. Among other limitations, findings may not be generalizable beyond the examinee population investigated in this study.