Recent developments in artificial intelligence have improved performance on the automated assessment of extended-response items in mathematics, potentially allowing these items to be scored cheaply and at scale. This study details the grand prize-winning approach to developing large language models (LLMs) to automatically score the ten items in the National Assessment of Educational Progress (NAEP) Math Scoring Challenge. The approach uses extensive preprocessing to balance the class labels for each item: over-represented classes are identified and filtered using a classifier trained on document-term matrices, and under-represented classes are augmented using a generative pre-trained large language model (Grammarly’s Coedit-XL; Raheja et al., 2023). We also use input modification schemes hand-crafted for each item type that incorporate information from parts of the multi-step math problem students had to solve. Finally, we fine-tune several pre-trained large language models on the modified input for each individual item, with DeBERTa (He et al., 2021a) showing the best performance. This approach achieved human-like agreement, defined as a quadratic weighted kappa (QWK) within 0.05 of human–human agreement, on nine of the ten items in a held-out test set.
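To make the agreement criterion concrete, the following is a minimal sketch (not the authors' code) of how human-like agreement could be checked with quadratic weighted kappa using scikit-learn; the variable names and example score labels are illustrative assumptions.

```python
# Sketch of the human-like agreement criterion: a model is treated as human-like
# on an item if its QWK against a human rater is within 0.05 of the QWK between
# the two human raters. All labels below are hypothetical.
from sklearn.metrics import cohen_kappa_score

human_1 = [0, 1, 2, 1, 0, 2, 1]   # scores from human rater 1 (hypothetical)
human_2 = [0, 1, 2, 2, 0, 2, 1]   # scores from human rater 2 (hypothetical)
model   = [0, 1, 1, 1, 0, 2, 1]   # scores from the fine-tuned model (hypothetical)

# Quadratic weighted kappa penalizes larger score disagreements more heavily.
qwk_human_human = cohen_kappa_score(human_1, human_2, weights="quadratic")
qwk_human_model = cohen_kappa_score(human_1, model, weights="quadratic")

# Human-like agreement: model QWK within 0.05 of human-human QWK.
print(qwk_human_human - qwk_human_model < 0.05)
```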