A better understanding of the statistics used in test revision is of paramount importance for language testers, who can use these measures to further improve the validity of the interpretation and use of test scores. To this end, this chapter examines three types of statistics. First, it describes statistics for understanding the distribution of examinee scores, including means, medians, modes, standard deviations, skewness, and kurtosis. Second, it describes statistics for evaluating the characteristics of items, tasks, and tests, with particular emphasis on item difficulty, discrimination (biserial and point‐biserial correlations, as well as phi statistics), and distracter analysis. Third, the chapter focuses on statistics for analyzing rating scales and raters, including inter‐rater and intra‐rater reliability and kappa statistics. More emphasis is placed on generalizability theory (G‐theory) and many‐facet Rasch measurement, two methods that have been widely used among language testers. G‐theory is helpful in systematically investigating the reliability of instruments under specific conditions by considering multiple sources of error. Many‐facet Rasch measurement is well suited to examining rater severity or leniency, rater consistency, interaction between rater and item (called rater‐by‐item bias), and the difficulty level of each task. An overview of these three types of statistics is followed by a comparative discussion of software useful for test revision, including general‐purpose software (Excel, R, SAS, and SPSS) and specific‐purpose software (ITEMAN, GENOVA, and FACETS). Reporting practices and examples using real data are described for the three types of statistics. We also discuss the benefit of reporting commands, scripts, or syntax (with annotated comments) whenever possible and appropriate.
The chapter concludes by stressing the need for sound statistical reporting practice in test revision, so that readers can understand how items, tasks, and tests have been revised based on the analyses and how these revisions have improved the validity argument for a particular instrument.