Test theory addresses statistical methods and criteria for the design, evaluation, and comparison of “tests,” that is, standardized procedures for measuring constructs of interest, typically encompassing several test items. These methods and criteria are based on assumptions about the composition of the measurements, the properties of these components, and the relationship between them. The most relevant test-theoretical frameworks are classical test theory (CTT) and, as a representative of “modern” test theories, item response theory (IRT). The basic assumptions and procedures of CTT were developed predominantly during the first half of the twentieth century. Within this framework, influential methods and criteria were developed for the evaluation of tests as a whole, for example, reliability, and for individual test items, for example, item difficulty and item discrimination. Since then, several serious limitations of CTT have been identified, such as the test and sample dependency of CTT's person and item parameters, incomparable scaling of person and item parameters, empirical interdependencies between parameters that should theoretically be independent, and the assumption of reliability as a uniform property of a test. Modern measurement models, such as IRT, avoid or overcome these limitations of CTT. However, they do so only at the expense of more demanding assumptions, greater mathematical complexity, and more time and effort (including larger sample sizes) needed for test construction and evaluation. Moreover, CTT‐ and IRT‐based evaluations often come to comparable conclusions regarding the quality and exclusion of items.
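To make the CTT criteria named above more concrete, the following minimal sketch computes item difficulty (proportion of correct answers), item discrimination (here taken as the corrected item-total correlation, one common choice), and reliability (here estimated with Cronbach's alpha, one common estimator). The response matrix is invented purely for illustration, and all function names are hypothetical rather than taken from any particular package.

```python
from statistics import mean, variance

# Hypothetical data: 6 persons (rows) x 4 dichotomous items (columns),
# 1 = correct, 0 = incorrect. Invented for illustration only.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]


def column(matrix, j):
    """Return the scores of all persons on item j."""
    return [row[j] for row in matrix]


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def item_difficulty(matrix):
    """CTT item difficulty: proportion of persons answering each item correctly."""
    return [mean(column(matrix, j)) for j in range(len(matrix[0]))]


def item_discrimination(matrix):
    """Corrected item-total correlation: item score vs. sum of the other items."""
    result = []
    for j in range(len(matrix[0])):
        item = column(matrix, j)
        rest = [sum(row) - row[j] for row in matrix]
        result.append(pearson(item, rest))
    return result


def cronbach_alpha(matrix):
    """Cronbach's alpha from item variances and the variance of the sum score."""
    k = len(matrix[0])
    item_vars = [variance(column(matrix, j)) for j in range(k)]
    total_var = variance([sum(row) for row in matrix])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


difficulty = item_difficulty(responses)
discrimination = item_discrimination(responses)
alpha = cronbach_alpha(responses)

print("difficulty:    ", [round(p, 2) for p in difficulty])
print("discrimination:", [round(r, 2) for r in discrimination])
print("alpha:         ", round(alpha, 3))
```

Note that every value computed here depends on the particular sample of persons and the particular set of items, which illustrates the sample and test dependency of CTT parameters mentioned above: a more able sample would yield higher difficulty values (more correct answers) for the same items.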