A vast body of methodological research on criterion-referenced testing has been amassed over the past decade. Much of that research is synthesized in the articles contained in this issue. The fact that this issue is devoted exclusively to criterion-referenced testing sets it apart as a quintessential journal publication on the topic. This paper is intended to provide a broad framework for understanding and evaluating the individual contributions in the context of the literature. The six articles appear to fall into four major categories: (1) test length; (2) validity; (3) standard setting; and (4) reliability. These categories correspond to most of the technical topics in the test development process (see, e.g., Berk, 1980b; Hambleton, 1980).
Test Length
If a teacher or curriculum specialist asked a psychometrician, 'How many items should be written for each objective?' or 'How many items should be sampled from the domain?' what answer could be given? Whether the test is norm referenced or criterion referenced, there is no available source that recommends a magical number of items. The guidelines offered in most measurement texts and technical papers on the topic are rather nonspecific; however, this perplexing issue cannot be dismissed or ignored, as are the properties of validity and reliability analyzed in subsequent sections of this paper. Every test maker, teacher through test publisher, must answer the test length question.
The problem of determining test length can be approached from a practical perspective based on research evidence and/or from a purely technical perspective. The former has been explicated previously in terms of a multiplicity of factors, including importance and type of decision making, importance and emphases assigned to objectives, number of objectives, and practical constraints (Berk, 1980c); the latter constitutes the orientation of Wilcox's article.
Wilcox has organized his review according to three achievement test conceptualizations based on three different types of true score: (1) the number of items a student would answer correctly if every item in the item domain was answered; (2) the proportion of skills among a domain of skills that a