With the increasing recognition and application of casemix for managing and financing healthcare resources, the evaluation of alternative versions of systems such as diagnosis-related groups (DRGs) has been afforded high priority by governments and researchers in many countries. Outside the United States, an important issue has been the perceived need to produce local versions, and to establish whether these perform more effectively than the US-based classifications. A discussion of casemix evaluation criteria highlights the large number of measures that may be used, the rationale and assumptions underlying each measure, and the problems in interpreting the results. A review of recent evaluation studies from a number of countries indicates that considerable emphasis has been placed on the predictive validity criterion, as measured by the R² statistic. However, the interpretation of the findings has been affected greatly by the methods used, especially the treatment and definition of outlier cases. Furthermore, the extent to which other evaluation criteria have been addressed has varied widely. In the absence of minimum evaluation standards, it is not possible to draw clear-cut conclusions about the superiority of one version of a casemix system over another, the need for a local adaptation, or the further development of an existing version. Without the evidence provided by properly designed studies, policy-makers and managers may place undue reliance on subjective judgments and the views of the most influential, but not necessarily best informed, healthcare interest groups.
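The predictive validity criterion and its sensitivity to outlier handling can be sketched with a small example. The R² reported in such studies is the proportion of variance in a resource measure (cost or length of stay) explained by group membership, i.e. 1 − SS_within / SS_total across the casemix classes. The data, group labels, and function below are hypothetical, purely to illustrate how trimming a single high-cost outlier can change the reported R²; they do not come from any of the studies reviewed.

```python
from statistics import mean

def grouping_r_squared(costs, groups):
    """R² of a categorical grouping: 1 - SS_within / SS_total."""
    grand = mean(costs)
    ss_total = sum((c - grand) ** 2 for c in costs)
    by_group = {}
    for c, g in zip(costs, groups):
        by_group.setdefault(g, []).append(c)
    ss_within = sum(
        sum((c - mean(members)) ** 2 for c in members)
        for members in by_group.values()
    )
    return 1 - ss_within / ss_total

# Invented episode costs and DRG labels; the last case is a cost outlier.
costs  = [100, 110, 105, 400, 390, 410, 2000]
groups = ["A", "A", "A", "B", "B", "B", "B"]

r2_all = grouping_r_squared(costs, groups)

# A common (but study-specific) choice: trim outliers before computing R².
r2_trimmed = grouping_r_squared(costs[:-1], groups[:-1])

print(f"R² with outlier: {r2_all:.3f}")
print(f"R² after trimming: {r2_trimmed:.3f}")
```

With the outlier included, the within-group variance of group B dominates and R² is low; after trimming, the same classification appears to explain nearly all of the variance. This is the mechanism behind the abstract's caution that R² comparisons across studies depend heavily on how outlier cases are defined and treated.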