Background: Comparing outcomes at different neonatal intensive care units (NICUs) requires adjustment for intrinsic risk. The Clinical Risk Index for Babies (CRIB) is a widely used risk model, but it has been criticized for being affected by therapeutic decisions. The Prematurity Risk Evaluation Measure (PREM) is intended not to be prone to treatment bias, but has not yet been validated. Objectives: We aimed to validate the PREM, compare its accuracy with that of the original and modified versions of the CRIB and CRIB-II, and examine the congruence of risk categorization. Methods: Very-low-birth-weight (VLBW) infants with a gestational age (GA) <33 weeks, who were admitted to NICUs in Baden-Württemberg from 2003 to 2008, were identified from the German neonatal quality assurance program. CRIB, CRIB-II and PREM scores were calculated. Modified scores not influenced by NICU therapy were then obtained by omitting variables that directly reflected therapeutic decisions [the applied fraction of inspired oxygen (FiO2)] or that may have been prone to early-treatment bias (base excess and temperature). Score performance was assessed by the area under the receiver operating characteristic (ROC) curve (AUC). Results: The CRIB showed the largest AUC (0.89), which dropped significantly (to 0.85) after omitting the FiO2. The PREM birth condition model, PREM(bcm) (AUC 0.86), and the PREM birth model, PREM(bm) (AUC 0.82), also demonstrated good discrimination. PREM(bm) was superior to other non-therapy-affected scores and to GA, particularly in infants with <750 g birth weight. Congruence of risk categorization was low, especially among higher-risk cases. Conclusions: The CRIB score had the largest AUC, which resulted from its inclusion of FiO2. PREM(bm), as the most accurate score among those unaffected by early treatment, seems to be a good alternative for strict risk adjustment in NICU auditing. It could be useful to combine scores.
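As an illustration of the performance measure named in the Methods, the area under the ROC curve can be read as the probability that a randomly chosen non-survivor has a higher risk score than a randomly chosen survivor. The following is a minimal sketch under that interpretation, not the software used in the study. The function name `auc`, the assumption that a higher score means higher predicted risk, and the toy data are all hypothetical.

```python
def auc(scores, died):
    """Area under the ROC curve via pairwise (Mann-Whitney) comparison.

    scores: one risk score per infant (assumed: higher = higher predicted risk)
    died:   1 if the infant died, 0 if the infant survived
    """
    pos = [s for s, d in zip(scores, died) if d == 1]
    neg = [s for s, d in zip(scores, died) if d == 0]
    if not pos or not neg:
        raise ValueError("need at least one death and one survivor")
    # Count pairs in which the non-survivor scored higher; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, not taken from the study.
toy_scores = [1, 2, 3, 4, 5, 6, 7, 8]
toy_died = [0, 0, 0, 1, 0, 1, 1, 1]
toy_auc = auc(toy_scores, toy_died)
print(f"toy AUC = {toy_auc:.4f}")  # prints "toy AUC = 0.9375"
```

An AUC of 0.5 corresponds to a score that discriminates no better than chance, and 1.0 to perfect separation of survivors and non-survivors. The pairwise form is quadratic in the number of infants, so a rank-based implementation would be preferable for a full registry cohort.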