For welfare assessment on commercial broiler and turkey farms, it is essential not only to record animal-based indicators but also to evaluate the resulting prevalences or rates. Two evaluation methods, the application of normative values from an evaluation framework and the calculation of a benchmark, were compared using data on welfare indicators collected over 1 year from 11 broiler and 11 turkey farms in Germany. The evaluation framework had recently been developed in a participatory process and provides target and alarm values. The target range was predominantly based on ethical considerations, while the alarm range was aligned with the status quo observed in farm investigations. For the benchmark, the 25th and 75th percentiles were analogously used as the target and alarm thresholds. When the evaluation framework was applied across all indicators and flocks, 30.6% of broiler flocks were in the target range, while 41.4% were in the alarm range, mostly for indicators such as footpad dermatitis, weight uniformity, and mortality. For turkeys, 51.6% of flocks at week 5 and 32.9% at the end of the fattening period were in the target range, while 12.3% and 14.4%, respectively, were in the alarm range. Most alarm classifications were related to footpad dermatitis, low weight uniformity, plumage damage, and skin injuries. For broilers, applying the normative values resulted in a significantly worse average welfare rank across all indicators and flocks than the benchmark did, whereas no difference was observed for turkeys. The farm selection process may have favored turkey farms with better management practices, resulting in a more rigorous benchmark than for broilers. In addition, the farm data used to set the normative values had indicated a poorer status quo in turkeys for certain indicators, resulting in less stringent limits for the alarm range.
This highlights the challenges associated with both evaluation methods: normative values depend on the process and criteria used to set them, while benchmarks depend on the reference population and therefore call for large, regularly updated databases. Normative values, in turn, should be revalidated regularly against developments in the sector and the latest scientific evidence.
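
The percentile-based benchmark classification can be illustrated with a minimal sketch. It is not the procedure used in the study: the function names, the label for the middle class, the handling of values that fall exactly on a threshold, the quartile interpolation method, and the example values are all assumptions made for illustration. The sketch assumes that for an indicator where lower values are better (for example, a prevalence), flocks at or below the 25th percentile of the reference population fall in the target range and flocks above the 75th percentile fall in the alarm range.

```python
from statistics import quantiles


def benchmark_thresholds(values):
    """Return the 25th and 75th percentiles of a reference population.

    Assumption: quartiles use the 'inclusive' interpolation method;
    other percentile definitions give slightly different cut points.
    """
    p25, _median, p75 = quantiles(values, n=4, method="inclusive")
    return p25, p75


def classify(value, p25, p75, lower_is_better=True):
    """Classify one flock value as 'target', 'intermediate', or 'alarm'.

    Assumption: values equal to a threshold count toward the better class.
    """
    if not lower_is_better:
        # Mirror the scale so the same comparisons apply to indicators
        # where higher values are better (e.g. weight uniformity).
        value, p25, p75 = -value, -p75, -p25
    if value <= p25:
        return "target"
    if value > p75:
        return "alarm"
    return "intermediate"  # hypothetical label for the middle 50%


# Hypothetical prevalences (%) of one indicator across five flocks.
reference = [2.0, 4.0, 6.0, 8.0, 10.0]
p25, p75 = benchmark_thresholds(reference)
labels = [classify(v, p25, p75) for v in reference]
```

Because the thresholds are derived from the reference population itself, adding or removing farms shifts both cut points, which is the dependence on the reference population noted above.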