The growing availability of online K-12 curriculum is increasing the need for meaningful alignment of this curriculum with state-specific standards. Promising automated and semi-automated alignment tools have recently become available. Unfortunately, recent alignment evaluation studies report low inter-rater reliability, e.g., 32% with two raters and 35 documents. While these results are in line with studies in other domains, low reliability makes it difficult to train automatic systems accurately and complicates comparison of different services. We propose that inter-rater reliability of broadly defined, abstract concepts such as 'alignment' or 'relevance' must be expected to be low due to the real-world complexity of teaching and the multidimensional nature of the curricular documents. Hence, we suggest decomposing these concepts into less abstract, more precise measures anchored in the daily practice of teaching. This article reports on the integration of automatic alignment results into the interface of the TeachEngineering collection and on an evaluation methodology intended to produce more consistent document relevance ratings. Our results (based on 14 raters × 6 documents) show high inter-rater reliability (61-95%) on the less abstract relevance dimensions, while scores on the overall 'relevance' concept are, as expected, lower (64%). Despite a relatively small sample size, regression analysis of our data resulted in an explanatory (R² = .75) and statistically stable (p-values < .05) model of overall relevance as indicated by matching concepts, related background material, adaptability to grade level, and anticipated usefulness of exercises. Our results suggest that a more detailed relevance evaluation, one that includes several dimensions of relevance, would produce better data for comparing and training alignment tools.
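
A minimal sketch of the regression form this model suggests, assuming an ordinary least-squares specification with the four named dimensions as predictors; the variable names and coefficient symbols are illustrative, not reported estimates:

\[
\text{relevance}_i = \beta_0 + \beta_1\,\text{concepts}_i + \beta_2\,\text{background}_i + \beta_3\,\text{grade}_i + \beta_4\,\text{exercises}_i + \varepsilon_i
\]

Here $\text{concepts}_i$, $\text{background}_i$, $\text{grade}_i$, and $\text{exercises}_i$ stand for a rater's scores on matching concepts, related background material, adaptability to grade level, and anticipated usefulness of exercises for document $i$; under this reading, the reported R² = .75 and p-values < .05 refer to the fit and coefficient stability of such a model.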