Existing IR measures for offline evaluation incorporate relevance labels directly into their computation, where labels are assigned at the level of whole documents. This direct dependency makes the measures highly reliant on the completeness of the labels; consequently, measure values are sensitive to missing labels, which results in poor robustness and reusability. To mitigate this, we propose a novel evaluation approach that constructs an intermediate layer between the labels and the measure, improving robustness and reusability by weakening the direct dependency and by taking document content into account during measure computation. In particular, we estimate a language model from a selected set of relevant documents to construct a ground truth, and then compute measures from the divergence between the search result and this ground truth. To further reduce labeling effort and improve efficiency, we separately select the representative documents, the query set, and the topic terms involved in the evaluation before computing the measure. Preliminary experiments on the diversity tasks of the TREC Web Track 2009-2012, using ClueWeb09-A as the document collection, show that with as little as 30% of the judgments our approach almost exactly reconstructs the original system rankings determined by α-nDCG, ERR-IA, and NRBP.
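As a minimal illustration of the divergence-based scoring described above, one possible instantiation (not necessarily the one used in the paper) compares unigram language models with the KL divergence; here $\theta_G$ denotes a ground-truth model estimated from the selected relevant documents, $\theta_R$ a model estimated from the top of the ranked result list, and $t$ ranges over the selected topic terms, all of which are assumed notation for this sketch:

% Sketch under assumed notation: \theta_G is estimated from the selected relevant
% documents (ground truth), \theta_R from the returned result list; the system
% score could, for instance, be the negative KL divergence between the two models
% restricted to the selected topic terms t.
\[
  \operatorname{score}(R) \;=\; -\,D_{\mathrm{KL}}\!\left(\theta_G \,\|\, \theta_R\right)
  \;=\; -\sum_{t} P(t \mid \theta_G)\,
         \log \frac{P(t \mid \theta_G)}{P(t \mid \theta_R)} .
\]

Under such a formulation, a result list whose term distribution stays close to the ground-truth model receives a higher score, without requiring a relevance label for every retrieved document.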