“…We evaluate models using nDCG@10, MAP, Recall at rank k with k in {10, 50, 100} (R@k). Additionally, we compute three measures specifically designed for the task of CS: True Negative Rate at 95% Recall (TNR@95%) [40,41], normalised Precision at 95% Recall (nP@95%) [41], and average position at which the last relevant item is found [30,31,32], calculated as a percentage of the dataset size, where a lower value indicates better performance (Last Rel).…”
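
To make the rank-based measures more concrete, the following is a minimal Python sketch of how R@k, TNR@95% and Last Rel could be computed for a single ranked list. It is an illustration under stated assumptions, not the authors' implementation: the function names and the binary-label input format are hypothetical, and the 95% recall cutoff is assumed to be the smallest rank at which recall reaches the target. nP@95% is left out because the excerpt does not give its formula (see [41]).

```python
def recall_at_k(ranked_labels, k):
    """Fraction of all relevant items that appear in the top k positions.

    ranked_labels: list of 0/1 relevance labels in ranked order (assumed format).
    """
    total_relevant = sum(ranked_labels)
    if total_relevant == 0:
        raise ValueError("no relevant items in the ranking")
    return sum(ranked_labels[:k]) / total_relevant


def cutoff_for_recall(ranked_labels, target=0.95):
    """Smallest rank (1-based) at which recall reaches the target (assumption)."""
    total_relevant = sum(ranked_labels)
    if total_relevant == 0:
        raise ValueError("no relevant items in the ranking")
    found = 0
    for rank, label in enumerate(ranked_labels, start=1):
        found += label
        if found / total_relevant >= target:
            return rank
    return len(ranked_labels)


def tnr_at_recall(ranked_labels, target=0.95):
    """Share of non-relevant items ranked below the recall cutoff.

    Items after the cutoff are treated as predicted negatives, so the
    non-relevant ones among them count as true negatives.
    """
    cutoff = cutoff_for_recall(ranked_labels, target)
    negatives = len(ranked_labels) - sum(ranked_labels)
    if negatives == 0:
        raise ValueError("no non-relevant items in the ranking")
    true_negatives = sum(1 for label in ranked_labels[cutoff:] if label == 0)
    return true_negatives / negatives


def last_rel(ranked_labels):
    """Position of the last relevant item as a percentage of the list size."""
    positions = [rank for rank, label in enumerate(ranked_labels, start=1) if label]
    if not positions:
        raise ValueError("no relevant items in the ranking")
    return 100 * positions[-1] / len(ranked_labels)


# Toy ranking of 10 items with relevant items at ranks 1, 3, 4 and 7.
labels = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
print(recall_at_k(labels, 3))   # recall within the top 3
print(tnr_at_recall(labels))    # TNR at the 95% recall cutoff
print(last_rel(labels))         # last relevant position, in percent
```

In this toy ranking the 95% recall target is only reached once all four relevant items have been seen, so the cutoff falls at rank 7. The per-list Last Rel values would then be averaged across the evaluated collections to obtain the reported figure.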