The ClueWeb09 dataset and its successor, the ClueWeb12 dataset, are two of the largest collections of Web pages released by Text REtrieval Conference (TREC). The ClueWeb datasets were used in various tracks of TREC ran through 2009 to 2017. For every year, approximately 50 new queries are released and a pool of Web pages are judged against these queries by human assessors as relevant, non-relevant or spam. In this article, a ground truth for binary classification (spam vs non-spam) is constructed from Web pages that are judged as spam or relevant under the assumption that a Web page judged as relevant for any query cannot be spam. Based on this ground truth, we evaluate classification performances of the Waterloo spam rankings (Fusion, Britney, GroupX and UK2006), which have been traditionally used to identify and filter spam pages in retrieval systems. The experimental results in terms of the universal binary classification evaluation measures suggest that the Fusion (with threshold = 11%) is the best for the ClueWeb09 dataset. Analysis of the frequency distributions of relevant/spam documents over spam scores reveals that the GroupX is the most powerful at identifying relevant documents, whereas the Fusion is the most powerful at identifying spam documents. It is also confirmed that the effectiveness of the Fusion spam ranking of the ClueWeb12 dataset is not as good as that of the ClueWeb09.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2025 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.