For many topics, the World Wide Web contains hundreds or thousands of relevant documents of widely varying quality. Users face a daunting challenge in identifying a small subset of documents worthy of their attention. Link analysis algorithms have received much interest recently, in large part for their potential to identify high quality items.We report here on an experimental evaluation of this potential. We evaluated a number of link and content-based algorithms using a dataset of web documents rated for quality by human topic experts. Link-based metrics did a good job of picking out high-quality items. Precision at 5 is about 0.75, and precision at 10 is about 0.55; this is in a dataset where 0.32 of all documents were of high quality. Surprisingly, a simple content-based metric performed nearly as well; ranking documents by the total number of pages on their containing site.
For many purposes, the Web page is too small a unit of interaction and analysis. Web sites are structured multimedia documents consisting of many pages, and users often are interested in obtaining and evaluating entire collections of topically related sites. Once such a collection is obtained, users face the challenge of exploring, comprehending, and organizing the items. We report four innovations that address these user needs: (1) we replaced the Web page with the Web site as the basic unit of interaction and analysis; (2) we defined a new information structure, the clan graph, that groups together sets of related sites; (3) we augment the representation of a site with a site profile, information about site structure and content that helps inform user evaluation of a site; and (4) we invented a new graph visualization, the auditorium visualization, that reveals important structural and content properties of sites within a clan graph. Detailed analysis and user studies document the utility of this approach. The clan graph construction algorithm tends to filter out irrelevant sites and discover additional relevant items. The auditorium visualization, augmented with drill-down capabilities to explore site profile data, helps users to find high-quality sites as well as sites that serve a particular function.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.