The user experience of printing web pages has generally been poor. Web pages typically contain content that is neither printworthy nor informative, such as sidebars, footers, headers, advertisements, and auxiliary information for further browsing. Because including such content degrades the web printing experience, we have developed a tool that first selects the main part of the web page automatically and then allows users to make adjustments. In this paper, we describe the algorithm for selecting the main content automatically during the first pass. The web page is first segmented into several coherent areas, or blocks, using our web page segmentation method, which clusters content based on the affinity values between basic elements. The relative importance of each segmented block is then computed from various features, and the main content is extracted subject to two constraints: it must form a single DOM (Document Object Model) sub-tree, and it must have high importance scores. We evaluated our algorithm on 65 web pages and computed its accuracy based on the area of overlap between the ground truth and the extracted result.
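As an illustration of an overlap-based accuracy measure, the following minimal sketch scores one page by comparing a ground-truth rectangle with an extracted rectangle. It rests on assumptions that are not stated in the abstract: both regions are treated as axis-aligned rectangles in page coordinates, and accuracy is taken to be intersection area divided by union area. The names `Rect`, `area`, and `overlap_accuracy` are hypothetical, and the measure actually used in the paper may be defined differently.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle in page coordinates (hypothetical helper type)."""
    left: float
    top: float
    right: float
    bottom: float


def area(r: Rect) -> float:
    """Area of a rectangle; degenerate or inverted rectangles count as zero."""
    return max(0.0, r.right - r.left) * max(0.0, r.bottom - r.top)


def overlap_accuracy(truth: Rect, extracted: Rect) -> float:
    """Intersection-over-union of the two regions (an assumed metric).

    Returns 1.0 for identical regions and 0.0 for disjoint ones.
    """
    intersection = Rect(
        max(truth.left, extracted.left),
        max(truth.top, extracted.top),
        min(truth.right, extracted.right),
        min(truth.bottom, extracted.bottom),
    )
    inter_area = area(intersection)
    union_area = area(truth) + area(extracted) - inter_area
    return inter_area / union_area if union_area > 0 else 0.0


if __name__ == "__main__":
    truth = Rect(0, 0, 100, 200)      # hand-labelled main content
    extracted = Rect(0, 0, 100, 100)  # region returned by the algorithm
    print(overlap_accuracy(truth, extracted))
```

Under this assumed definition, an extraction that covers only part of the main content and one that includes extra surrounding material are both penalized, since either one enlarges the union relative to the intersection.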