This paper explores the effectiveness of different semantic web page segmentation algorithms on modern websites. We compare three known algorithms, each serving as an example of a particular approach to the problem, and one self-developed algorithm, WebTerrain, that combines two of the approaches. Using our testing framework, we compared the performance of the four algorithms on a large benchmark that we constructed. We examined each algorithm in a total of eight configurations (varying the dataset, the evaluation metric, and the type of input HTML document). We found that, on average, all algorithms performed better on random pages than on popular pages, and that results are better when the algorithms run on the HTML obtained from the DOM rather than on the plain HTML. Overall, there is much room for improvement: the best average F-score we find is 0.49, indicating that for modern websites currently available algorithms are not yet of practical use.
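To make the reported F-score concrete, the following is a minimal sketch of how precision, recall and F-score could be computed for a single page under an exact-match comparison. It is not the paper's evaluation code: the function name `f_score`, the representation of a segment as a frozenset of text-node ids, and the exact-match rule are all assumptions made for illustration.

```python
def f_score(predicted, truth):
    """Return (precision, recall, F1) for exact-match segment comparison.

    Assumption: each segment is a frozenset of text-node ids, and a predicted
    segment counts as correct only if it equals a ground-truth segment.
    """
    predicted, truth = set(predicted), set(truth)
    hits = len(predicted & truth)  # segments found in both sets
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical page: three ground-truth segments, two predicted segments,
# of which only one matches exactly.
truth = {frozenset({1, 2}), frozenset({3}), frozenset({4, 5})}
predicted = {frozenset({1, 2}), frozenset({3, 4})}
precision, recall, f1 = f_score(predicted, truth)
```

In this hypothetical case precision is 1/2 and recall is 1/3, so the F-score is 0.4. Averaging such per-page scores over a dataset would yield a figure comparable in kind to the 0.49 quoted above, although the paper's own metrics may be defined differently.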