In this paper, we address the automatic identification of common functional structures on web pages, a fundamental problem for web automation and graphical user interface testing. In contrast to other approaches, we aim to identify relevant patterns without relying on a page's source code or on keywords, using mostly geometrical and visually perceptible properties. To this end, we transform pages into an independent geometrical representation and extract from it a set of features that allows us to apply traditional machine learning techniques to the identification task. We evaluate the approach on three typical scenarios, reviewing the resulting key information retrieval metrics and estimating the relevance of the chosen features. Our initial results demonstrate the feasibility of the proposed approach.
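As a rough illustration of what features derived from a geometrical representation might look like, the following minimal sketch computes a few page-relative quantities from the bounding box of a rendered element. The `Box` type, the `geometric_features` function, and the particular features are assumptions made for this example only; they are not the feature set used in this work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a rendered element, in pixels (hypothetical type)."""
    x: float
    y: float
    width: float
    height: float


def geometric_features(box: Box, page_width: float, page_height: float) -> dict[str, float]:
    """Return page-relative geometric features for one element.

    The feature choice is illustrative; none of it depends on source code or keywords.
    """
    return {
        "rel_x": box.x / page_width,            # horizontal position, 0..1
        "rel_y": box.y / page_height,           # vertical position, 0..1
        "rel_width": box.width / page_width,    # share of the page width
        "rel_height": box.height / page_height, # share of the page height
        "aspect_ratio": box.width / box.height if box.height else 0.0,
        "area_fraction": (box.width * box.height) / (page_width * page_height),
    }


# A full-width, 80-pixel-high strip at the top of a 1280 x 2000 pixel page.
features = geometric_features(Box(0, 0, 1280, 80), page_width=1280, page_height=2000)
```

Feature vectors of this kind, one per element, could then be passed to any conventional classifier; the normalization by page size is a design choice that keeps the values comparable across pages of different dimensions.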