Content-intensive web sites, such as Google or Amazon, paginate their results to accommodate limited screen sizes. Human users and automatic tools alike must therefore traverse the pagination links when they crawl a site, extract data, or automate common tasks, since these applications require access to the entire result set. Previous approaches, as well as existing crawlers and automation tools, rely on simple heuristics (e.g., considering only the link text) and fall back to an exhaustive exploration of the site where those heuristics fail. In particular, focused crawlers and data extraction systems target only a fraction of the individual pages of a given site, making a highly accurate identification of pagination links essential to avoid the exhaustive exploration of irrelevant pages. We identify pagination links in a wide range of domains and sites with near-perfect accuracy (99%). We obtain these results with BERyL, a novel framework for web block classification that combines rule-based reasoning for feature extraction with machine learning for feature selection and classification. Through this combination, BERyL is applicable in a wide range of settings and can be adjusted to maximise either precision, recall, or speed. We illustrate how BERyL minimises the effort for feature extraction and evaluate the impact of a broad range of features (content, structural, and visual).

The research leading to these results has received funding from the European Research Council under the European Community's Seventh Framework Programme (FP7/2007-2013) / ERC grant agreement DIADEM, no. 246858.
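To make the combination concrete, the following is a minimal sketch, not the paper's actual system: rule-based predicates extract boolean features from candidate link elements, and a learned model (here a hand-set linear scorer standing in for the ML stage) classifies each candidate as a pagination link or not. All names, rules, and weights are illustrative assumptions.

```python
# Hypothetical sketch of rule-based feature extraction feeding a
# learned classifier; rules, weights, and field names are illustrative.

# Each rule maps a candidate link (dict of text and attributes) to a
# boolean feature, mirroring the rule-based feature-extraction stage.
RULES = {
    "text_is_number": lambda link: link["text"].strip().isdigit(),
    "text_is_next":   lambda link: link["text"].strip().lower()
                                   in {"next", ">", ">>", "more"},
    "in_pager_block": lambda link: "pag" in link.get("class", "").lower(),
    "short_text":     lambda link: len(link["text"].strip()) <= 4,
}

def extract_features(link):
    """Apply every rule; return a 0/1 feature vector."""
    return [int(rule(link)) for rule in RULES.values()]

# Stand-in for the ML stage: a linear scorer whose weights would
# normally be learned from labelled examples (hand-set here).
WEIGHTS = [2.0, 3.0, 2.5, 0.5]
THRESHOLD = 2.0

def is_pagination_link(link):
    score = sum(w * f for w, f in zip(WEIGHTS, extract_features(link)))
    return score >= THRESHOLD

candidates = [
    {"text": "2", "class": "pagination"},
    {"text": "Next >", "class": "pager-next"},
    {"text": "Privacy policy", "class": "footer"},
]
print([is_pagination_link(c) for c in candidates])  # → [True, True, False]
```

In the real framework, content, structural, and visual features would all enter the feature vector, and the classifier and feature selection would be trained rather than hand-tuned.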