2016
DOI: 10.1145/2897350.2897353

Web Content Extraction

Abstract: In this paper, we present a meta-analysis of several Web content extraction algorithms, and make recommendations for the future of content extraction on the Web. First, we find that nearly all Web content extractors do not consider a very large, and growing, portion of modern Web pages. Second, it is well understood that wrapper induction extractors tend to break as the Web changes; heuristic/feature engineering extractors were thought to be immune to a Web site's evolution, but we find that this is not the c…

Cited by 15 publications (3 citation statements)
References 25 publications
“…Thus, it is necessary to let the Web browser extract the interaction relevant elements of a Web page and observe changes in their properties. However, the introspection of elements in combination with the dynamics of Web pages are not yet reflected in Web page data extraction research, which is mostly focused on pure content extraction [16,83]. Recent work describes methods that allow for observation of preselected, dynamic elements [12,53].…”
Section: Methods For Interface Introspection
Citation type: mentioning
confidence: 99%
“…Most crawling techniques are based on the analysis of the HTML code of the Web pages, disregarding many aspects that have an impact on the resulting page, such as client side scripts, asynchronous functions, or CSS styles, some of which may even alter the DOM tree and contents of the HTML page [95]. Furthermore, crawlers are usually neglecting elements that may contribute to a more efficient and effective performance, such as HTML5 semantic tags or microformats, amongst others.…”
Section: Crawling Paths Learning
Citation type: mentioning
confidence: 99%
“…Note that this also results in better scalability, which is another main concern [70]. -Automatically update the crawled pages to improve their freshness [19,31,71,73,95,104].…”
Section: Performance Measures
Citation type: mentioning
confidence: 99%