Web content extraction using contextual rules

Pouramini, Ahmad; Nasiri, Shahram

doi:10.1109/kbei.2015.7436183

Cited by 2 publications

(2 citation statements)

References 13 publications


“…Other important fields where template extraction is particularly useful are boilerplate removal [12,16,46], wrapper generation [32,43,60], wrapper induction [40,59], wrapper maintenance [29,39], and automated data extraction (see, e.g., [16,27,31]).…”

Section: Introduction (mentioning)

confidence: 99%

What Web Template Extractor Should I Use? A Benchmarking and Comparison for Five Template Extractors

2019


A Web template is a resource that implements the structure and format of a website, making it ready for plugging content into already formatted and prepared pages. For this reason, templates are one of the main development resources for website engineers, because they increase productivity. Templates are also useful for the final user, because they provide uniformity and a common look and feel for all webpages. However, from the point of view of crawlers and indexers, templates are an important problem, because templates usually contain irrelevant information, such as advertisements, menus, and banners. Processing and storing this information leads to a waste of resources (storage space, bandwidth, etc.). It has been measured that templates represent between 40% and 50% of data on the Web. Therefore, identifying templates is essential for indexing tasks. There exist many techniques and tools for template extraction, but, unfortunately, it is not at all clear which template extractor a user or system should use, because they have never been compared, and because they present different (complementary) features such as precision, recall, and efficiency. In this work, we compare the most advanced template extractors. We implemented and evaluated five of the most advanced template extractors in the literature. To compare all of them, we implemented a workbench, where they have been integrated and evaluated. Thanks to this workbench, we can provide a fair empirical comparison of all methods using the same benchmarks, technology, implementation language, and evaluation criteria.


“…Methods based on blocks mainly contain these kinds of algorithms: document object model (DOM) based page segmentation [5][6][7][8], vision-based page segmentation [9,10], specific tag based page segmentation [11,12], hybrid methods [13], and semantic based page segmentation. DOM based page segmentation uses hierarchical relations in tags to extract the main content [5,14]. Xpath can be used to locate content nodes in html where DOM is a kind of XML [15].…”

Section: Introduction (mentioning)

confidence: 99%

Main Content Extraction from Web Pages Based on Node Characteristics

Liu, Shao, Wu et al. 2017

Journal of Computing Science and Engineering


Main content extraction of web pages is widely used in search engines, web content aggregation and mobile Internet browsing. However, web pages include a large amount of irrelevant information, such as advertisements, irrelevant navigation and junk content. Such irrelevant information reduces the efficiency of web content processing in content-based applications. The purpose of this paper is to propose an automatic main content extraction method of web pages. In this method, we use two indicators to describe characteristics of web pages: text density and hyperlink density. Based on the continuous distribution of similar content on a page, we use an estimation algorithm to judge whether a node is a content node or a noisy node based on characteristics of the node and neighboring nodes. This algorithm enables us to filter advertisement nodes and irrelevant navigation. Experimental results on 10 news websites revealed that our algorithm could achieve a 96.34% average acceptable rate.
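
The two indicators named in the abstract above can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not give the formulas, so the definitions here are assumptions (text density as text characters per tag in a subtree, hyperlink density as the share of text characters that sit inside `<a>` elements), the thresholds are arbitrary, and the neighboring-node estimation step is left out. The names `TreeBuilder`, `block_densities`, and `is_content` are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so the parser must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.own_text = 0  # characters of text directly inside this node


class TreeBuilder(HTMLParser):
    """Builds a minimal element tree that records only tags and text lengths."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close up to the nearest matching open element; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.own_text += len(data.strip())


def subtree_stats(node, inside_link=False):
    """Return (text_chars, link_chars, tag_count) for the subtree at node."""
    inside_link = inside_link or node.tag == "a"
    text = node.own_text
    link = node.own_text if inside_link else 0
    tags = 1
    for child in node.children:
        t, l, g = subtree_stats(child, inside_link)
        text, link, tags = text + t, link + l, tags + g
    return text, link, tags


def block_densities(html):
    """Return (tag, text_density, link_density) for each top-level element."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    result = []
    for child in builder.root.children:
        text, link, tags = subtree_stats(child)
        text_density = text / tags                   # assumed definition
        link_density = link / text if text else 0.0  # assumed definition
        result.append((child.tag, text_density, link_density))
    return result


def is_content(text_density, link_density,
               min_text_density=20.0, max_link_density=0.3):
    """Arbitrary illustrative thresholds, not values from the paper."""
    return text_density >= min_text_density and link_density <= max_link_density
```

Under these assumed definitions, a long paragraph scores a high text density and a hyperlink density near zero, while a navigation list made only of links scores a low text density and a hyperlink density of 1.0, so a simple threshold already separates the two cases.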
