DOM-based content extraction of HTML documents

Gupta, Suhit; Kaiser, Gail E.; Neistadt, David; Grimm, Peter

doi:10.1145/775152.775182

Cited by 234 publications

(31 citation statements)

References 9 publications


“…The user query form in the webpage provides a relational database of the website. Gupta et al (2003) discussed about the document object model tree, to achieve identification, content extraction and maintenance of the original data instead of summarizing it. Following, the paper Hammouda and Kamel (2004) presented two main parts of favorably document clustering process.…”

Section: Related Work (mentioning)

confidence: 99%

A Layout Based Detachment Approach for Extracting Content from Webpages

Chandran, Vijendran

2015

American Journal of Applied Sciences


An enormous amount of useful information on the Internet is formatted for web users, but extracting the relevant data from various web sources is a complex task. Recently, various approaches for extracting data from webpages have been proposed. This study provides a simple but effective approach, named Layout Based Detachment Approach (LBDA). The proposed approach extracts the main content from the webpage by removing irrelevant information such as header and footer contents, navigation bars, advertisements and other noisy images. The proposed methodology uses the following techniques: tag tree parsing to get the analysis structure, a block-acquiring page segmentation method to remove unwanted tags, and data extraction to retrieve the necessary contents. The proposed approach eliminates noise, performs effective extraction of the main content blocks from the webpage, and displays the essential content to users. The performance of the proposed approach is evaluated using metrics such as accuracy, precision, recall, execution time and memory usage. The implementation results show that the proposed LBDA approach performs better than the existing heuristic approach.
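The noise-removal idea in this abstract (drop header, footer, navigation and similar regions, keep the rest) can be illustrated with a minimal sketch. This is not the LBDA algorithm itself: the tag set `NOISE_TAGS` and the names `MainTextExtractor` and `extract_main_text` are assumptions made for illustration, and the sketch simply skips text nested inside those tags using Python's standard `html.parser`.

```python
from html.parser import HTMLParser

# Assumed set of "noise" containers. The abstract names headers, footers,
# navigation bars and advertisements but does not list specific tags.
NOISE_TAGS = {"header", "footer", "nav", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside any noise tag."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0  # number of noise containers currently open
        self.chunks = []      # text fragments kept as main content

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.noise_depth == 0:
            self.chunks.append(text)


def extract_main_text(html):
    """Return the text of `html` with noise-tag subtrees skipped."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    page = (
        "<html><body><nav>Home | About</nav>"
        "<div><p>Main story text.</p></div>"
        "<footer>Copyright</footer></body></html>"
    )
    print(extract_main_text(page))
```

A real page segmenter would also have to handle layouts that mark noise with generic `div` or `table` elements, which this tag-name filter does not attempt.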


“…Important content can be extracted by analyzing the organization structure of the DOM tree and the properties of the DOM nodes. Gupta et al (2003) proposed a DOM-based model to extract main texts from HTML web pages. Their approach, working with the Document Object Model tree as opposed to raw HTML markup, can be used to perform main text extraction, identifying and preserving the original data instead of summarizing it.…”

Section: Related Work (mentioning)

confidence: 99%

An FAR-SW based approach for webpage information extraction

Zhang, Xia, et al.

2013

Inf Syst Front


Automatically identifying and extracting the target information of a webpage, especially main text, is a critical task in many web content analysis applications, such as information retrieval and automated screen reading. However, compared with typical plain texts, the structures of information on the web are extremely complex and have no single fixed template or layout. On the other hand, the amount of presentation elements on web pages, such as dynamic navigational menus, flashing logos, and a multitude of ad blocks, has increased rapidly in the past decade. In this paper, we have proposed a statistics-based approach that integrates the concept of fuzzy association rules (FAR) with that of sliding window (SW) to efficiently extract the main text content from web pages. Our approach involves two separate stages. In Stage 1, the original HTML source is pre-processed and features are extracted for every line of text; then, a supervised learning is performed to detect fuzzy association rules in training web pages. In Stage 2, necessary HTML source preprocessing and text line feature extraction are conducted the same way as that of Stage 1, after which each text line is tested whether it belongs to the main text by extracted fuzzy association rules. Next, a sliding window is applied to segment the web page into several potential topical blocks. Finally, a simple selection algorithm is utilized to select those important blocks that are then united as the detected topical region (main texts). Experimental results on real world data show that the efficiency and accuracy of our approach are better than existing Document Object Model (DOM)-based and Vision-based approaches.


“…In [2], Gupta et al propose a method to do extraction based on DOM (Document Object Model) tree. Lan Yi et al propose a new tree structure, called Style Tree, which is proposed to capture the actual contents and the common layouts (or presentation styles) of the Web pages in a Web site [1].…”

Section: Related Work (mentioning)

confidence: 99%

“…Gupta et al [2] make use of DOM and prior settings to filter pages. And VIPS algorithm considers layout feature to extract all suitable blocks from DOM tree and then tries to find separators between extracted blocks [10].…”

Section: Related Work (mentioning)

confidence: 99%

A Novel Method to Extract Informative Blocks from Web Pages

Yang

2009

2009 International Joint Conference on Artificial Intelligence


This paper proposes a novel algorithm to extract the informative blocks from web pages and filter out advertisements that have nothing to do with the subject of the page being browsed. In this paper, we use HTML Parser to construct a DOM tree and apply corresponding rules to construct a new tree (CST), which helps us separate the "primary content blocks" from the other blocks. We then use our algorithm to analyze the CST and trim off the useless blocks on it. The algorithm identifies primary content blocks by looking for the blocks that contain much more block content than the others. Our system can extract web content laid out in either the Table format or the Div format. Finally, experiments on a set of more than a thousand web pages from 5 different sites show that the method is practical and can achieve high accuracy.
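The selection rule described in this abstract, picking the block that holds the most content, can be sketched as follows. This is a rough illustration and not the paper's CST construction: it assumes that blocks are `div` and `table` elements, that each text fragment is credited to the innermost open block, and that "content" is measured in characters. The names `BlockTextCounter` and `primary_block_text` are hypothetical.

```python
from html.parser import HTMLParser

# Assumed block-level containers; the abstract mentions Table and Div layouts.
BLOCK_TAGS = {"div", "table"}


class BlockTextCounter(HTMLParser):
    """Credit each text fragment to the innermost open block element."""

    def __init__(self):
        super().__init__()
        self.stack = []   # indices into self.blocks for currently open blocks
        self.blocks = []  # one list of text fragments per block, in document order

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.blocks.append([])
            self.stack.append(len(self.blocks) - 1)

    def handle_endtag(self, tag):
        # Simplification: any closing block tag closes the innermost open block.
        if tag in BLOCK_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.blocks[self.stack[-1]].append(text)


def primary_block_text(html):
    """Return the text of the block with the most characters, or '' if none."""
    parser = BlockTextCounter()
    parser.feed(html)
    parser.close()
    if not parser.blocks:
        return ""
    best = max(parser.blocks, key=lambda chunks: sum(len(c) for c in chunks))
    return " ".join(best)


if __name__ == "__main__":
    page = (
        "<div>Ad</div>"
        "<div><p>A long article paragraph about DOM trees.</p><p>More text.</p></div>"
        "<table><tr><td>Links</td></tr></table>"
    )
    print(primary_block_text(page))
```

Crediting text to the innermost block keeps an outer wrapper `div` from always winning; a fuller implementation would likely also weigh link density or other block features, which the abstract does not specify.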

