Proceedings of the Twelfth International Conference on World Wide Web - WWW '03 2003
DOI: 10.1145/775152.775182

DOM-based content extraction of HTML documents

Abstract: Web pages often contain clutter (such as pop-up ads, unnecessary images and extraneous links) around the body of an article that distracts a user from actual content. Extraction of "useful and relevant" content from web pages has many applications, including cell phone and PDA browsing, speech rendering for the visually impaired, and text summarization. Most approaches to removing clutter or making content more readable involve changing font size or removing HTML and data components such as images, which takes …
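
To make the idea of working on a document tree rather than raw markup concrete, here is a minimal sketch in Python. It is not the authors' implementation: the `Node` and `TreeBuilder` helpers, the `DROP_TAGS` set, and the `max_link_ratio` threshold of 0.5 are all assumptions chosen for illustration. It builds a small tree with the standard-library `html.parser`, drops elements commonly treated as clutter, and removes subtrees whose text is mostly link text.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Assumed clutter tags for this sketch; a real filter set would be configurable.
DROP_TAGS = {"script", "style", "img", "iframe"}


class Node:
    """A minimal element node; children are Node objects or text strings."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []

    def text(self):
        return "".join(c if isinstance(c, str) else c.text()
                       for c in self.children)

    def link_text(self):
        """Text that sits inside <a> elements within this subtree."""
        if self.tag == "a":
            return self.text()
        return "".join(c.link_text() for c in self.children
                       if isinstance(c, Node))


class TreeBuilder(HTMLParser):
    """Builds a simple element tree from HTML (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def prune(node, max_link_ratio=0.5):
    """Remove clutter tags and subtrees dominated by link text, in place."""
    kept = []
    for child in node.children:
        if isinstance(child, str):
            kept.append(child)
            continue
        if child.tag in DROP_TAGS:
            continue
        total = len(child.text().strip())
        linked = len(child.link_text().strip())
        if child.tag != "a" and total > 0 and linked / total > max_link_ratio:
            continue  # mostly links: treat as a navigation or link list
        prune(child, max_link_ratio)
        kept.append(child)
    node.children = kept


def extract_text(html, max_link_ratio=0.5):
    """Return whitespace-normalized text left after pruning the tree."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root, max_link_ratio)
    return " ".join(builder.root.text().split())


if __name__ == "__main__":
    sample = ('<body><ul><li><a href="/">Home</a></li></ul>'
              '<p>Plain article text with one <a href="/x">link</a>.</p></body>')
    print(extract_text(sample))
```

The link-ratio test is deliberately crude: a page whose body is almost entirely links would be removed wholesale, so a practical filter would need per-site tuning or additional node properties.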

Cited by 234 publications (31 citation statements)
References 9 publications
“…The user query form in the webpage provides a relational database of the website. Gupta et al (2003) discussed about the document object model tree, to achieve identification, content extraction and maintenance of the original data instead of summarizing it. Following, the paper Hammouda and Kamel (2004) presented two main parts of favorably document clustering process.…”
Section: Related Work (mentioning)
confidence: 99%
“…Important content can be extracted by analyzing the organization structure of the DOM tree and the properties of the DOM nodes. Gupta et al (2003) proposed a DOM-based model to extract main texts from HTML web pages. Their approach, working with the Document Object Model tree as opposed to raw HTML markup, can be used to perform main text extraction, identifying and preserving the original data instead of summarizing it.…”
Section: Related Work (mentioning)
confidence: 99%
“…In [2], Gupta et al propose a method to do extraction based on DOM (Document Object Model) tree. Lan Yi et al propose a new tree structure, called Style Tree, which is proposed to capture the actual contents and the common layouts (or presentation styles) of the Web pages in a Web site [1].…”
Section: Related Work (mentioning)
confidence: 99%
“…Gupta et al [2] make use of DOM and prior settings to filter pages. And VIPS algorithm considers layout feature to extract all suitable blocks from DOM tree and then tries to find separators between extracted blocks [10].…”
Section: Related Work (mentioning)
confidence: 99%