2009 International Joint Conference on Artificial Intelligence 2009
DOI: 10.1109/jcai.2009.156

A Novel Method to Extract Informative Blocks from Web Pages

Abstract: This paper proposes a novel algorithm to extract the informative blocks from web pages and filter out advertisements that have nothing to do with the subject of the page being browsed. In this paper, we use HTML Parser to construct a DOM tree and apply corresponding rules to construct a new tree (CST), which helps us separate the "primary content blocks" from the other blocks. We then use our algorithm to analyze the CST and trim off the useless blocks on it. The algorithms can ident…

Citations: cited by 7 publications (11 citation statements)
References: 10 publications
“…Li et al use tags, such as <body>, <div> and <table>, as the standard of classification of the Web pages, as they believe that these tags fully reflect the visual effects of pages. So, different values are assigned to each tag and the total value is calculated which is used to decide whether a block is noise [3]. C. Kim and K. Shim propose a novel algorithm to extract template from numerous documents constructed by heterogeneous template [4].…”
Section: Content Extraction (mentioning)
confidence: 99%
“…Then, the block with the highest blockimportance value is determined as the main/content block. Examples are approaches proposed by Tseng & Kao (2006) and Li & Yang (2009). Tseng & Kao (2006) proposed some features for measuring the importance of a block, i.e.…”
Section: Related Work (mentioning)
confidence: 99%
“…The importance of a block is defined as the product of the three measures. Different from Tseng & Kao (2006), Li & Yang (2009) employ the attenuation quotient and the importance of HTML item and content item. The attenuation quotient is used to decrease the node importance when a node is closer to the root.…”
Section: Related Work (mentioning)
confidence: 99%
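
The citation statements above describe block scoring only in outline: each tag gets a value, a content measure is combined with it, and an attenuation quotient lowers the importance of nodes near the root. The sketch below is one possible reading of that outline, not the paper's algorithm. The tag weights, the `attenuation` formula, the use of text length as the content measure, and the `Node` class are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical tag weights; the values used in the paper are not given here.
TAG_WEIGHT = {"body": 1.0, "div": 1.0, "table": 1.0}
DEFAULT_WEIGHT = 0.5


@dataclass
class Node:
    """A simplified DOM node (assumed structure, not the paper's CST)."""
    tag: str
    text: str = ""
    children: list[Node] = field(default_factory=list)


def text_length(node: Node) -> int:
    """Total number of text characters in the subtree rooted at node."""
    return len(node.text) + sum(text_length(c) for c in node.children)


def attenuation(depth: int) -> float:
    """Assumed attenuation: smaller near the root, approaching 1 with depth."""
    return (depth + 1) / (depth + 2)


def score_blocks(root: Node) -> list[tuple[float, Node]]:
    """Score every node as tag weight * subtree text length * attenuation."""
    scored = []
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        weight = TAG_WEIGHT.get(node.tag, DEFAULT_WEIGHT)
        scored.append((weight * text_length(node) * attenuation(depth), node))
        stack.extend((child, depth + 1) for child in node.children)
    return scored


def main_block(root: Node) -> Node:
    """Return the node with the highest importance score."""
    return max(score_blocks(root), key=lambda pair: pair[0])[1]


nav = Node("div", text="nav")
article = Node("div", text="x" * 100)
page = Node("body", children=[nav, article])
best = main_block(page)
```

With these made-up numbers, `body` contains all the text but is attenuated at depth 0, so the `article` block one level down scores higher and is selected. Without some attenuation, the root would always win under a subtree-based content measure.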