2011
DOI: 10.1109/tkde.2010.140

TEXT: Automatic Template Extraction from Heterogeneous Web Pages

Abstract: World Wide Web is the most useful source of information. In order to achieve high productivity of publishing, the webpages in many websites are automatically populated by using the common templates with contents. The templates provide readers easy access to the contents guided by consistent structures. However, for machines, the templates are considered harmful since they degrade the accuracy and performance of web applications due to irrelevant terms in templates. Thus, template detection techniques have rece…

Cited by 44 publications (24 citation statements)
References 22 publications

“…So, different values are assigned to each tag and the total value is calculated which is used to decide whether a block is noise [3]. C. Kim and K. Shim propose a novel algorithm to extract template from numerous documents constructed by heterogeneous template [4].…”
Section: Content Extraction (mentioning)
confidence: 99%
“…Now, assume a web application's page w ∈ W, we categorize it into some of likely template (L) and recover various output parameters (O) consequently. Procedure for drawing out patterns of web page or template embedded in web page has been obtainable in current state-of-techniques [13,14]. Although, we utilize some of the techniques from TEXT [13] that highlights the DOM tree of a page reflecting the required paths.…”
Section: Php Web Page Representation (mentioning)
confidence: 99%
“…Procedure for drawing out patterns of web page or template embedded in web page has been obtainable in current state-of-techniques [13,14]. Although, we utilize some of the techniques from TEXT [13] that highlights the DOM tree of a page reflecting the required paths. Our web page pattern mining technique comprises subsequent four stages.…”
Section: Php Web Page Representation (mentioning)
confidence: 99%
“…Note in the previous model this search space is equal to all other clusters in set C. Now we can start to find the first best merging pair in GetInitBestPair method. To calculate the MDL cost we have to calculate the number of 1's in M_T, -1 & +1 in M_Δ according to Lemma 3 [8]. And finally we can merge the two clusters whose MDL cost is the minimum one.…”
Section: International Journal Of Computer Applications (0975-8887) (mentioning)
confidence: 99%
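
The merging step quoted above can be illustrated with a minimal sketch. It is not the cost function of the cited paper: as a labeled stand-in, it charges one unit for every 1 in a template column and one unit for every non-zero correction entry, and the names `cluster_cost` and `greedy_merge` are hypothetical. Under those assumptions it repeatedly merges the pair of clusters that lowers the total cost the most.

```python
from itertools import combinations


def cluster_cost(cluster, doc_paths):
    """Stand-in cost of one cluster (an assumption, not the paper's formula).

    For each path, either put it in the cluster template (1 unit, plus one
    correction per member document lacking it) or leave it out (one
    correction per member document having it), whichever is cheaper.
    """
    members = [doc_paths[k] for k in cluster]
    n = len(members)
    cost = 0
    for path in set().union(*members):
        having = sum(1 for paths in members if path in paths)
        cost += min(having, 1 + (n - having))
    return cost


def greedy_merge(doc_paths):
    """Agglomerative clustering: merge the best pair while the cost drops."""
    clusters = [[k] for k in range(len(doc_paths))]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            gain = (cluster_cost(clusters[i], doc_paths)
                    + cluster_cost(clusters[j], doc_paths)
                    - cluster_cost(clusters[i] + clusters[j], doc_paths))
            if best is None or gain > best[0]:
                best = (gain, i, j)
        gain, i, j = best
        if gain <= 0:  # no merge reduces the total cost, so stop
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Four toy documents, each given as a set of path strings (made-up data).
docs = [
    {"html/body/h1", "html/body/p", "a"},
    {"html/body/h1", "html/body/p", "b"},
    {"html/table", "html/table/tr", "c"},
    {"html/table", "html/table/tr", "d"},
]
result = greedy_merge(docs)
print(result)  # [[0, 1], [2, 3]]
```

Evaluating every pair in each round is quadratic in the number of clusters, which is the search space the quoted statement refers to when it mentions comparing against all other clusters in set C.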
“…To simplify the clustering calculation the web document is represented in DOM tree format [5]. After that essential paths of this web document is extracted and represented into the M_E matrix [8]. Each column & row in M_E represents the document and the essential paths of that document respectively.…”
Section: Example (mentioning)
confidence: 99%
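
The representation described in this statement can be sketched as follows. This is a minimal illustration rather than the cited implementation: it collects root-to-node paths with Python's standard `html.parser`, and it treats a path as "essential" when it occurs in at least `min_support` documents, which is a simplifying assumption. The names `PathCollector`, `dom_paths`, and `build_matrix` are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Collects the set of root-to-node paths seen in one HTML document."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Text nodes are treated as leaves of the path.
            self.paths.add("/".join(self.stack + [text]))


def dom_paths(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def build_matrix(docs, min_support=2):
    """Rows are paths, columns are documents; an entry is 1 if the document
    contains the path. min_support is an assumed, simplified threshold."""
    doc_paths = [dom_paths(d) for d in docs]
    support = {}
    for paths in doc_paths:
        for p in paths:
            support[p] = support.get(p, 0) + 1
    rows = sorted(p for p, s in support.items() if s >= min_support)
    matrix = [[1 if p in paths else 0 for paths in doc_paths] for p in rows]
    return rows, matrix


# Three toy pages (made-up data): two share a template, one differs.
pages = [
    "<html><body><h1>Tech</h1><p>a</p></body></html>",
    "<html><body><h1>Tech</h1><p>b</p></body></html>",
    "<html><body><h1>World</h1></body></html>",
]
rows, matrix = build_matrix(pages)
for path, row in zip(rows, matrix):
    print(path, row)
```

Each row of the printed matrix corresponds to one path and each column to one page, matching the row and column roles described in the statement.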