2016
DOI: 10.1007/978-3-319-41579-6_4

Site-Level Web Template Extraction Based on DOM Analysis

Abstract: Web templates are one of the main development resources for website engineers. Templates allow them to increase productivity by plugging content into already formatted and prepared pagelets. For the final user, templates are also useful because they provide uniformity and a common look and feel across all webpages. However, from the point of view of crawlers and indexers, templates are an important problem, because they usually contain irrelevant information such as advertisements, menus, and banners…

Citations: cited by 3 publications (20 citation statements)
References: 16 publications (24 reference statements)

“…• Limitations/Problems: The main limitation is that the evaluation was done with only 10 websites, and also the 24 webpages used were not randomly selected, but all of them implementing the template (this scenario is easier for a template extractor). TemEx (2015) [2]: As in RTDM-TD and RBM-TD, this algorithm also uses a mapping between the DOM trees to determine what nodes belong to all webpages, but it does not force a node to belong to all webpages, only to a subset, as in SST. In this respect, this algorithm is more democratic, because it uses a number of votes to determine that a node has been repeated in enough webpages to be considered as part of the template.…”
Section: Search Results (mentioning)
confidence: 99%
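
The voting idea described in the statement above can be illustrated with a minimal sketch. This is not TemEx's implementation: the tree mapping between DOM trees is replaced here by plain path matching, and the `(tag, children)` page encoding, the function names `node_paths` and `template_nodes`, and the `min_votes` parameter are all assumptions made for illustration only.

```python
from collections import Counter

def node_paths(tree, prefix=""):
    """Yield a path signature for every node of a (tag, children) tree."""
    tag, children = tree
    path = f"{prefix}/{tag}"
    yield path
    counts = Counter()
    for child_tag, grandchildren in children:
        # Index same-tag siblings so repeated tags get distinct signatures.
        counts[child_tag] += 1
        indexed = f"{child_tag}[{counts[child_tag]}]"
        yield from node_paths((indexed, grandchildren), path)

def template_nodes(pages, min_votes):
    """Return signatures that occur in at least `min_votes` pages."""
    votes = Counter()
    for page in pages:
        # Each page casts at most one vote per signature.
        votes.update(set(node_paths(page)))
    return {path for path, n in votes.items() if n >= min_votes}

# Three toy pages (hypothetical data) sharing a navigation bar.
pages = [
    ("html", [("nav", []), ("div", [("p", [])])]),
    ("html", [("nav", []), ("div", [("img", [])])]),
    ("html", [("nav", []), ("article", [])]),
]

print(sorted(template_nodes(pages, min_votes=2)))
# ['/html', '/html/div[1]', '/html/nav[1]']
print(sorted(template_nodes(pages, min_votes=3)))
# ['/html', '/html/nav[1]']
```

Setting `min_votes` equal to the number of pages forces a node to appear in every page, while a lower threshold accepts nodes shared by only a subset of pages, which is the distinction the statement draws.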
“…The reason is that each technique has been implemented with a different language and with different components that affect the efficiency. But also, they have been evaluated with different evaluation criteria (e.g., counting retrieved words [53,54] vs. characters [25] vs. DOM nodes [2] vs. text blocks [47,55]) and with a different collection of benchmarks. Using different benchmarks to compare template extractors is unacceptable because some techniques used artificial [6] (automatically generated webpages sharing exactly the same template) and others used real webpages implemented by heterogeneous designers [2,54].…”
Section: Introduction (mentioning)
confidence: 99%
“…6.1 Web page of www.lemonde.fr's website and its main content (extracted with our web content extraction tool)…”
Section: 1 (mentioning)
confidence: 99%