2009
DOI: 10.1007/s11280-009-0059-3

On Finding Templates on Web Collections

Abstract: Templates are pieces of HTML code common to a set of web pages, usually adopted by content providers to enhance the uniformity of layout and navigation of their Web sites. They are usually generated using authoring/publishing tools or by programs that build HTML pages to publish content from a database. In spite of their usefulness, the content of templates can negatively affect the quality of results produced by systems that automatically process information available in web sites, such as search engines, clu…

Cited by 17 publications (32 citation statements)
References 26 publications
“…Despite the general belief that template removal may improve the quality of search results (VIEIRA; PINTO, 2006; YI; LIU; LI, 2003), our experiments suggest the improvements are not consistent across various sites, and that template removal in general does not cause improvements on quality of results provided by search systems. Indeed, in three of the test collections used in the intrasite scenario, the results obtained with template removal were equivalent to results without template removal, for the queries we experimented with.…”
Section: Introduction (contrasting)
confidence: 82%
“…Many authors have hypothesized that templates are typically not related to the main content of the pages and might hurt search quality (BARYOSSEF; RAJAGOPALAN, 2002; VIEIRA; PINTO, 2006; YI; LIU; LI, 2003). Such an idea has motivated the proposal of algorithms to detect and remove templates which led to gains according to preliminary evaluations (BARYOSSEF; RAJAGOPALAN, 2002; VIEIRA; PINTO, 2006; YI; LIU; LI, 2003).…”
Section: Related Work (mentioning)
confidence: 99%
“…This was computed by calculating a CS of size 3. We observed that other techniques such as [4,10,7,9,11] obtain good values of F1 in certain webpages, but they are manually feed with collections of webpages that share the same template. With this conditions, our tool produces an F1 close to 95% in most of the cases.…”
Section: Methods (mentioning)
confidence: 99%
“…Some of them measure the number of words correctly retrieved [10,9]. This can be rather imprecise, because it ignores the structure (e.g., div, table...) retrieved.…”
Section: Methods (mentioning)
confidence: 99%
“…Towards information retrieval, template detection can positively impact performance and resource usage in processes of analysis of HTML pages [5]. Regarding Web page Clustering, templates could help in cluster structurally similar Web pages [4].…”
Section: Template Detection (mentioning)
confidence: 99%