2014
DOI: 10.5120/15186-3546

Template Extraction from Heterogeneous Web Pages with Cosine Similarity

Abstract: The detection of templates across large collections of web pages has recently received considerable attention. Template detection improves the performance of clustering, classification, and search engines. In this work we propose a novel template extraction algorithm based on cosine similarity. We use cosine similarity to cluster the web documents, and then use the underlying structure of the documents to find the template for each cluster. Our experimental evaluation shows that our…
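
The abstract describes clustering web documents by cosine similarity but does not state which features are compared or how clusters are formed. The following is a minimal sketch under stated assumptions, not the authors' algorithm: it assumes each page is represented by a count of its opening HTML tags, and it uses a simple greedy single-pass grouping. The names `tag_vector`, `cosine_similarity`, `cluster_pages`, and the `threshold` value are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts opening tags (assumed structural feature; not from the paper)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def tag_vector(html):
    """Return a tag-frequency vector (as a Counter) for one HTML page."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty page is treated as similar to nothing
    return dot / (norm_a * norm_b)


def cluster_pages(pages, threshold=0.9):
    """Greedy single-pass clustering (an assumed scheme).

    Each page joins the first cluster whose first member is at least
    `threshold` similar to it; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of page indices.
    """
    vectors = [tag_vector(p) for p in pages]
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine_similarity(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

In a pipeline like the one the abstract outlines, the pages within each resulting cluster would then be compared structurally to derive that cluster's template; that step is not sketched here because the abstract gives no detail on it.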

citations
Cited by 3 publications
(2 citation statements)
references
References 10 publications
“…Regarding the template detection techniques, we found in the literature, some of them used artificial benchmarks [25], while others used real heterogeneous web pages [114,10]. Similarly, some authors selected the input web pages randomly [120,116,50], while others provided the input web pages manually [114,113]. Finally, regarding the block detection techniques, we found authors that used well-known benchmark suites such as CleanEval [20] benchmark suite [115,99], MSS [85], L3S-GN1 [61], etc.…”
Section: Conclusion and Future Work Conclusion (citation type: mentioning)
confidence: 99%
“…Using different collections of benchmarks to compare template detectors or content extractors is highly inaccurate because some techniques used artificial benchmarks [25] (automatically generated web pages that share exactly the same template) while others used real heterogenous web pages implemented by different designers [114,10,6]. In the same way, some authors selected the web pages randomly [120,116,50] possibly implementing different templates, while others manually provided web pages that implement exactly the same template [114,113]. Finally, other authors used well-known benchmark suites such as CleanEval [20] benchmark suite [115,99], MSS (Myriad 40 and Big 5) [85], L3S-GN1 [61], etc.…”
Section: Downloading and Configuring The Suite (citation type: mentioning)
confidence: 99%