2015
DOI: 10.1007/s10618-015-0428-8

Exploiting link structure for web page genre identification

Abstract: As the World Wide Web grows at an unprecedented pace, web page genre identification has recently attracted increasing attention because of its importance in web search. A common approach for genre identification is to utilize textual features that can be extracted directly from the web page itself, i.e., On-Page features. The extracted features are subsequently given to a machine learning algorithm that will perform classification. However, these approaches may not be effective when the web page contains limite…

Cited by 20 publications (10 citation statements)
References 33 publications

“…The 7-web-genres dataset (Zhu et al, 2016) has a total of 1400 HTML pages in seven categories, i.e., blog, eshop, FAQ, online newspaper front page, listing, personal home page, and search page. They are functions of web pages.…”
Section: Experimental Results and Analysis (mentioning)
confidence: 99%
“…Based on the work of computing the frequencies of different HTML tags (Zhu et al, 2016), we capture the structural features of web pages by constructing vectors according to the tree-like structure of HTML tags. In web pages, HTML tags are arranged in a tree-like structure.…”
Section: Extracting Structural and Textual Features Of Web Pages (mentioning)
confidence: 99%
“…This project, which has been for two decades the largest publicly available Web directory, catalogs a huge number of web pages by means of a suitable taxonomy, each node containing web pages related to a specific topic. Categorizing web pages is an essential activity to improve user experience [22], particularly when classes are topics [23,24] and when the page at hand must be labeled as relevant or not [25]. In this scenario, both dataset has been generated from a pair of ODP categories, whose samples have been preprocessed for extracting the corresponding textual content.…”
Section: Results (mentioning)
confidence: 99%