Exploiting link structure for web page genre identification

Zhu, Jia; Xie, Qing; Yu, Shoou-I; Wong, Wendy H.

doi:10.1007/s10618-015-0428-8

Cited by 20 publications

(10 citation statements)

References 33 publications


“…The 7-web-genres dataset (Zhu et al, 2016) has a total of 1400 HTML pages in seven categories, i.e., blog, eshop, FAQ, online newspaper front page, listing, personal home page, and search page. They are functions of web pages.…”

Section: Experimental Results and Analysis (mentioning)

confidence: 99%

“…Based on the work of computing the frequencies of different HTML tags (Zhu et al, 2016), we capture the structural features of web pages by constructing vectors according to the tree-like structure of HTML tags. In web pages, HTML tags are arranged in a tree-like structure.…”

Section: Extracting Structural and Textual Features Of Web Pages (mentioning)

confidence: 99%
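The statement above contrasts two structural features: plain HTML tag frequencies and features that follow the tree-like nesting of tags. The sketch below shows one way to compute both with Python's standard `html.parser`. The class name `TagCounter`, the helper `structural_features`, and the choice to encode the tree as root-to-node tag paths are assumptions made for illustration; they are not taken from either cited paper.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts opening tags and the root-to-node path of each tag (hypothetical helper)."""

    # Void elements have no closing tag, so they are never pushed on the stack.
    VOID = {"area", "base", "br", "col", "embed", "hr", "img",
            "input", "link", "meta", "source", "track", "wbr"}

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()   # flat tag frequencies
        self.path_counts = Counter()  # frequencies of nested tag paths
        self._stack = []              # currently open tags, outermost first

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        self.path_counts["/".join(self._stack + [tag])] += 1
        if tag not in self.VOID:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self._stack:
            while self._stack.pop() != tag:
                pass


def structural_features(html):
    """Return (tag frequencies, tag-path frequencies) for one HTML string."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.tag_counts, parser.path_counts
```

Either counter can be turned into a fixed-length vector by fixing a vocabulary of tags or paths; the path counts keep nesting information that the flat counts discard.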

“…The classification results of dataset B are sorted as dataset B* according to the maximum predicted scores from high to low. For example, for the 7-web-genres dataset (Zhu et al, 2016), Table 1 lists 10 results with the low predicted score for dataset B*, and Table 2 lists 10 results with the high predicted score for dataset B*. Numbers 1-7 represent seven categories, and the real categories are the known labels in the dataset.…”

Section: Acquisition Of Confidence (mentioning)

confidence: 99%

“…The classification performance can be improved by combining multiple classifiers, such as voting, bagging, and boosting (Baskin et al, 2017). Zhu et al (2016) used a decision matrix to construct a model with multiple SVM classifiers to classify web pages, but the combination of different types of high-performance classifiers could be better. Elsalmy et al (2017) enhanced the predictive power of web page classification models by stacking, but stacking is complicated.…”

Section: Introduction (mentioning)

confidence: 99%


Web page classification based on heterogeneous features and a combination of multiple classifiers

Deng, Shen (2020), Front Inform Technol Electron Eng


Precise web page classification can be achieved by evaluating features of web pages, and the structural features of web pages are effective complements to their textual features. Various classifiers have different characteristics, and multiple classifiers can be combined to allow classifiers to complement one another. In this study, a web page classification method based on heterogeneous features and a combination of multiple classifiers is proposed. Different from computing the frequency of HTML tags, we exploit the tree-like structure of HTML tags to characterize the structural features of a web page. Heterogeneous textual features and the proposed tree-like structural features are converted into vectors and fused. Confidence is proposed here as a criterion to compare the classification results of different classifiers by calculating the classification accuracy of a set of samples. Multiple classifiers are combined based on confidence with different decision strategies, such as voting, confidence comparison, and direct output, to give the final classification results. Experimental results demonstrate that on the Amazon dataset, 7-web-genres dataset, and DMOZ dataset, the accuracies are increased to 94.2%, 95.4%, and 95.7%, respectively. The fusion of the textual features with the proposed structural features is a comprehensive approach, and the accuracy is higher than that when using only textual features. At the same time, the accuracy of the web page classification is improved by combining multiple classifiers, and is higher than those of the related web page classification algorithms.
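The abstract names three decision strategies (voting, confidence comparison, and direct output) and defines confidence as the classification accuracy on a set of samples. The following is a minimal sketch of how such a combination could look. The function names and, in particular, the rule for choosing between strategies (strict majority first, otherwise the most confident classifier) are assumptions for illustration; the abstract does not specify them.

```python
from collections import Counter


def confidence(predictions, labels):
    """Accuracy of one classifier on a held-out sample set."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def combine(votes, confidences):
    """Combine one predicted label per classifier into a final label.

    votes: predicted labels, one per classifier.
    confidences: per-classifier confidence values, in the same order.
    The ordering of strategies below is an assumed rule, not the paper's.
    """
    if len(votes) == 1:
        return votes[0]  # direct output: only one classifier available
    label, count = Counter(votes).most_common(1)[0]
    if count > len(votes) / 2:
        return label  # voting: a strict majority agrees
    # Confidence comparison: fall back to the most confident classifier.
    best = max(range(len(votes)), key=lambda i: confidences[i])
    return votes[best]
```

For example, `combine(["blog", "eshop", "faq"], [0.5, 0.6, 0.9])` has no majority, so the label of the classifier with confidence 0.9 is returned.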


“…This project, which has been for two decades the largest publicly available Web directory, catalogs a huge number of web pages by means of a suitable taxonomy, each node containing web pages related to a specific topic. Categorizing web pages is an essential activity to improve user experience [22], particularly when classes are topics [23,24] and when the page at hand must be labeled as relevant or not [25]. In this scenario, both dataset has been generated from a pair of ODP categories, whose samples have been preprocessed for extracting the corresponding textual content.…”

Section: Results (mentioning)

confidence: 99%

Phi-Delta-Diagrams: Software Implementation of a Visual Tool for Assessing Classifier and Feature Performance

Armano, Giuliani, Neumann et al. (2018), MAKE


Abstract: In this article, a two-tiered 2D tool is described, called ϕ, δ diagrams, and this tool has been devised to support the assessment of classifiers in terms of accuracy and bias. In their standard versions, these diagrams provide information, as the underlying data were in fact balanced. Their generalization, i.e., ability to account for the imbalance, will be also briefly described. In either case, the isometrics of accuracy and bias are immediately evident therein, as, according to a specific design choice, they are in fact straight lines parallel to the x-axis and y-axis, respectively. ϕ, δ diagrams can also be used to assess the importance of features, as highly discriminant ones are immediately evident therein. In this paper, a comprehensive introduction on how to adopt ϕ, δ diagrams as a standard tool for classifier and feature assessment is given. In particular, with the goal of illustrating all relevant details from a pragmatic perspective, their implementation and usage as Python and R packages will be described.


SCHOLAT: An Innovative Academic Information Service Platform

Tang, Zhu et al. (2016), Lecture Notes in Computer Science
