Importance of HTML Structural Elements and Metadata in Automated Subject Classification

Golub, Koraljka; Ardö, Anders

doi:10.1007/11551362_33

Cited by 28 publications

(21 citation statements)

References 12 publications


“…By comparing automatically assigned classes to manually assigned ones at all the five levels of specificity (Ei has five hierarchical levels), the F1 measure was 0,26, whereas if comparison was done by reducing all the classes to the first two hierarchical levels, F1 was 0,59 (K. Golub and A. Ardö 2005). Also, an additional evaluation was performed, in which a subject expert evaluated both the automatically and manually assigned classes of a random sample of 109 Web pages.…”

Section: Algorithm (mentioning)

confidence: 99%
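
The gap between the two F1 figures quoted above comes from how strictly classes are compared. The following is a minimal sketch, assuming each class is represented as a tuple with one element per hierarchy level; the class labels are hypothetical, and the per-document F1 shown here is only an illustration, since the statement does not say how the study aggregated its scores.

```python
def truncate(classes, depth):
    """Reduce each class path to its first `depth` hierarchy levels."""
    return {path[:depth] for path in classes}


def f1(auto, manual):
    """F1 between automatically and manually assigned class sets."""
    true_positives = len(auto & manual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(auto)
    recall = true_positives / len(manual)
    return 2 * precision * recall / (precision + recall)


# Hypothetical class paths (not real Ei codes), one element per level.
auto = {("A", "A1", "A1a"), ("B", "B2", "B2c")}
manual = {("A", "A1", "A1b"), ("B", "B2", "B2c")}

full = f1(auto, manual)                             # compare at full depth
top2 = f1(truncate(auto, 2), truncate(manual, 2))   # compare first two levels
```

In this toy case the two sets disagree only at the third level, so the full-depth comparison is penalized while the two-level comparison is not.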

Automated subject classification of textual Web pages, based on a controlled vocabulary: Challenges and recommendations

Golub

2006

New Review of Hypermedia and Multimedia

Self Cite


The primary objective of this study was to identify and address problems of applying a controlled vocabulary in automated subject classification of textual Web pages, in the area of engineering. Web pages have special characteristics such as structural information, but are at the same time rather heterogeneous. The classification approach used comprises string-to-string matching between words in a term list extracted from the Ei (Engineering Information) thesaurus and classification scheme, and words in the text to be classified. Based on a sample of 70 Web pages, a number of problems with the term list are identified. Reasons for those problems are discussed and improvements proposed. Methods for implementing the improvements are also specified, suggesting further research.
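
The string-to-string matching described in this abstract can be illustrated with a small sketch. It assumes a term list that maps each term to a single class; the terms, class labels, and the `classify` function are hypothetical and are not taken from the Ei thesaurus or from the authors' system.

```python
import re
from collections import Counter

# Hypothetical term list: term -> class label (not real Ei entries).
TERM_LIST = {
    "heat transfer": "Thermodynamics",
    "turbine": "Mechanical engineering",
    "concrete": "Civil engineering",
}


def classify(text, term_list=TERM_LIST):
    """Rank classes by the number of exact term occurrences in the text."""
    # Lowercase and strip punctuation so matching is plain string-to-string.
    normalized = " ".join(re.findall(r"[a-z0-9]+", text.lower()))
    scores = Counter()
    for term, label in term_list.items():
        hits = len(re.findall(rf"\b{re.escape(term)}\b", normalized))
        if hits:
            scores[label] += hits
    return scores.most_common()


ranked = classify("Heat transfer in a gas turbine; turbine blade cooling.")
```

Exact matching of this kind is sensitive to spelling variants, word forms, and ambiguous terms, which is the sort of term-list problem the abstract says the study examined.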


“…Importance of HTML structural elements and metadata in automated subject classification is shown in paper [11]. The aim of the paper was to determine how significance indicators assigned to different Web page elements (internal metadata, title, headings, and main text) influence automated classification.…”

Section: A Related Work (mentioning)

confidence: 99%
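
The idea of significance indicators for page elements, as summarized in the statement above, can be sketched as follows. This is an illustration only: the weight values, the `ElementText` parser, and the `weighted_score` function are assumptions, and the scoring uses simple substring counting rather than whatever matching the original study used.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical significance indicators; the study's actual values are not given here.
WEIGHTS = {"title": 4.0, "headings": 3.0, "metadata": 2.0, "text": 1.0}


class ElementText(HTMLParser):
    """Collect page text into buckets: title, headings, metadata, text."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.buckets = defaultdict(list)
        self._current = "text"

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._current = "title"
        elif tag in self.HEADINGS:
            self._current = "headings"
        elif tag == "meta":
            attributes = dict(attrs)
            name = (attributes.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.buckets["metadata"].append(attributes.get("content") or "")

    def handle_endtag(self, tag):
        if tag == "title" or tag in self.HEADINGS:
            self._current = "text"

    def handle_data(self, data):
        self.buckets[self._current].append(data)


def weighted_score(html, term, weights=WEIGHTS):
    """Sum term occurrences per element, scaled by that element's weight."""
    parser = ElementText()
    parser.feed(html)
    parser.close()
    term = term.lower()
    return sum(
        weights[name] * " ".join(parts).lower().count(term)
        for name, parts in parser.buckets.items()
    )


page = (
    "<html><head><title>Turbine design</title>"
    '<meta name="keywords" content="turbine, rotor"></head>'
    "<body><h1>Gas turbine basics</h1><p>A turbine converts energy.</p></body></html>"
)
score = weighted_score(page, "turbine")
```

Changing the values in `WEIGHTS` and re-running the evaluation is one way to explore how the relative significance of each element affects classification.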

Web page classification based on Schema.org collection

Krutil; Kudělka; Snášel

2012

2012 Fourth International Conference on Computational Aspects of Social Networks (CASoN)


The internet is a library holding a huge amount of information, and there is a need to categorize its content through web page classification. Classifying web page content can improve the quality and accuracy of web search. Unfortunately, the high dimensionality of web page datasets makes classification difficult. An automatic method for web page classification can simplify the whole process and help search engines return more relevant results. Today, information on the web is generally structured and formatted informally. This absence of semantics motivates formal methods that add semantic information to web pages. Search engines including Bing, Google, Yahoo! and Yandex created the Schema.org collection of schemas to support web page semantics and improve their search results. This paper explores the use of formal source code structure for classifying a large collection of web content. It focuses on using the Schema.org schema collection to classify web pages and categorize them unambiguously.


“…In [3], the use of information derived from HTML tags of a page for classification, is proposed. Similar method, in which the HTML tags are divided into three groups with different importance of terms in each group, is described in [4].…”

Section: A Term Weighting For Classification (mentioning)

confidence: 99%

Text-Based Web Page Classification with Use of Visual Information

Bartík

2010

2010 International Conference on Advances in Social Networks Analysis and Mining


As the number of pages on the web is constantly increasing, there is a need to classify pages into categories to facilitate indexing or searching them. In the method proposed here, we use both textual and visual information to find a suitable representation of web page content. In this paper, several term weights based on TF or TF-IDF weighting are proposed. The modification is based on the visual areas in which the text appears and on their visual properties. Some results of experiments are included in the final part of the paper.
