This is the author’s version of a work that was accepted for publication in Web Semantics: Science, Services and Agents on the World Wide Web. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Web Semantics: Science, Services and Agents on the World Wide Web, 9, 1, (2011) DOI: 10.1016/j.websem.2010.10.001In social tagging systems, users have different purposes when they annotate items. Tags not only depict the content of the annotated items, for example by listing the objects that appear in a photo, or express contextual information about the items, for example by providing the location or the time in which a photo was taken, but also describe subjective qualities and opinions about the items, or can be related to organisational aspects, such as self-references and personal tasks.
Current folksonomy-based search and recommendation models exploit the social tag space as a whole to retrieve those items relevant to a tag-based query or user profile, and do not take into consideration the purposes of tags. We hypothesise that a significant percentage of tags are noisy for content retrieval, and believe that the distinction of the personal intentions underlying the tags may be beneficial to improve the accuracy of search and recommendation processes.
We present a mechanism to automatically filter and classify raw tags in a set of purpose-oriented categories. Our approach finds the underlying meanings (concepts) of the tags, mapping them to semantic entities belonging to external knowledge bases, namely WordNet and Wikipedia, through the exploitation of ontologies created within the W3C Linking Open Data initiative. The obtained concepts are then transformed into semantic classes that can be uniquely assigned to content- and context-based categories. The identification of subjective and organisational tags is based on natural language processing heuristics.
We collected a representative dataset from Flickr social tagging system, and conducted an empirical study to categorise real tagging data, and evaluate whether the resultant tags categories really benefit a recommendation model using the Random Walk with Restarts method. The results show that content- and context-based tags are considered superior to subjective and organisational tags, achieving equivalent performance to using the whole tag space.This research was supported by the European Commission (SALERO, FP6-027122), by the Spanish Ministry of Science and Education (RIM3, TIN2008-06566-C04-02), and by the Spanish Ministry of Industry (i3media, CENIT-2007-1012)