Large amounts of cultural heritage content have now been digitized and are available in digital libraries. However, these collections are often unstructured and difficult to navigate. Automatic techniques for identifying similar items in these collections could be used to improve navigation, since they would allow items that are implicitly connected to be linked together and sets of similar items to be clustered. Europeana is a large digital library containing more than 20 million digital objects from a set of cultural heritage providers throughout Europe. The diverse nature of this collection means that the items do not have standard metadata to assist navigation. A range of methods for computing the similarity between pairs of texts are applied to metadata records in Europeana in order to estimate the similarity between items. Various methods for computing similarity have been proposed and can be classified into two main approaches: (1) knowledge-based approaches, which make use of external knowledge sources, and (2) corpus-based approaches, which rely on analyzing the frequency distributions of words in documents. Both types of approach are evaluated against manual judgements obtained for this study and against a multiple-choice test created from manually generated categories in cultural heritage collections. We find that a combination of corpus-based and knowledge-based approaches provides the best results in both experiments.
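As an illustration of the corpus-based family of approaches, the following minimal sketch compares short metadata texts using TF-IDF weighting and cosine similarity. It is not the specific method evaluated in this study: the weighting scheme, the tokenizer, the function names, and the three example records are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on non-alphanumeric characters (a simplifying assumption).
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    # Build one sparse TF-IDF vector (a dict of term -> weight) per document.
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    # Smoothed inverse document frequency, so no term receives a zero weight.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u, v):
    # Cosine similarity between two sparse vectors; 0.0 if either is empty.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical metadata records, invented for this example (not Europeana data).
records = [
    "Roman silver coin found near York",
    "Silver coin of the Roman emperor Hadrian",
    "Oil painting of a harbour at sunset",
]

vectors = tfidf_vectors(records)
sim_coins = cosine(vectors[0], vectors[1])      # two coin records share terms
sim_unrelated = cosine(vectors[0], vectors[2])  # no shared terms
print(f"coin vs coin: {sim_coins:.3f}")
print(f"coin vs painting: {sim_unrelated:.3f}")
```

A purely lexical measure of this kind gives a score of zero to records that share no words, even when they are related in meaning, which is one motivation for combining it with knowledge-based measures.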
ACM Reference Format: Aletras, N., Stevenson, M., and Clough, P. 2012. Computing similarity between items in a digital library of cultural heritage. ACM J. Comput. Cult.