This thesis explores how search log analysis can be used to gain a deeper understanding of online search behavior in curated collections by leveraging the metadata. For this, we use both the metadata of facets selected in the search interface and the metadata of the clicked documents. Our research is conducted using data from the National Library of the Netherlands, a typical digital library with a richly annotated historical newspaper collection and a faceted search interface. We investigate how to leverage metadata in three analytical settings.
First, we analyze the logs using specific metadata values to study search within specified parts of the collection. The analysis shows that faceted search is common and that search patterns differ between parts of the collection, and it allows us to formulate concrete suggestions for improving the search system and collection management. This shows how metadata can be used to analyze search behavior in specific parts of a collection.
Second, we uncover user interests by clustering over the combined metadata values. We apply a clustering algorithm that groups sessions based on the metadata of selected facets and clicked documents. To evaluate the resulting clusters, we measure their stability over a six-month period. The results show that user interests are stable, with the same interests recurring, and that the related search behavior varies per cluster. This demonstrates that partitioning sessions based on metadata and investigating the related search behavior reveals specific user needs in specific parts of a collection, patterns that would disappear in an overall analysis.
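As an illustration only, the idea of checking whether the same interests recur across periods can be sketched as follows. This summary names neither the clustering algorithm nor the stability measure, so the sketch assumes that clusters have already been computed per period, summarizes each cluster by its most frequent metadata values, and uses Jaccard overlap as a stand-in measure. All function names and metadata values (such as `type:article`) are hypothetical.

```python
from collections import Counter


def cluster_profile(sessions, top_n=2):
    """Summarize a cluster by its top_n most frequent metadata values.

    Each session is a list of metadata values taken from selected
    facets and clicked documents (hypothetical encoding).
    """
    counts = Counter(value for session in sessions for value in session)
    return {value for value, _ in counts.most_common(top_n)}


def jaccard(a, b):
    """Jaccard overlap of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def stability(clusters_a, clusters_b, top_n=2):
    """For each cluster in period A, the best overlap with any cluster in period B."""
    profiles_b = [cluster_profile(c, top_n) for c in clusters_b]
    return [
        max((jaccard(cluster_profile(c, top_n), p) for p in profiles_b), default=0.0)
        for c in clusters_a
    ]


# Hypothetical clusters for two periods (assumed to be precomputed).
period_1 = [
    [["type:advertisement", "decade:1950"], ["type:advertisement", "decade:1950"]],
    [["type:article", "decade:1940"], ["type:article", "decade:1940", "region:national"]],
]
period_2 = [
    [["type:article", "decade:1930"], ["type:article", "decade:1930"]],
    [["type:advertisement", "decade:1950"], ["type:advertisement", "decade:1950", "region:regional"]],
]

scores = stability(period_1, period_2)
print(scores)
```

A score close to 1 suggests that an interest from the first period reappears in the second; a low score suggests that it shifted or disappeared.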
Third, we explore how to identify search for specific topics when no metadata directly describes these topics. We look into different, consecutive ways to build a term list as a topic representation: (i) using a knowledge resource, (ii) using local word embeddings trained on the collection, and (iii) using manual curation. We then look into how to match the different term lists to search sessions: matching the terms to (a) user queries or (b) clicked documents. We investigate two topics of societal relevance, WWII and feminism, and compare and discuss the combined methods in terms of the number of retrieved sessions as well as estimated precision scores computed using manually created ground truths. With this work, we provide insights into how different topic representations and matching approaches perform when retrieving topic-specific sessions.
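The two matching approaches and the precision estimate can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the implementation used in the thesis: the `Session` structure, the word-level tokenization, the restriction to single-word terms, and all example data and labels are hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Session:
    """Hypothetical session record: the queries issued and the text of clicked documents."""
    session_id: str
    queries: list = field(default_factory=list)
    clicked_docs: list = field(default_factory=list)


def tokenize(text):
    """Lowercase a text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(sessions, terms, source):
    """Return ids of sessions in which any term occurs in the chosen source.

    source is "queries" or "clicked_docs"; terms are assumed to be single words.
    """
    wanted = {t.lower() for t in terms}
    hits = []
    for s in sessions:
        texts = s.queries if source == "queries" else s.clicked_docs
        if any(tokenize(t) & wanted for t in texts):
            hits.append(s.session_id)
    return hits


def precision(retrieved, relevant):
    """Fraction of retrieved sessions that are labeled relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for sid in retrieved if sid in relevant) / len(retrieved)


# Hypothetical sessions, term list, and manually created relevance labels.
sessions = [
    Session("s1", ["feminism newspaper"], ["Article about women's suffrage"]),
    Session("s2", ["football results"], ["Report on the feminism debate in parliament"]),
    Session("s3", ["weather 1953"], ["Storm floods in Zeeland"]),
]
terms = {"feminism", "suffrage"}
relevant = {"s1"}

by_query = retrieve(sessions, terms, "queries")
by_click = retrieve(sessions, terms, "clicked_docs")
print(len(by_query), precision(by_query, relevant))
print(len(by_click), precision(by_click, relevant))
```

In this toy data, matching on clicked documents retrieves more sessions than matching on queries, but one of them is not labeled relevant, which illustrates the trade-off between the number of retrieved sessions and precision.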
Finally, we examine how to communicate the results of such an analysis to collection owners and domain experts. We introduce MAGUS, a session visualization tool combining graphs to visualize search behavior with colors to visualize relevant metadata. Our design is novel in combining both search interactions and metadata in a single visualization, allowing researchers and professionals to recognize different interaction patterns while at the same time providing insights into the parts of the collection a user is interested in. For the evaluation, we conduct a user study comparing MAGUS with a table representation in three tasks completed by 12 participants from diverse backgrounds. Our study demonstrates that MAGUS enables participants to identify the part of the collection a user is interested in, and that it helps to distinguish different types of search behavior.
We expect that the presented methods can be applied to other collections. Vertical search engines are ubiquitous, and the search interfaces providing access to these systems are complex. We show how leveraging metadata in a search log analysis can enhance our understanding of how users search within different parts of a digital library. This enabled us to provide collection owners with recommendations on how to improve access to the collection.