Result diversification is a retrieval strategy for dealing with ambiguous or multi-faceted queries by providing documents that cover as many facets of the query as possible. We propose a result diversification framework based on query-specific clustering and cluster ranking, in which diversification is restricted to documents belonging to clusters that potentially contain a high percentage of relevant documents. Empirical results show that the proposed framework improves the performance of several existing diversification methods. The framework also gives rise to a simple yet effective cluster-based approach to result diversification that selects documents from different clusters, in a round-robin fashion, for inclusion in a ranked list. We describe a set of experiments aimed at thoroughly analyzing the behavior of the two main components of the proposed diversification framework: ranking clusters and selecting clusters for diversification. Both components have a crucial impact on the overall performance of our framework, but ranking clusters plays a more important role than selecting clusters. We also examine the properties that clusters should have in order for our diversification framework to be effective. Most relevant documents should be contained in a small number of high-quality clusters, and no cluster should be dominantly large. In addition, documents from these high-quality clusters should have diverse content. These properties are strongly correlated with the overall performance of the proposed diversification framework.
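The round-robin selection mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the clusters have already been ranked, that the documents inside each cluster are already ordered (for example by retrieval score), and that a fixed number of top-ranked clusters is kept. The names `round_robin_diversify`, `top_k_clusters`, and `list_size` are hypothetical.

```python
from typing import List, Sequence


def round_robin_diversify(
    ranked_clusters: Sequence[Sequence[str]],
    top_k_clusters: int,
    list_size: int,
) -> List[str]:
    """Build a ranked list by cycling over the top-ranked clusters.

    Assumptions (not taken from the paper): clusters arrive best-first,
    and each cluster lists its document ids best-first.
    """
    # Restrict diversification to the top-ranked clusters only.
    selected = ranked_clusters[:top_k_clusters]
    max_depth = max((len(cluster) for cluster in selected), default=0)

    result: List[str] = []
    seen = set()  # guards against a document appearing in several clusters
    depth = 0
    while len(result) < list_size and depth < max_depth:
        # One pass takes the document at position `depth` from each cluster.
        for cluster in selected:
            if depth >= len(cluster):
                continue  # this cluster is exhausted
            doc = cluster[depth]
            if doc in seen:
                continue
            seen.add(doc)
            result.append(doc)
            if len(result) == list_size:
                break
        depth += 1
    return result


if __name__ == "__main__":
    clusters = [["d1", "d2", "d3"], ["d4"], ["d5", "d6"], ["d7"]]
    print(round_robin_diversify(clusters, top_k_clusters=3, list_size=5))
```

In this sketch the fourth cluster never contributes a document, which mirrors the idea of restricting diversification to clusters judged likely to contain relevant documents.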
Introduction

Queries submitted to Web search engines are often ambiguous or multi-faceted in the sense that they have multiple interpretations or sub-topics (Allan & Raghavan, 2002). A typical example of an ambiguous query is "jaguar", which has several interpretations, including a kind of animal, a car brand, a type of cocktail, an operating system, etc. Multi-faceted queries are even more common in practice; for example, the interpretation "jaguar car" of the query "jaguar" may cover a wide range of sub-topics: models, prices, history of the company, etc. For such queries we often cannot be certain what the searcher's underlying information need is because of a lack of context. One retrieval strategy that attempts to cater for multiple interpretations of an ambiguous or multi-faceted query is to diversify the search results (Boyce, 1982; Goffman, 1964). Without explicit or implicit user feedback or history, the retrieval system makes an educated guess as to the possible facets of the query and presents as diverse a result list as possible by including documents pertaining to different facets of the query within the top-ranked documents.

Recently, various result diversification methods have been proposed (Agrawal, Gollapudi, Halverson, & Ieong, 2009; Carbonell & Goldstein, 1998; Carterette & Chandar, 2009; Chen & Karger, 2006; Radlinski, Kleinberg, & Joachims, 2008; Santos, Macdonald, & Ounis, 2010; Zhai, Cohen, & Lafferty, 2003). Traditional retrieval strategies such ...