Abstract. In contrast to classic retrieval, where users search for factual information, opinion retrieval deals with the search for subjective information. A major challenge in opinion retrieval is the informal style of writing and the use of domain-specific jargon to describe the opinion targets. In this paper, we present an automatic method to learn a space model for opinion retrieval. Our approach is a generative model that learns sentiment word distributions by embedding multi-level relevance judgments in the estimation of the model parameters. The model is learned using online Variational Inference, a recently published method that can learn from streaming data and scale to very large datasets. Opinion retrieval and classification experiments on two large datasets, with 703,000 movie reviews and 189,000 hotel reviews, showed that the proposed method outperforms the baselines while using a significantly lower-dimensional lexicon than other methods.
Introduction
The increasing popularity of the WWW has led to profound changes in people's habits. Search now goes beyond looking for factual information: people wish to search for the opinions of others to help them in their own decision-making [16,17]. In this new context, sentiment expressions, or opinion expressions, are important pieces of information, especially in the context of online commerce [12]. Therefore, modeling text to find meaningful words for expressing sentiments (sentiment lexicons) has emerged as an important research direction [1,9,13,28]. In this work, we investigate the viability of automatically generating a sentiment lexicon for opinion retrieval and sentiment classification applications.
Some authors have tackled opinion retrieval by re-ranking search results with an expansion of sentiment words. For example, Zhang and Ye [28] describe how to use a generic and fixed sentiment lexicon to improve opinion retrieval through the maximization of a quadratic relation model between sentiment words and topic relevance. In contrast, Gerani et al. [9] apply a proximity-based opinion propagation method to calculate the opinion density at each point in a document. Later, Jo and Oh [13] proposed a unified aspect and sentiment model based on the assumption that each sentence concerns one aspect and all sentiment words in that sentence refer to that aspect. Finally, Aktolga and Allan [1] targeted the task of sentiment diversification in search results. The common element among these works [1,9,13,28] is the use of the