Abstract. In this paper, an information extraction method for a restaurant recommendation system is proposed. We aim to develop an information extraction (IE) system intended as a module of the recommendation system. The IE system gathers information about different aspects of restaurants from online reviews, structures it, and feeds the obtained data to the recommendation module. The analyzed frames include service and food quality, cuisine, price level, noise level, etc. In this paper, service quality, cuisine type and food quality are considered. As part of the corpus preprocessing phase, a method for analyzing a corpus of Russian reviews for information extraction is proposed. Its importance is shown in the experimental phase, where the application of machine learning techniques to aspect extraction is analyzed. It is shown that the insights obtained at the corpus preprocessing stage can help improve the performance of machine learning models.
Keywords: corpus analysis, restaurant reviews, information extraction, recommendation system, machine learning.
Introduction
In this paper, an information extraction (IE) method for a Russian restaurant recommendation system is proposed. It is based on linguistic information gathered from corpus analysis and can be used for similar domains and under-resourced languages. Our information extraction framework is part of a project that aims to implement a restaurant recommendation system. In this paper we consider two tasks: analysis of a review corpus and the application of machine learning techniques to aspect extraction. In the latter task we use the information obtained at the corpus analysis phase. Our approach includes opinion mining, since restaurant characteristics are both objective and subjective. Our corpus analysis method is based on non-contiguous bigrams and part-of-speech (POS) distribution analysis. Trigger word dictionaries are learnt using a bootstrapping method.
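To illustrate what counting non-contiguous bigrams can look like, here is a minimal Python sketch. It is not the authors' implementation: the function names `skip_bigrams` and `count_skip_bigrams`, the `max_gap` parameter, and the toy tokens are assumptions made for illustration, and the paper does not state the window size it uses here.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def skip_bigrams(tokens: List[str], max_gap: int = 2) -> Iterator[Tuple[str, str]]:
    """Yield ordered token pairs with at most `max_gap` tokens between them.

    max_gap=0 yields ordinary (contiguous) bigrams.
    The window size is a hypothetical parameter, not taken from the paper.
    """
    for i, left in enumerate(tokens):
        # j runs over positions i+1 .. i+1+max_gap, clipped to the sentence end
        for j in range(i + 1, min(i + 2 + max_gap, len(tokens))):
            yield left, tokens[j]


def count_skip_bigrams(
    sentences: Iterable[List[str]], max_gap: int = 2
) -> Counter:
    """Count non-contiguous bigrams over a corpus of tokenized sentences."""
    counts: Counter = Counter()
    for tokens in sentences:
        counts.update(skip_bigrams(tokens, max_gap))
    return counts


# Toy, hypothetical input: already tokenized (and ideally lemmatized) sentences.
corpus = [
    ["waiter", "was", "very", "polite"],
    ["waiter", "was", "polite"],
]
counts = count_skip_bigrams(corpus, max_gap=2)
print(counts.most_common(3))
```

A pair such as ("waiter", "polite") is counted in both toy sentences even though the words are not adjacent in the first one, which is the property that makes non-contiguous bigrams useful for reviews where the aspect word and its evaluation are separated by other tokens.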
E. Pronoza et al.
The frames to be extracted include service quality, food quality, cuisine type, price level, noise level, etc. Each frame has its own set of aspects. We assume that the most important characteristics of a restaurant are service quality, food quality and cuisine type, and we therefore consider only these three frames and focus on extracting their aspects. This assumption is supported by the distribution of the aspects in the data.
We also hypothesize that the proposed IE system can be highly effective despite the difficulties imposed by the structure of a typical Russian restaurant review. Although the key information about restaurant characteristics does not always lie on the surface, tuning machine learning models according to the results of corpus analysis can help improve the performance of an IE system.
Related Work
The information extraction (IE) task as part of recommendation system development is discussed in [21]. The authors propose a rule-based approach to the extraction of keywords from a user's email. These keywords are put in...