Abstract-Microblog summarization can save users a large amount of browsing time. However, summarizing microblogs is more challenging than summarizing traditional documents because posts are heavily noisy and severely sparse. In this paper, we propose an unsupervised method named TR-LDA that summarizes microblogs by cascading two key-bigram extractors based on TextRank and Latent Dirichlet Allocation (LDA). The cascading strategy yields a key-bigram set with better noise immunity. Two sentence ranking strategies are proposed based on the key-bigram set, and sentences are extracted by merging the two ranking results. In experiments on a Sina Weibo dataset, the proposed method outperformed several other text-content-based summarizers.
Index Terms-Key-bigram extraction, microblog summarization, sentence extraction, TR-LDA.
Manuscript received October 5, 2014; revised January 7, 2015. This work was supported in part by the National Natural Science Foundation of China (Grants No. 61203281 and No. 61303172). The authors are with the Institute of Automation, Chinese Academy of Sciences, 95 Zhongguancun East Road, Beijing, 100190, China (e-mail: yufang.wu@ia.ac.cn).
I. INTRODUCTION
Microblog platforms such as Twitter and Sina Weibo have become part of our daily life; through them we can obtain timely information and stay in touch with the world. However, the sheer volume of information can be overwhelming. Users could save a great deal of browsing time if microblogs were summarized automatically. Moreover, text analysis tasks such as classification, clustering and information retrieval can benefit from text summarization because it reduces dimensionality.
The purpose of this paper is to automatically extract several salient sentences from a set of topic-related microblog posts to form a summary of their core content. From the perspective of traditional document summarization, this can be treated either as a multi-document summarization problem, by treating each post as a document, or as a single-document summarization problem, by simply concatenating all posts into one document. However, the problem is more intractable than summarizing traditional documents, since microblog posts suffer from severe sparsity, heavy noise and poor normalization [1], whereas traditional documents are usually well structured and semantically clear. Most existing microblog summarization methods suffer from low precision.
To overcome the above difficulties, we propose an unsupervised method named TR-LDA that summarizes microblogs by cascading key-bigram extractors. Unlike most existing methods [1]-[4], which either weight sentences using the Bag-of-Words (BoW) model or rank sentences directly on a text graph, our TR-LDA method generates a summary in two main steps: 1) extract a key-bigram set to discover the subtopics of the hot-topic posts by cascading TextRank and LDA extractors; 2) rank sentences based on the key-bigram set using two strategies and extract sentences by merging the two ...