Due to the explosive growth of short text on various social media platforms, short text stream clustering has become an increasingly prominent issue. Unlike traditional text streams, short text stream data present the following characteristics: short length, weak signal, high volume, high velocity, topic drift, etc. Existing methods cannot simultaneously address two major problems very well: inferring the number of topics and topic drift. Therefore, we propose a dynamic clustering algorithm for short text streams based on the Dirichlet process (DCSS), which can automatically learn the number of topics in documents and solve the topic drift problem of short text streams. To solve the sparsity problem of short texts, DCSS considers the correlation of the topic distribution at neighbouring time points and uses the inferred topic distribution of past documents as a prior of the topic distribution at the current moment while simultaneously allowing newly streamed documents to change the posterior distribution of topics. We conduct experiments on two widely used datasets, and the results show that DCSS outperforms existing methods and has better stability. IntroductionShort texts are prevalent on the Web, including on traditional websites, e.g., news titles and search snippets, and emerging social media, e.g., microblogs and tweets. In recent years, these data have swept the world at an alarming rate, and have produced large quantities of data streams, also called short text streams. Short text stream clustering [1] is challenging due to the inherent characteristics of short text steams such as short length, weak signal and high ambiguity of each short text, and the explosive growth and popularity of short textual content.Considering the characteristics of short text streams, it is difficult to apply traditional short text clustering approaches to model building because of the following two challenges: 1) insufficient information or statistical signals [2] in short
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.