A corpus (e.g., patents or news texts) is an important knowledge resource that covers various topics, such as specific technologies or social events. Topic detection models for corpora, e.g., Latent Dirichlet Allocation (LDA) and KeyGraph, provide an important basis for exploring the status quo and trends in science, technology, or social events. However, these models suffer from low retrieval performance because they consider only the explicit semantics of the text itself in a single-domain corpus. In addition, many incremental models, such as online-LDA, are based on time slices. In this paper, we propose a new topic detection model, inspired by a human memory cognitive process (THC), to improve topic detection performance on a single-domain corpus. First, to improve accuracy, distributions over words and inter-word relations across a corpus are used as background knowledge, a type of implicit semantics, which allows the more semantically sensitive parts of texts to be identified. Second, to realize online topic detection without time slices, we introduce a probability gain-based dynamic probabilistic model that detects latent topics by learning a model based on the dynamic human memory cognitive process. These two steps constitute the framework of our model. Experimental results on four public datasets (Reuters-R8, Reuters-R52, WebKB, and Cade12) show that our model scores approximately ten percent higher than the baselines (e.g., KeyGraph and LDA) on the Adjusted Rand Index (ARI).
KEYWORDS: memory cognitive process, probability gain, topic detection
INTRODUCTION
Topic modeling broadly refers to the identification of trends or themes in a curated document collection. Topics may correspond to clusters of similar technologies in patents, 1 controversial events in news, 2 and users' attitudes on Twitter. 3 Patent topics can help researchers quickly analyze the status quo and trends of the technologies concerned, news topics can help people fully understand controversial social events, and Twitter topics can help governments supervise online public opinion.
Many topic detection models have been proposed to help people access the topics in a corpus. Early models for topic detection typically relied on clustering documents. 4 In these models, documents are represented as bags of words with traditional features, including term frequency, distribution over terms, and time features. Based on the bag-of-words representation, many models that compute similarities among documents have been developed for topic detection, such as CLARANS 5 and DBSCAN. 6 The next generation of topic detection models extended the analysis from directly clustering documents to clustering keywords. With the extensive use of the Latent Dirichlet Allocation (LDA) model, 7 the Probabilistic Topic Model (PTM) has attracted considerable attention. 8 Several extended versions of PTM, which treat a topic as a distribution over keywords, have been employed for topic detection. 9 Recent research has addressed relations among keywords 10 because the sole use of keywords will lose a cons...