Incremental clustering is a very useful approach to organizing dynamic text collections. Due to the time and space restrictions of incremental clustering, textual documents must be preprocessed so that only their most important information is maintained. Domain-independent statistical keyword extraction methods are useful in this scenario, since they analyze only the content of each document individually rather than the whole document collection, and are fast and language independent. However, different methods make different assumptions about the properties of keywords in a text, and therefore extract different sets of keywords. The way a textual document is structured for keyword extraction can also change the set of extracted keywords. Furthermore, extracting too few keywords might degrade the incremental clustering quality, while extracting too many might slow down the clustering process. In this article we analyze different ways to structure a textual document for keyword extraction, different domain-independent keyword extraction methods, and the impact of the number of keywords on the incremental clustering quality. We also define a framework for domain-independent statistical keyword extraction that allows the user to set different configurations in each step, tuning the automatic keyword extraction to the user's needs or to some evaluation measure. A thorough experimental evaluation with several textual collections showed that domain-independent statistical keyword extraction methods obtain results competitive with using all terms, or even with selecting terms by analyzing the whole text collection. This is promising evidence in favor of computationally efficient preprocessing methods for text streams or large textual collections.
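The defining property of the extraction methods discussed above is that they score terms using only the content of one document, with no collection-wide statistics. A minimal sketch of that idea, using raw term frequency as the (assumed, simplest possible) scoring function:

```python
from collections import Counter
import re

def extract_keywords(text, k=5):
    """Rank the terms of a single document by raw term frequency.

    A minimal per-document statistical extractor: it looks only at this
    document's own content, so it fits the incremental setting described
    in the abstract. This frequency ranking is an illustrative stand-in;
    the abstract's methods use other per-document statistics.
    """
    # Tokenize into lowercase word tokens; a stopword list would
    # normally be applied here and is the usual language-dependent step.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if len(t) > 2)
    return [term for term, _ in counts.most_common(k)]

doc = ("incremental clustering organizes dynamic text collections; "
       "keyword extraction keeps only the most important terms of each "
       "document so clustering stays fast")
print(extract_keywords(doc, k=3))
```

Because each document is scored in isolation, a new document arriving in the stream can be reduced to its top-`k` terms in a single pass, without revisiting the rest of the collection.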
In many text clustering tasks, there is valuable knowledge about the problem domain in addition to the original textual data involved in the clustering process. Traditional text clustering methods are unable to incorporate such additional (privileged) information into the clustering. Recently, a new paradigm called LUPI (Learning Using Privileged Information) was proposed by Vapnik to incorporate privileged information in classification tasks. In this paper, we extend the LUPI paradigm to text clustering tasks. In particular, we show that the LUPI paradigm is promising for incremental hierarchical text clustering, being very useful for organizing large textual databases. In our method, the privileged information about the text documents is used to refine an initial clustering model by means of consensus clustering, and this initial model is then used for incremental clustering of the remaining text documents. We carried out an experimental evaluation on two benchmark text collections, and the results showed that our method significantly improves clustering accuracy compared to a traditional hierarchical clustering method.
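The consensus step above combines two views of the same initial documents: a partition built from the original text features and one built from the privileged information. A minimal sketch of one possible consensus rule (an assumption for illustration, not the paper's exact refinement procedure): keep two documents together in the refined model only when both partitions agree on them.

```python
from itertools import combinations

def consensus_merge(labels_a, labels_b):
    """Minimal consensus step over two partitions of the same items.

    labels_a: cluster labels from the original text features.
    labels_b: cluster labels from the privileged information.
    Two items share a refined cluster only if BOTH partitions place
    them together. (Illustrative only; the paper refines a
    hierarchical model, which this flat sketch does not reproduce.)
    """
    n = len(labels_a)
    parent = list(range(n))  # union-find forest over the items

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Merge every pair on which both partitions agree.
    for i, j in combinations(range(n), 2):
        if labels_a[i] == labels_a[j] and labels_b[i] == labels_b[j]:
            parent[find(i)] = find(j)

    # Renumber the resulting roots to compact cluster ids.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]

# Text features say {0,1,2} vs {3,4}; privileged info splits item 2 off.
print(consensus_merge([0, 0, 0, 1, 1], [0, 0, 1, 2, 2]))
# → [0, 0, 1, 2, 2]
```

The refined partition obtained this way would serve as the starting model; documents arriving afterwards are assigned incrementally without recomputing the consensus.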