We propose an interactive text document method, which is based on term labeling. The algorithm asks the user to cluster the top keyterms associated with document clusters iteratively. The keyterm clusters are used to guide the clustering method. Rather than using standard clustering algorithms, we propose a new text clusterer using term clusters. Terms that exist in a document corpus are clustered. Using a greedy approach, the term clusters are distilled in order to remove non-discriminative general terms. We then present a heuristic approach to extract seed documents associated with each distilled term cluster. These seeds are finally used to cluster all documents. We compared our interactive term labeling to a baseline interactive term selection algorithm on some real standard text datasets. The experiments show that with a comparable amount of user effort, our term labeling is more effective than the baseline term selection method.
The keyterm-based approach is arguably intuitive for users to direct text-clustering processes and adapt results to various applications in text analysis. Its way of markedly influencing the results, for instance, by expressing important terms in relevance order, requires little knowledge of the algorithm and has predictable effect, speeding up the task. This article first presents a text-clustering algorithm that can easily be extended into an interactive algorithm. We evaluate its performance against state-of-the-art clustering algorithms in unsupervised mode. Next, we propose three interactive versions of the algorithm based on keyterm labeling, document labeling, and hybrid labeling. We then demonstrate that keyterm labeling is more effective than document labeling in text clustering. Finally, we propose a visual approach to support the keyterm-based version of the algorithm. Visualizations are provided for the whole collection as well as for detailed views of document and cluster relationships. We show the effectiveness and flexibility of our framework,
Vis-Kt
, by presenting typical clustering cases on real text document collections. A user study is also reported that reveals overwhelmingly positive acceptance toward keyterm-based clustering.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.