Towards subjectifying text clustering

Dasgupta, Sajib; Ng, Vincent

doi:10.1145/1835449.1835530

Cited by 7 publications

(6 citation statements)

References 17 publications


“…Filter approaches [3] that do not assume specific form to the classifier but use a specific criterion to judge the relevance of individual features or feature subsets. The simplest approach is to rank features by a selection criterion (for instance, mutual information, which has been shown to perform well for document classification [28], weighted likelihood ratio [53], or other heuristics [20]) and select the top-ranked subset. Joint filter approaches that consider the dependency and (possible redundancy) among features [54].…”

Section: Feature Selection (mentioning)

confidence: 99%

“…The reviews cover 5 topics: movies, books, dvds, electronics, and kitchen; for each topic there are 2000 reviews with 1000 positive sentiment and 1000 negative sentiment reviews. Like previous usage of this dataset [53], we explore both topic and sentiment classification within a given topic to compare the feature selection algorithms, but for clustering we ignore the sentiment and only consider mixtures of different topics.…”

Section: Datasets (mentioning)

confidence: 99%

“…We compare a range of feature selection algorithms with and without positivity constraints. These include ranking methods that simply select top-ranked features for different criteria: weighted log-likelihood ratio (WLLR) [53] (which is only defined for selecting positively correlated features), mutual information (MI), and the chi-squared statistic (CHI²); and forward-selection algorithms using information theoretic criteria [5]: JMI [57], [58], MRMR [56], and CMIM [55]. To estimate these quantities, co-occurrence statistics are computed after the input feature vectors are transformed to binary vectors (removing any information about counts, weightings, and instance normalization).…”

Section: Comparison Of Feature Selection Performance (mentioning)

confidence: 99%


Self-Tuned Descriptive Document Clustering Using a Predictive Network

Brockmeier, Ananiadou, et al. 2018. IEEE Trans. Knowl. Data Eng.

Descriptive clustering consists of automatically organizing data instances into clusters and generating a descriptive summary for each cluster. The description should inform a user about the contents of each cluster without further examination of the specific instances, enabling a user to rapidly scan for relevant clusters. Selection of descriptions often relies on heuristic criteria. We model descriptive clustering as an auto-encoder network that predicts features from cluster assignments and predicts cluster assignments from a subset of features. The subset of features used for predicting a cluster serves as its description. For text documents, the occurrence or count of words, phrases, or other attributes provides a sparse feature representation with interpretable feature labels. In the proposed network, cluster predictions are made using logistic regression models, and feature predictions rely on logistic or multinomial regression models. Optimizing these models leads to a completely self-tuned descriptive clustering approach that automatically selects the number of clusters and the number of features for each cluster. We applied the methodology to a variety of short text documents and showed that the selected clustering, as evidenced by the selected feature subsets, are associated with a meaningful topical organization.
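The citation statements quoted above from this paper describe ranking features by a criterion such as mutual information, with co-occurrence statistics computed after the feature vectors are reduced to binary presence/absence values. A minimal sketch of that kind of ranking follows. The function names `mutual_information` and `rank_features` and the toy data are hypothetical, and this is an illustration under those assumptions, not the cited authors' implementation.

```python
import math


def mutual_information(x, y):
    """Mutual information (in nats) between two equal-length binary sequences."""
    n = len(x)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            # Co-occurrence count for the value pair (a, b).
            n_ab = sum(1 for xi, yi in zip(x, y) if xi == a and yi == b)
            if n_ab == 0:
                continue  # an empty cell contributes nothing to the sum
            n_a = sum(1 for xi in x if xi == a)
            n_b = sum(1 for yi in y if yi == b)
            mi += (n_ab / n) * math.log(n_ab * n / (n_a * n_b))
    return mi


def rank_features(count_rows, labels, top_k):
    """Binarize count features, score each against binary labels, return the top_k indices."""
    # Discard counts and weightings: keep only presence/absence.
    binary = [[1 if value > 0 else 0 for value in row] for row in count_rows]
    n_features = len(binary[0])
    scores = [
        mutual_information([row[j] for row in binary], labels)
        for j in range(n_features)
    ]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order[:top_k], scores


# Hypothetical toy data: 4 documents, 3 term-count features, binary labels.
counts = [[3, 0, 1], [2, 0, 0], [0, 1, 1], [0, 4, 0]]
labels = [1, 1, 0, 0]
top, scores = rank_features(counts, labels, top_k=2)
```

Mutual information scores a feature that perfectly predicts the label and one that perfectly predicts its absence equally, so a ranking like this does not by itself separate positively from negatively correlated features.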


“…There has also been work in alternative clustering where the system constructs multiple clusterings and allows the SME to select between them [8]. Multiple clusterings can be constructed in many ways, for example by re-weighting features or changing the objective functions; however, such approaches by design require well-defined features, which may not always be attainable.…”

Section: Related Work (mentioning)

confidence: 99%

A Method to Accelerate Human in the Loop Clustering

Coden, Danilevsky, Gruhl, et al. 2017. Proceedings of the 2017 SIAM International Conference on Data Mining

Data analysis tasks often require grouping of information to identify trends and associations. However, as the number of elements rises to the hundreds and thousands, the cost of having a person perform the groupings unassisted quickly becomes prohibitive. Previous approaches have combined traditional clustering techniques with manual interaction steps, yielding human-in-the-loop clustering algorithms that incorporate user feedback by reweighting features or adjusting a similarity function. But in the real world, many grouping tasks lack both a feature set and a well-defined (dis)similarity metric, having only a subject matter expert with an implicit understanding of the correct relationships between elements based on the domain and the task at hand. We present a refine-and-lock clustering interaction model and demonstrate its effectiveness for cognitive-assisted human clustering over other interaction models such as split/merge and must-link/can't-link. Our approach offers effective automatic clustering assistance even in the absence of clear features or a definitive similarity metric; ensures that every cluster has final user approval; and exhibits at least a 3.94x improvement over other interactive clustering approaches in time to completion.


“…A text document clustering algorithm is proposed in [8], which is capable of producing multiple clusterings of the same data based on different points of view. Following a spectral clustering algorithm [21], a Laplacian matrix is generated using the cosine similarity among documents.…”

Section: Related Work (mentioning)

confidence: 99%

Interactive text document clustering using feature labeling

Nourashrafeddin, Milios, Arnold. 2013. Proceedings of the 2013 ACM Symposium on Document Engineering

We propose an interactive text document method, which is based on term labeling. The algorithm asks the user to cluster the top keyterms associated with document clusters iteratively. The keyterm clusters are used to guide the clustering method. Rather than using standard clustering algorithms, we propose a new text clusterer using term clusters. Terms that exist in a document corpus are clustered. Using a greedy approach, the term clusters are distilled in order to remove non-discriminative general terms. We then present a heuristic approach to extract seed documents associated with each distilled term cluster. These seeds are finally used to cluster all documents. We compared our interactive term labeling to a baseline interactive term selection algorithm on some real standard text datasets. The experiments show that with a comparable amount of user effort, our term labeling is more effective than the baseline term selection method.
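The citation statement quoted above from this paper mentions generating a Laplacian matrix from the cosine similarity among documents. The statement does not say which Laplacian is used, so the sketch below assumes the symmetric normalized form L = I - D^(-1/2) W D^(-1/2), with W the cosine-similarity matrix (zero diagonal) and D the diagonal matrix of row sums. The function names and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length term vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def normalized_laplacian(docs):
    """Symmetric normalized Laplacian built from pairwise cosine similarities."""
    n = len(docs)
    # Affinity matrix W with no self-loops.
    w = [
        [cosine_similarity(docs[i], docs[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    degree = [sum(row) for row in w]
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Isolated documents (degree 0) get an all-zero row.
                lap[i][j] = 1.0 if degree[i] > 0 else 0.0
            elif degree[i] > 0 and degree[j] > 0:
                lap[i][j] = -w[i][j] / math.sqrt(degree[i] * degree[j])
    return lap


# Hypothetical toy term-count vectors: the third document shares no terms with the others.
docs = [[2, 0, 1], [1, 0, 1], [0, 3, 0]]
laplacian = normalized_laplacian(docs)
```

In a spectral clustering pipeline, the eigenvectors of such a matrix would then be used to embed the documents before grouping them; that step is omitted here.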
