Agglomerative Method for Texts Clustering

Orekhov, Andrey V.

doi:10.1007/978-3-030-17705-8_2

Cited by 6 publications

(5 citation statements)

References 11 publications


“…One of the main problems at the cluster analysis is the determination of the preferred number of clusters. Finding the moment of completion of the process itself is associated with the solution of this issue [31,32]. Usually, the decision on the number of clusters is made during the process of clustering, but sometimes before it starts (for example, when using the k-means method) [2,3].…”

Section: Analytical Generalization Of the "Elbow Method" Heuristic (mentioning)

confidence: 99%

Quasi-Deterministic Processes with Monotonic Trajectories and Unsupervised Machine Learning

Orekhov

2021

Mathematics

Self Cite


This paper aims to consider approximation-estimation tests for decision-making by machine-learning methods, and integral-estimation tests are defined, which is a generalization for the continuous case. Approximation-estimation tests are measurable sampling functions (statistics) that estimate the approximation error of monotonically increasing number sequences in different classes of functions. These tests make it possible to determine the Markov moments of a qualitative change in the increase in such sequences, from linear to nonlinear type. If these sequences are trajectories of discrete quasi-deterministic random processes, then moments of change in the nature of their growth and qualitative change in the process match up. For example, in cluster analysis, approximation-estimation tests are a formal generalization of the “elbow method” heuristic. In solid mechanics, they can be used to determine the proportionality limit for the stress strain curve (boundaries of application of Hooke’s law). In molecular biology methods, approximation-estimation tests make it possible to determine the beginning of the exponential phase and the transition to the plateau phase for the curves of fluorescence accumulation of the real-time polymerase chain reaction, etc.


“…The clustering process splits the X into pairwise disjoint subsets $X_h$ called clusters: $X = \bigcup_{h=1}^{m} X_h$, where for $\forall h, l \mid 1 \le h, l \le m : X_h \cap X_l = \emptyset$. Therefore, the map A defines an equivalence relation on X [15,16]. If an equivalence relation is given on some set X, not all set X can be considered, but only one element from each equivalence class.…”

Section: Approximation-estimation Test (mentioning)

confidence: 99%

“…It is the moment when the clustering process is complete. At such a moment, the values of the set of minimum distances are more accurately approximated by an incomplete quadratic parabola (without a linear term), rather than the direct line [15,16]. Within this approach, the iteration of the agglomerative process of clustering, at which there is a change in the nature of the increase in the function of minimum distances from linear to parabolic, is defined as the Markov stopping moment.…”

Section: Approximation-estimation Test (mentioning)

confidence: 99%

“…The clustering process is completed using the parabolic approximation-estimation test described above, which estimates the jumps of a monotonically increasing sequence of "trend set" values. The magnitude of the significant jump sufficient to stop the process depends on the sensitivity of the stopping criterion, which is set using the non-negative coefficient q [15,16]. The higher the value of q, the lower the criterion's sensitivity for stopping the clustering process.…”

Section: Clustering Stability and Determining The Preferred Number Of... (mentioning)

confidence: 99%
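
The statements quoted above describe the stopping rule only in words, so a minimal sketch may help. It is not the authors' implementation. It assumes a sliding window of four consecutive minimum distances, a plain numerical least-squares fit in place of any closed-form statistic, and a "trend set" formed by adding `q * i` to the i-th distance. The names `fit_errors`, `stopping_step`, `window`, and `tol` are hypothetical.

```python
import numpy as np


def fit_errors(y):
    """Least-squares errors of a straight line and of an incomplete parabola."""
    y = np.asarray(y, dtype=float)
    x = np.arange(len(y), dtype=float)
    designs = (
        np.column_stack([x, np.ones_like(x)]),       # a*x + b
        np.column_stack([x ** 2, np.ones_like(x)]),  # c*x**2 + d (no linear term)
    )
    errors = []
    for design in designs:
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residual = y - design @ coef
        errors.append(float(residual @ residual))
    return errors[0], errors[1]


def stopping_step(min_distances, q=0.0, window=4, tol=1e-9):
    """Index of the first step where the parabola fits better than the line.

    Assumption: the trend set is min_distances[i] + q * i, so a larger q
    makes the sequence look more linear and the rule less sensitive.
    Returns None if no such step is found.
    """
    d = np.asarray(min_distances, dtype=float)
    trend = d + q * np.arange(len(d))
    for end in range(window, len(trend) + 1):
        linear_err, parabolic_err = fit_errors(trend[end - window:end])
        if linear_err - parabolic_err > tol:
            return end - 1
    return None


# Toy sequence of merge distances: linear growth, then one sharp jump.
distances = [1, 2, 3, 4, 5, 6, 20]
print(stopping_step(distances, q=0.0))   # jump detected at the last step
print(stopping_step(distances, q=20.0))  # large q suppresses the detection
```

On this toy input the rule fires at the final merge when `q = 0` and does not fire at all when `q = 20`, which matches the quoted remark that a higher q lowers the sensitivity of the stopping criterion.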

“…Clustering with Markov stopping time allows for automation of the procedure for determining the number of clusters in the text corpus. Based on the analysis of numerical experiments and general considerations, the following hypothesis was formulated earlier: "Preferred a number of clusters is formed at $q \in Q_{e-2}$" [16]. The main motive for formulating this hypothesis was that a chain effect is manifested in the interval of stable clustering $Q_{e-1}$, at which already formed clusters are combined.…”

Section: Clustering Stability and Determining The Preferred Number Of... (mentioning)

confidence: 99%


Topic Detection Based on Sentence Embeddings and Agglomerative Clustering with Markov Moment

et al. 2020

Self Cite


The paper is dedicated to solving the problem of optimal text classification in the area of automated detection of typology of texts. In conventional approaches to topicality-based text classification (including topic modeling), the number of clusters is to be set up by the scholar, and the optimal number of clusters, as well as the quality of the model that designates proximity of texts to each other, remain unresolved questions. We propose a novel approach to the automated definition of the optimal number of clusters that also incorporates an assessment of word proximity of texts, combined with text encoding model that is based on the system of sentence embeddings. Our approach combines Universal Sentence Encoder (USE) data pre-processing, agglomerative hierarchical clustering by Ward’s method, and the Markov stopping moment for optimal clustering. The preferred number of clusters is determined based on the “e-2” hypothesis. We set up an experiment on two datasets of real-world labeled data: News20 and BBC. The proposed model is tested against more traditional text representation methods, like bag-of-words and word2vec, to show that it provides a much better-resulting quality than the baseline DBSCAN and OPTICS models with different encoding methods. We use three quality metrics to demonstrate that clustering quality does not drop when the number of clusters grows. Thus, we get close to the convergence of text clustering and text classification.
