Clustering is a fundamental analysis tool that aims to classify data points into groups based on their similarity or distance. It has found successful applications across the natural and social sciences, including biology, physics, economics, chemistry, astronomy, and psychology. Among the many existing algorithms, hierarchical clustering algorithms are particularly attractive because they provide results at multiple resolutions without a predetermined number of clusters and reveal the organization of the resulting clusters. At the same time, they suffer from a variety of drawbacks and tend to be either time-consuming or inaccurate. We propose a novel hierarchical clustering approach based on the simple hypothesis that two reciprocal nearest data points should be grouped into the same cluster. Extensive tests on data sets across multiple domains show that our method is much faster and more accurate than state-of-the-art benchmarks. We further extend our method to the community detection problem in real networks, achieving markedly better results than the well-known Girvan-Newman algorithm.
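To make the reciprocal-nearest-neighbor hypothesis concrete, the sketch below merges points (or cluster centroids) that are each other's nearest neighbor and repeats the merging to build a hierarchy. The function names and the centroid-based agglomeration are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def mutual_nearest_pairs(points):
    """Return index pairs (i, j) whose members are each other's nearest neighbor."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)                       # nearest neighbor of each point
    return [(i, j) for i, j in enumerate(nn) if nn[j] == i and i < j]

def rnn_hierarchy(points, levels=3):
    """Toy agglomeration: merge mutual nearest neighbors, then recurse on centroids."""
    hierarchy, current = [], [np.array([p]) for p in points]
    for _ in range(levels):
        centers = np.array([c.mean(axis=0) for c in current])
        pairs = mutual_nearest_pairs(centers)
        if not pairs:
            break
        merged, used = [], set()
        for i, j in pairs:
            merged.append(np.vstack([current[i], current[j]]))
            used.update((i, j))
        merged += [c for k, c in enumerate(current) if k not in used]
        hierarchy.append(merged)
        current = merged
    return hierarchy

# Two well-separated Gaussian blobs: the cluster count shrinks level by level.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
print([len(level) for level in rnn_hierarchy(data)])
```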
The past decade has witnessed the increasing popularity of recommender systems, which help users acquire relevant knowledge, commodities, and services from the overwhelming ocean of information on the Internet. Latent Dirichlet Allocation (LDA), originally presented as a graphical model for topic discovery in text, has since found applications in many other disciplines. In this paper, we propose an LDA-inspired probabilistic recommendation method that models user-item collecting behavior as a two-step process: each user first joins a latent user group with a certain probability, and each user group then collects items with different probabilities. Gibbs sampling is employed to approximate all the probabilities in this two-step process. Experimental results on three real-world data sets (MovieLens, Netflix, and Last.fm) show that our method is competitive in precision, coverage, and diversity with four other typical recommendation methods. Moreover, we present an approximation strategy that reduces the computational complexity of our method with only a slight degradation in performance.
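The two-step process is analogous to LDA with users as documents and items as words, so an item can be scored for a user as the sum over groups of P(group | user) * P(item | group). The sketch below runs a compact collapsed Gibbs sampler for such a mixture on toy (user, item) events; the function name, hyperparameters, and simplified sampling scheme are assumptions for illustration rather than the paper's implementation.

```python
import numpy as np

def gibbs_recommend(interactions, n_users, n_items, n_groups=5,
                    alpha=0.1, beta=0.1, iters=200, seed=0):
    """Collapsed Gibbs sampling for a user -> group -> item mixture."""
    rng = np.random.default_rng(seed)
    z = rng.integers(n_groups, size=len(interactions))    # group label per event
    n_ug = np.zeros((n_users, n_groups)) + alpha           # user-group counts + prior
    n_gi = np.zeros((n_groups, n_items)) + beta            # group-item counts + prior
    n_g = np.zeros(n_groups) + n_items * beta
    for k, (u, i) in enumerate(interactions):
        n_ug[u, z[k]] += 1; n_gi[z[k], i] += 1; n_g[z[k]] += 1
    for _ in range(iters):
        for k, (u, i) in enumerate(interactions):
            g = z[k]
            n_ug[u, g] -= 1; n_gi[g, i] -= 1; n_g[g] -= 1
            p = n_ug[u] * n_gi[:, i] / n_g                  # unnormalized P(group | ...)
            g = rng.choice(n_groups, p=p / p.sum())
            z[k] = g
            n_ug[u, g] += 1; n_gi[g, i] += 1; n_g[g] += 1
    theta = n_ug / n_ug.sum(axis=1, keepdims=True)          # P(group | user)
    phi = n_gi / n_gi.sum(axis=1, keepdims=True)            # P(item | group)
    return theta @ phi                                       # P(item | user) scores

# Toy usage: (user, item) collection events; recommend by ranking item scores.
events = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 3), (2, 4), (3, 4), (3, 3)]
scores = gibbs_recommend(events, n_users=4, n_items=5, n_groups=2)
print(scores[0].argsort()[::-1])   # items ranked for user 0
```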
With the advent of the big data era, social media plays an important role in many areas such as security and finance, and researchers are paying increasing attention to mining users' interests from social media data. We propose a three-layer model (TLM) based on keyword extraction to mine users' interests, consisting of candidate word extraction, semantic structure analysis, and interest word ranking. The TLM focuses on both the self-importance and the semantic importance of interest words. It also accounts for interest drift in order to track users' long-term and short-term interests. Experiments on 10 SINA Weibo datasets show that the TLM outperforms existing methods in identifying users' interests in terms of hit rate.
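As a rough illustration of such a three-stage pipeline, the sketch below extracts candidate words, combines a time-decayed frequency (self-importance) with a co-occurrence signal (semantic importance), and ranks the top interests. The specific weighting, decay, and stopword list are illustrative assumptions, not the TLM's actual formulas.

```python
from collections import Counter
import math, re

def mine_interests(posts, top_k=5, decay=0.9):
    """Candidate extraction -> semantic scoring -> ranking, with recency decay."""
    stop = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "on", "for"}
    tokenized = [[w for w in re.findall(r"[a-z]+", p.lower()) if w not in stop]
                 for p in posts]
    # Stage 1: candidate words with time-decayed frequency (interest drift).
    freq = Counter()
    for age, words in enumerate(reversed(tokenized)):       # newest post has age 0
        for w in words:
            freq[w] += decay ** age
    # Stage 2: semantic importance from how many distinct candidates co-occur.
    cooc = Counter()
    for words in tokenized:
        uniq = set(words)
        for w in uniq:
            cooc[w] += len(uniq) - 1
    # Stage 3: rank by combining both signals.
    score = {w: freq[w] * math.log(2 + cooc[w]) for w in freq}
    return sorted(score, key=score.get, reverse=True)[:top_k]

posts = ["deep learning for image recognition",
         "new camera lens for travel photography",
         "travel photography tips and camera settings"]
print(mine_interests(posts))
```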
One of the main challenges for hierarchical clustering is how to appropriately identify representative points at the lower level of the cluster tree, which serve as the roots at the higher level for further aggregation. Conventional hierarchical clustering approaches rely on simple heuristics to select "representative" points, which may not be sufficiently representative. As a result, the constructed cluster tree suffers from poor robustness and weak reliability. To address this issue, we propose a novel hierarchical clustering algorithm that, while building the clustering dendrogram, effectively detects representative points by scoring the reciprocal nearest data points in each sub-minimum-spanning-tree. Extensive experiments on UCI datasets show that the proposed algorithm is more accurate than other benchmarks. Our analysis further shows that it has O(n log n) time complexity and O(log n) space complexity, indicating that it scales to massive data with low time and storage consumption.
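The sketch below illustrates the sub-minimum-spanning-tree idea: build an MST over the data, cut its longest edges to obtain sub-trees, and pick one representative per sub-tree. Here the representative is simply the member with the smallest mean distance to its sub-tree, a stand-in for the reciprocal-nearest-neighbor scoring described above; the function name and the edge-cutting rule are assumptions for illustration.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def representatives_from_mst(points, n_subtrees=3):
    """Split the MST into sub-trees and pick one representative point per sub-tree."""
    d = squareform(pdist(points))
    mst = minimum_spanning_tree(d).toarray()
    # Cut the (n_subtrees - 1) longest MST edges to split it into sub-trees.
    cut = np.sort(mst[mst > 0])[-(n_subtrees - 1):] if n_subtrees > 1 else []
    pruned = np.where(np.isin(mst, cut), 0, mst)
    _, labels = connected_components(pruned, directed=False)
    reps = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        # Representative: member with the smallest mean distance to its sub-tree.
        reps.append(idx[d[np.ix_(idx, idx)].mean(axis=1).argmin()])
    return labels, reps

# Three Gaussian blobs along the diagonal; expect one representative per blob.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(c, 0.2, (15, 2)) for c in (0, 2, 4)])
labels, reps = representatives_from_mst(data, n_subtrees=3)
print(reps)
```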