Large datasets with interactions between objects are common to numerous scientific fields (e.g. social science, the internet, biology). The interactions naturally define a graph, and a common way to explore or summarize such a dataset is graph clustering. Most techniques for clustering graph vertices use only the topology of connections, ignoring the information in the vertex features. In this paper, we provide a clustering algorithm that exploits both types of data, based on a statistical model with latent structure that characterizes each vertex both by a vector of features and by its connectivity. We perform simulations to compare our algorithm with existing approaches, and also evaluate our method on real datasets based on hypertext documents. We find that our algorithm successfully exploits whatever information is found both in the connectivity pattern and in the features.
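The kind of latent-structure model this abstract describes can be illustrated with a small simulation. This is only a sketch under assumptions the abstract does not state: each vertex gets a hidden class, edges are drawn independently with a probability that depends on the two classes, and features are Gaussian around a class mean. The function name `simulate_graph_with_features` and all of its parameters are hypothetical.

```python
import random


def simulate_graph_with_features(n, class_probs, connect_probs,
                                 feature_means, feature_sd=1.0, seed=0):
    """Draw a toy graph whose vertices carry both a latent class and features.

    Assumed (not taken from the paper): Bernoulli edges given the classes of
    the two endpoints, and independent Gaussian features given the class.
    """
    rng = random.Random(seed)
    k = len(class_probs)

    # Hidden class of each vertex.
    labels = rng.choices(range(k), weights=class_probs, k=n)

    # Feature vector of each vertex, centred on the mean of its class.
    features = [[rng.gauss(m, feature_sd) for m in feature_means[z]]
                for z in labels]

    # Undirected edges: the connection probability depends only on the
    # classes of the two endpoints.
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < connect_probs[labels[i]][labels[j]]:
                edges.add((i, j))
    return labels, features, edges
```

A clustering method of the kind described would try to recover `labels` from `features` and `edges` together, rather than from `edges` alone.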
In this paper we adapt online estimation strategies to perform model-based clustering on large networks. Our work focuses on two algorithms, the first based on the SAEM algorithm, and the second on variational methods. These two strategies are compared with existing approaches on simulated and real data. We use the method to decipher the connection structure of the political websphere during the 2008 US political campaign. We show that our online EM-based algorithms offer a good trade-off between precision and speed when estimating parameters for mixture distributions in the context of random graphs.
This paper presents an automated approach for building a metadata hierarchy for a set of web sites without using any predefined external hierarchies, and then for merging and comparing such hierarchies. The nodes of the hierarchy are the keywords of the specified web sites, and the links between these keywords are weak subsumption relationships. We apply this method in the RTGI 1 project [8] to clusters of web sites that have already been defined. The hierarchies show how homogeneous each cluster is and make it possible to outline the contents of each cluster effectively. Moreover, we construct the common hierarchy of multiple clusters to check whether their individual hierarchies are well distinguished and separated within it, which in turn indicates the correctness of the clustering. Finally, we build the Semantic-hypertext graph of the sites, which explains the semantic contents along with the topological structure of the sites.
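The abstract does not define "weak subsumption", so the following is only a sketch under an assumed co-occurrence rule: keyword x is taken to subsume keyword y when at least a fraction `threshold` of the sites containing y also contain x, while y is less predictive of x's sites than the reverse. The function name `weak_subsumption_links`, the input layout, and the threshold value are all hypothetical.

```python
def weak_subsumption_links(site_keywords, threshold=0.8):
    """Return (parent, child) keyword pairs under an assumed subsumption rule.

    site_keywords maps a site name to the set of keywords found on it.
    The rule used here is an illustration, not the paper's definition.
    """
    # Invert the mapping: keyword -> set of sites that mention it.
    sites_with = {}
    for site, keywords in site_keywords.items():
        for kw in keywords:
            sites_with.setdefault(kw, set()).add(site)

    links = []
    for parent, parent_sites in sites_with.items():
        for child, child_sites in sites_with.items():
            if parent == child:
                continue
            overlap = len(parent_sites & child_sites)
            p_parent_given_child = overlap / len(child_sites)
            p_child_given_parent = overlap / len(parent_sites)
            # The parent covers most of the child's sites, but not the reverse.
            if (p_parent_given_child >= threshold
                    and p_child_given_parent < p_parent_given_child):
                links.append((parent, child))
    return sorted(links)


example_sites = {
    "a": {"politics", "election"},
    "b": {"politics", "election"},
    "c": {"politics", "senate"},
    "d": {"politics"},
}
print(weak_subsumption_links(example_sites))
# [('politics', 'election'), ('politics', 'senate')]
```

In this toy input the broad keyword "politics" ends up above the two narrower ones, which is the shape of hierarchy the abstract describes for a homogeneous cluster.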