Detecting overlapping communities is essential to analyzing and exploring natural networks such as social networks, biological networks, and citation networks. However, most existing approaches do not scale to the size of networks that we regularly observe in the real world. In this paper, we develop a scalable approach to community detection that discovers overlapping communities in massive real-world networks. Our approach is based on a Bayesian model of networks that allows nodes to participate in multiple communities, and a corresponding algorithm that naturally interleaves subsampling from the network and updating an estimate of its communities. We demonstrate how we can discover the hidden community structure of several real-world networks, including 3.7 million US patents, 575,000 physics articles from the arXiv preprint server, and 875,000 connected Web pages from the Internet. Furthermore, we demonstrate on large simulated networks that our algorithm accurately discovers the true community structure. This paper opens the door to using sophisticated statistical models to analyze massive networks.

Community detection is important for both exploring a network and predicting connections that are not yet observed. For example, by finding the communities in a large citation graph of scientific articles, we can make hypotheses about the fields and subfields that they contain. By finding communities in a large social network, we can more easily make predictions to individual members about who they might be friends with but are not yet connected to.

In this paper, we develop an algorithm that discovers communities in modern real-world networks. The challenge is that real-world networks are massive: they can contain hundreds of thousands or even millions of nodes. We will examine a network of scientific articles that contains 575,000 articles, a network of connected Web pages that contains 875,000 pages, and a network of US patents that contains 3,700,000 patents.
Most approaches to community detection cannot handle data at this scale. There are two fundamental difficulties to detecting communities in such networks. The first is that many existing community detection algorithms assume that each node belongs to a single community (1, 3–7, 14–16). In real-world networks, each node will likely belong to multiple communities, and its connections will reflect these multiple memberships (2, 8–13, 17). For example, in a large social network, a member may be connected to coworkers, friends from school, and neighbors. We need algorithms that discover overlapping communities to capture the heterogeneity of each node's connections.

The second difficulty is that existing algorithms are too slow. Many community detection algorithms iteratively analyze each pair of nodes, regardless of whether the nodes in the pair are connected in the network (5, 6, 10). Consequently, these algorithms run in time quadratic in the number of nodes, which makes analyzing massive networks computationally intractable. Other a...
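The overlapping-membership idea described above can be illustrated with a toy generative sketch in the style of a mixed-membership blockmodel: each node mixes over several communities, and two nodes are likely to link when their memberships overlap. This is a simplified illustration under assumed toy sizes and parameters, not the paper's exact model or inference algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

N, K = 100, 4            # nodes, communities (illustrative toy sizes)
alpha, eps = 0.1, 1e-3   # Dirichlet concentration, background link probability

# Each node holds a membership vector over K overlapping communities.
theta = rng.dirichlet(alpha * np.ones(K), size=N)   # N x K memberships
beta = rng.uniform(0.5, 0.9, size=K)                # within-community link strength

def edge_prob(i, j):
    """Probability that nodes i and j link: high when they share a community."""
    return theta[i] @ (beta * theta[j]) + eps

# Sample an undirected adjacency matrix from the model.
P = theta @ np.diag(beta) @ theta.T + eps
np.fill_diagonal(P, 0.0)
A = rng.random((N, N)) < P
A = np.triu(A, 1)
A = A | A.T   # symmetrize: undirected graph, no self-loops
```

Because a node's connections are explained by several memberships at once, a member linked to coworkers, schoolmates, and neighbors is represented naturally, which a single-community assignment cannot capture.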
A major goal of population genetics is to quantitatively understand variation of genetic polymorphisms among individuals. The aggregated number of genotyped humans is currently on the order of millions of individuals, and existing methods do not scale to data of this size. To solve this problem we developed TeraStructure, an algorithm to fit Bayesian models of genetic variation in structured human populations on tera-sample-sized data sets (10^12 observed genotypes, e.g., 1M individuals at 1M SNPs). TeraStructure is a scalable approach to Bayesian inference in which subsamples of markers are used to update an estimate of the latent population structure between samples. We demonstrate that TeraStructure performs as well as existing methods on current globally sampled data, and we show using simulations that TeraStructure continues to be accurate and is the only method that can scale to tera-sample-sizes.
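The subsampling idea behind this kind of scalable inference can be sketched as a stochastic update loop: rather than sweeping all 10^12 genotypes, each step draws one marker, computes a noisy gradient of the likelihood rescaled to stay unbiased, and nudges the per-individual structure estimates. This is a crude projected-gradient sketch on simulated toy data, with invented sizes and step sizes; TeraStructure's actual variational updates differ.

```python
import numpy as np

rng = np.random.default_rng(1)

N, L, K = 50, 200, 3             # individuals, SNPs, ancestral populations (toy)
G = rng.integers(0, 3, (N, L))   # genotypes in {0, 1, 2} (simulated, not real data)

# Latent structure: theta[i] = admixture proportions; beta[l] = allele freq per pop.
theta = rng.dirichlet(np.ones(K), size=N)
beta = rng.uniform(0.05, 0.95, (L, K))

def noisy_update(step):
    """One stochastic step: subsample a single SNP, move theta toward values
    that explain it, scaling by L so the sampled gradient is unbiased."""
    global theta
    l = rng.integers(L)                 # subsampled marker
    p = np.clip(theta @ beta[l], 1e-6, 1 - 1e-6)   # per-individual allele prob
    # Gradient of the binomial(2, p) log-likelihood for SNP l w.r.t. theta.
    grad = ((G[:, l] / p - (2 - G[:, l]) / (1 - p))[:, None] * beta[l]) * L
    new = np.clip(theta + step * grad, 1e-8, None)
    theta = new / new.sum(axis=1, keepdims=True)   # project back onto the simplex

for t in range(500):
    noisy_update(step=1e-4 / (1 + t) ** 0.5)
```

The per-step cost depends on one marker rather than all of them, which is what lets this style of inference reach tera-sample-sized data.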
We develop hierarchical Poisson matrix factorization (HPF) for recommendation. HPF models sparse user behavior data: large user/item matrices where each user has provided feedback on only a small subset of items. HPF handles both explicit ratings, such as a number of stars, and implicit ratings, such as views, clicks, or purchases. We develop a variational algorithm for approximate posterior inference that scales up to massive data sets, and we demonstrate its performance on a wide variety of real-world recommendation problems: users rating movies, users listening to songs, users reading scientific papers, and users reading news articles. Our study reveals that hierarchical Poisson factorization definitively outperforms previous methods, including nonnegative matrix factorization, topic models, and probabilistic matrix factorization techniques.
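The core of Poisson factorization can be sketched generatively: gamma-distributed user preferences and item attributes combine through a dot product to give the rate of a Poisson-distributed count (a rating, click, or play). The sketch below is a non-hierarchical simplification with invented toy sizes and hyperparameters; full HPF additionally places gamma priors on the rate parameters and fits the posterior variationally.

```python
import numpy as np

rng = np.random.default_rng(2)

U, I, K = 30, 40, 5       # users, items, latent factors (illustrative toy sizes)
a, b = 0.3, 0.3           # assumed Gamma shape and rate hyperparameters

# Gamma priors encourage sparse, nonnegative latent representations.
theta = rng.gamma(a, 1.0 / b, size=(U, K))   # user preferences
beta = rng.gamma(a, 1.0 / b, size=(I, K))    # item attributes

# Observed feedback counts are Poisson in the user-item dot product.
rate = theta @ beta.T
Y = rng.poisson(rate)

def recommend(u):
    """Rank items for user u by expected Poisson rate, highest first."""
    return np.argsort(-rate[u])
```

Because the Poisson likelihood assigns zero counts near-zero cost, the model only pays for the items a user actually interacted with, which is what makes it a good fit for the sparse matrices the abstract describes.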
Online services are typically replicated on multiple servers in different datacenters, and have (at best) a loose association with specific end-hosts or locations. To meet the needs of these online services, we introduce SCAFFOLD, an architecture that provides flow-based anycast with (possibly moving) service instances. SCAFFOLD allows addresses to change as end-points move, in order to retain the scalability advantages of hierarchical addressing. Successive refinement in resolving service names limits the scope of churn to ensure scalability, while in-band signaling of new addresses supports seamless communication as end-points move. We design, build, and evaluate a SCAFFOLD prototype that includes an end-host network stack (built as extensions to Linux and the BSD socket API) and a network infrastructure (built on top of OpenFlow and NOX). We demonstrate several applications, including a cluster of web servers, partitioned memcached servers, and migrating virtual machines, running on SCAFFOLD.

Dynamism. Modern services operate in a dynamic environment, where a replica may fail, undergo maintenance, migrate to a new location, seek to offload work, or be powered down to save energy; new replicas may be added to handle extra load or tolerate faults. This dynamism stretches across many levels of granularity, from connections, to virtual machines and physical hosts, to entire datacenters. Rather than hosts retaining their addresses as they move, SCAFFOLD allows end-point addresses to change dynamically. This allows networks to apply whatever hierarchical addressing scheme they wish for more scalable routing, and enables hosts to migrate across layer-two boundaries.

Principle: The network addresses associated with a service should be able to change over time as service instances fail, recover, or move.

When an end-point moves, SCAFFOLD performs in-band signaling to update the remote end-points of established flows.
When a service instance fails, recovers, or moves, the network automatically directs new requests to the new location. In contrast, today's network cannot easily allow end-point addresses to change because these addresses are exposed to (and cached by) applications.
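The idea of successive refinement in name resolution can be sketched as a two-step lookup: a service name first resolves to a coarse, rarely-changing set of sites, and only then to a live instance within a site, so local churn (a replica failing or moving) updates only the fine-grained table. Every name, address, and table in this sketch is invented for illustration; SCAFFOLD's actual resolution protocol and data structures differ.

```python
# Coarse level: service name -> candidate sites (changes rarely).
service_to_sites = {"web": ["dc-east", "dc-west"]}

# Fine level: live instances per site (churns locally as replicas fail,
# move, or are added; updates never need to propagate globally).
site_instances = {
    "dc-east": ["10.1.0.4", "10.1.0.7"],
    "dc-west": ["10.2.0.9"],
}

def resolve(service, pick=min):
    """Resolve a service name to one instance address by successive refinement:
    first choose a site, then an instance within it (anycast-style)."""
    for site in service_to_sites[service]:
        instances = site_instances.get(site)
        if instances:
            return pick(instances)
    raise LookupError(f"no live instance for service {service!r}")
```

If every replica in one site fails, new requests fall through to the next site without any change to the coarse table, mirroring the abstract's point that scoping churn is what keeps resolution scalable.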