We present asymptotic and finite-sample results on the use of stochastic blockmodels for the analysis of network data. We show that the fraction of misclassified network nodes converges in probability to zero under maximum likelihood fitting when the number of classes is allowed to grow as the root of the network size and the average network degree grows at least poly-logarithmically in this size. We also establish finite-sample confidence bounds on maximum-likelihood blockmodel parameter estimates from data comprising independent Bernoulli random variates; these results hold uniformly over class assignment. We provide simulations verifying the conditions sufficient for our results, and conclude by fitting a logit parameterization of a stochastic blockmodel with covariates to a network data example comprising self-reported school friendships, resulting in block estimates that reveal residual structure.
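To make the blockmodel likelihood in this abstract concrete, here is a minimal Python sketch of profile maximum-likelihood scoring of a candidate class assignment under a Bernoulli stochastic blockmodel: block edge probabilities are replaced by their block-wise edge densities, and a better assignment yields a higher profile log-likelihood. This is only an illustration under assumed toy parameters, not the paper's fitting procedure or simulation design; the function name `profile_loglik` is ours.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class blockmodel; class sizes and edge probabilities are illustrative only.
n, K = 200, 2
z_true = rng.integers(0, K, size=n)
P = np.array([[0.10, 0.02],
              [0.02, 0.08]])
U = rng.random((n, n))
A = np.triu((U < P[np.ix_(z_true, z_true)]).astype(int), 1)
A = A + A.T  # symmetric adjacency matrix, no self-loops

def profile_loglik(A, z, K):
    """Bernoulli blockmodel log-likelihood with block probabilities replaced
    by their maximum-likelihood estimates (block-wise edge densities)."""
    ll = 0.0
    for a in range(K):
        for b in range(a, K):
            ia, ib = np.where(z == a)[0], np.where(z == b)[0]
            if a == b:
                pairs = len(ia) * (len(ia) - 1) // 2
                edges = A[np.ix_(ia, ia)].sum() // 2
            else:
                pairs = len(ia) * len(ib)
                edges = A[np.ix_(ia, ib)].sum()
            if pairs == 0:
                continue
            p_hat = edges / pairs
            if 0 < p_hat < 1:
                ll += edges * np.log(p_hat) + (pairs - edges) * np.log1p(-p_hat)
    return ll

# The true assignment should score higher than a random relabelling.
z_random = rng.integers(0, K, size=n)
print(profile_loglik(A, z_true, K), profile_loglik(A, z_random, K))
```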
Summary. Network data often take the form of repeated interactions between senders and receivers tabulated over time. A primary question to ask of such data is which traits and behaviors are predictive of interaction. To answer this question, a model is introduced for treating directed interactions as a multivariate point process: a Cox multiplicative intensity model using covariates that depend on the history of the process. Consistency and asymptotic normality are proved for the resulting partial-likelihood-based estimators under suitable regularity conditions, and an efficient fitting procedure is described. Multicast interactions, those involving a single sender but multiple receivers, are treated explicitly. The resulting inferential framework is then employed to model message sending behavior in a corporate e-mail network. The analysis gives a precise quantification of which static shared traits and dynamic network effects are predictive of message recipient selection.
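As a schematic illustration of the multiplicative-intensity idea, the sketch below computes the partial-log-likelihood contribution of a single directed event when the receiver-selection probability takes a multinomial-logit form (the baseline intensity cancels). The candidate set, covariate layout, and the treatment of multicast events as independent per-receiver terms are simplifying assumptions for illustration, not the paper's exact specification.

```python
import numpy as np

def event_partial_loglik(beta, x_sender, receivers):
    """Partial log-likelihood contribution of one directed event.

    x_sender : (n_candidates, p) array of history-dependent covariates
               x_ij(t) for each candidate receiver j of sender i at time t.
    receivers : indices of the receivers actually chosen (multicast allowed).

    Under an intensity lambda_ij(t) = lambda_i0(t) * exp(beta' x_ij(t)),
    the baseline cancels and each chosen receiver contributes a
    multinomial-logit term relative to the candidate set.
    """
    scores = x_sender @ beta
    log_norm = scores.max() + np.log(np.exp(scores - scores.max()).sum())
    return float(sum(scores[j] - log_norm for j in receivers))

# Toy usage: 50 candidate receivers, 3 covariates, a two-receiver (multicast) event.
rng = np.random.default_rng(0)
beta = np.array([0.8, -0.5, 0.2])
x = rng.standard_normal((50, 3))  # e.g. shared traits, past-contact counts
print(event_partial_loglik(beta, x, receivers=[4, 17]))
```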
In this paper we introduce the network histogram, a statistical summary of network interactions to be used as a tool for exploratory data analysis. A network histogram is obtained by fitting a stochastic blockmodel to a single observation of a network dataset. Blocks of edges play the role of histogram bins, and community sizes that of histogram bandwidths or bin sizes. Just as standard histograms allow for varying bandwidths, different blockmodel estimates can all be considered valid representations of an underlying probability model, subject to bandwidth constraints. Here we provide methods for automatic bandwidth selection, by which the network histogram approximates the generating mechanism that gives rise to exchangeable random graphs. This makes the blockmodel a universal network representation for unlabeled graphs. With this insight, we discuss the interpretation of network communities in light of the fact that many different community assignments can all give an equally valid representation of such a network. To demonstrate the fidelity-versus-interpretability tradeoff inherent in considering different numbers and sizes of communities, we analyze two publicly available networks (political weblogs and student friendships) and discuss how to interpret the network histogram when additional information related to node and edge labeling is present.

The purpose of this paper is to introduce the network histogram: a nonparametric statistical summary obtained by fitting a stochastic blockmodel to a single observation of a network dataset. A key point of our construction is that it is not necessary to assume the data to have been generated by a blockmodel. This is crucial, because networks provide a general means of describing relationships between objects. Given n objects under study, a total of n(n - 1)/2 pairwise relationships are possible. When only a small fraction of these relationships are present, as is often the case in modern high-dimensional data analysis across scientific fields, a network representation simplifies our understanding of this dependency structure. One fundamental characterization of a network comes through the identification of community structure (1), corresponding to groups of nodes that exhibit similar connectivity patterns. The canonical statistical model in this setting is the stochastic blockmodel (2): It posits that the probability of an edge between any two network nodes depends only on the community groupings to which those nodes belong. Grouping nodes together in this way serves as a natural form of dimensionality reduction: As n grows large, we cannot retain an arbitrarily complex view of all possible pairwise relationships. Describing how the full set of n objects interrelate is then reduced to understanding the interactions of k ≪ n communities. Studying the properties of fitted blockmodels is thus important (3, 4). Despite the popularity of the blockmodel, and its clear utility, scientists have observed that it often fails to describe all of the structure present in a network (5-8).
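A minimal sketch of the network-histogram idea follows: group the nodes (here by a crude spectral clustering, only one of several possible fitting methods) and read off block-average edge densities as the histogram heights. The automatic bandwidth-selection rule described in the paper is not reproduced; the use of scikit-learn and the function name `network_histogram` are our own illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def network_histogram(A, k, seed=0):
    """Crude network histogram: cluster nodes, return block edge densities."""
    # Spectral embedding: k leading eigenvectors (by magnitude) of the adjacency matrix.
    vals, vecs = np.linalg.eigh(A.astype(float))
    top = np.argsort(np.abs(vals))[::-1][:k]
    X = vecs[:, top]
    z = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    # Block-average edge densities play the role of histogram heights.
    heights = np.zeros((k, k))
    for a in range(k):
        for b in range(k):
            ia, ib = np.where(z == a)[0], np.where(z == b)[0]
            pairs = len(ia) * (len(ia) - 1) if a == b else len(ia) * len(ib)
            if pairs > 0:
                heights[a, b] = A[np.ix_(ia, ib)].sum() / pairs
    return z, heights
```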
In digital imaging applications, data are typically obtained via a spatial subsampling procedure implemented as a color filter array-a physical construction whereby only a single color value is measured at each pixel location. Owing to the growing ubiquity of color imaging and display devices, much recent work has focused on the implications of such arrays for subsequent digital processing, including in particular the canonical demosaicking task of reconstructing a full color image from spatially subsampled and incomplete color data acquired under a particular choice of array pattern. In contrast to the majority of the demosaicking literature, we consider here the problem of color filter array design and its implications for spatial reconstruction quality. We pose this problem formally as one of simultaneously maximizing the spectral radii of luminance and chrominance channels subject to perfect reconstruction, and-after proving sub-optimality of a wide class of existing array patterns-provide a constructive method for its solution that yields robust, new panchromatic designs implementable as subtractive colors. Empirical evaluations on multiple color image test sets support our theoretical results, and indicate the potential of these patterns to increase spatial resolution for fixed sensor size, and to contribute to improved reconstruction fidelity as well as significantly reduced hardware complexity.
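The paper concerns the design of new array patterns rather than the classical Bayer pattern, but the sketch below illustrates what color-filter-array subsampling does to a full RGB image, using the familiar RGGB Bayer layout purely as an assumed example.

```python
import numpy as np

def bayer_mosaic(rgb):
    """Simulate color-filter-array acquisition with an RGGB Bayer pattern.

    rgb : (H, W, 3) array. Returns an (H, W) mosaic in which each pixel
    retains only the single color value its filter admits.
    """
    H, W, _ = rgb.shape
    mosaic = np.zeros((H, W), dtype=rgb.dtype)
    mosaic[0::2, 0::2] = rgb[0::2, 0::2, 0]  # R at even rows, even columns
    mosaic[0::2, 1::2] = rgb[0::2, 1::2, 1]  # G
    mosaic[1::2, 0::2] = rgb[1::2, 0::2, 1]  # G
    mosaic[1::2, 1::2] = rgb[1::2, 1::2, 2]  # B at odd rows, odd columns
    return mosaic
```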
Spectral methods are of fundamental importance in statistics and machine learning, because they underlie algorithms from classical principal components analysis to more recent approaches that exploit manifold structure. In most cases, the core technical problem can be reduced to computing a low-rank approximation to a positive-definite kernel. For the growing number of applications dealing with very large or high-dimensional datasets, however, the optimal approximation afforded by an exact spectral decomposition is too costly, because its complexity scales as the cube of either the number of training examples or their dimensionality. Motivated by such applications, we present here two new algorithms for the approximation of positive-semidefinite kernels, together with error bounds that improve on results in the literature. We approach this problem by seeking to determine, in an efficient manner, the most informative subset of our data relative to the kernel approximation task at hand. This leads to two new strategies based on the Nyström method that are directly applicable to massive datasets. The first of these, based on sampling, leads to a randomized algorithm whereupon the kernel induces a probability distribution on its set of partitions, whereas the latter approach, based on sorting, provides for the selection of a partition in a deterministic way. We detail their numerical implementation and provide simulation results for a variety of representative problems in statistical data analysis, each of which demonstrates the improved performance of our approach relative to existing methods.

statistical data analysis | kernel methods | low-rank approximation

Spectral methods hold a central place in statistical data analysis. Indeed, the spectral decomposition of a positive-definite kernel underlies a variety of classical approaches such as principal components analysis (PCA), in which a low-dimensional subspace that explains most of the variance in the data is sought; Fisher discriminant analysis, which aims to determine a separating hyperplane for data classification; and multidimensional scaling (MDS), used to realize metric embeddings of the data. Moreover, the importance of spectral methods in modern statistical learning has been reinforced by the recent development of several algorithms designed to treat nonlinear structure in data, a case where classical methods fail. Popular examples include isomap (1), spectral clustering (2), Laplacian (3) and Hessian (4) eigenmaps, and diffusion maps (5). Though these algorithms have different origins, each requires the computation of the principal eigenvectors and eigenvalues of a positive-definite kernel. Although the computational cost (in both space and time) of spectral methods is but an inconvenience for moderately sized datasets, it becomes a genuine barrier as data sizes increase and new application areas appear. A variety of techniques, spanning fields from classical linear algebra to theoretical computer science (6), have been proposed to trade off analysis precision ...
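For context, here is a minimal sketch of the generic Nyström construction K ≈ C W⁺ Cᵀ that the two proposed algorithms build on; the specific sampling and sorting schemes of the paper are not reproduced. Landmark columns are chosen here with probability proportional to the diagonal of K, one common heuristic assumed for illustration.

```python
import numpy as np

def nystrom_approximation(K, m, seed=None):
    """Nystrom low-rank approximation of a positive-semidefinite kernel K.

    Samples m landmark columns with probability proportional to diag(K)
    and returns K_hat ~= C @ pinv(W) @ C.T.
    """
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    probs = np.diag(K) / np.trace(K)
    idx = rng.choice(n, size=m, replace=False, p=probs)
    C = K[:, idx]
    W = K[np.ix_(idx, idx)]
    return C @ np.linalg.pinv(W) @ C.T

# Toy usage on a Gaussian kernel over 500 random points in 5 dimensions.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)
K_hat = nystrom_approximation(K, m=50, seed=1)
print(np.linalg.norm(K - K_hat) / np.linalg.norm(K))  # relative approximation error
```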