Eirinaios Michelakis scite author profile

et al. 2010

Soc. Netw. Anal. Min.

Triangle counting is an important problem in graph mining. The clustering coefficient and the transitivity ratio, two commonly used measures, effectively quantify the triangle density of a graph and so capture the fact that friends of friends tend to be friends themselves. Furthermore, several successful graph mining applications rely on the number of triangles in the graph. In this paper, we study the problem of counting triangles in large, power-law networks. Our algorithm, SPARSIFYINGEIGENTRIANGLE, relies on the spectral properties of power-law networks and the Achlioptas-McSherry sparsification process. SPARSIFYINGEIGENTRIANGLE is easy to parallelize, fast, and accurate. We verify the validity of our approach with several experiments on real-world graphs, where we achieve both high accuracy and considerable speedup versus a straightforward exact counting competitor. Finally, our contributions include a simple method for making link recommendations in online social networks based on the number of triangles, as well as a procedure for estimating triangles using sketches.

Uncertainty management in rule-based information extraction systems

Krishnamurthy

Haas

et al. 2009

Rule-based information extraction is a process by which structured objects are extracted from text based on user-defined rules. The compositional nature of rule-based information extraction also allows rules to be expressed over previously extracted objects. Such extraction is inherently uncertain, due to the varying precision associated with the rules used in a specific extraction task. Quantifying this uncertainty is crucial for querying the extracted objects in probabilistic databases, and for improving the recall of extraction tasks that use compositional rules. In this paper, we provide a probabilistic framework for handling the uncertainty in rule-based information extraction. Specifically, for each extraction task, we build a parametric exponential model of uncertainty that captures the interaction between the different rules, as well as the compositional nature of the rules; the exponential form of our model follows from maximum-entropy considerations. We also give model-decomposition techniques that make the learning algorithms scalable to large numbers of rules and constraints. Experiments over multiple real-world extraction tasks confirm that our approach yields accurate probability estimates with only a small performance overhead. Moreover, our framework supports incremental pay-as-you-go improvements in the accuracy of probability estimates as new rules, data, or constraints are added.
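The link between a precision constraint and an exponential form can be illustrated on the smallest possible case. The sketch below is not the paper's model: it assumes a single hypothetical rule whose output is correct with precision q, and uses the standard fact that the maximum-entropy distribution on {0, 1} with mean q is p(x) proportional to exp(theta * x) with theta = log(q / (1 - q)). The function name and the value 0.8 are illustrative assumptions.

```python
import math


def maxent_bernoulli(q):
    """Exponential-form distribution on x in {0, 1} whose mean equals q.

    Assumption for illustration: q is the precision of one hypothetical
    extraction rule, and x = 1 means "the extracted object is correct".
    """
    # Natural parameter of the exponential family p(x) ∝ exp(theta * x).
    theta = math.log(q / (1.0 - q))
    # Normalizing constant: sum of exp(theta * x) over x = 0 and x = 1.
    z = 1.0 + math.exp(theta)
    return theta, [1.0 / z, math.exp(theta) / z]


# Hypothetical rule with precision 0.8.
theta, probs = maxent_bernoulli(0.8)
print(theta, probs)
```

With several interacting rules the same idea yields one parameter per constraint, which is where scalable learning becomes the hard part.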

Spectral Counting of Triangles in Power-Law Networks via Element-Wise Sparsification

Tsourakakis

Drineas

et al. 2009

Triangle counting is an important problem in graph mining. The clustering coefficient and the transitivity ratio, two commonly used measures, effectively quantify the triangle density of a graph and so capture the fact that friends of friends tend to be friends themselves. Furthermore, several successful graph mining applications rely on the number of triangles. In this paper, we study the problem of counting triangles in large, power-law networks. Our algorithm, SPARSIFYINGEIGENTRIANGLE, relies on the spectral properties of power-law networks and the Achlioptas-McSherry sparsification process. SPARSIFYINGEIGENTRIANGLE is easy to parallelize, fast, and accurate. We verify the validity of our approach with several experiments on real-world graphs, where we achieve both high accuracy and considerable speedup versus a straightforward exact counting competitor.
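The spectral idea behind this abstract can be sketched in a few lines. The sketch rests on a standard identity: for a simple undirected graph with adjacency matrix A, the number of triangles equals trace(A^3) / 6, which is the sum of the cubed eigenvalues of A divided by 6. The code below is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the dense NumPy representation, and the choice of sparsification (keep each edge independently with probability p and reweight kept edges by 1/p) are assumptions made for the example.

```python
import numpy as np


def exact_triangles(A):
    """Exact count: trace(A^3) counts every triangle 6 times."""
    return int(round(np.trace(A @ A @ A) / 6))


def spectral_triangles(A, k=None):
    """Estimate triangles from the eigenvalues of a symmetric matrix A.

    With k=None all eigenvalues are used and the result matches the
    exact count up to rounding. With an integer k, only the k
    eigenvalues of largest magnitude are kept, which is an approximation.
    """
    lam = np.linalg.eigvalsh(A)
    if k is not None:
        lam = lam[np.argsort(-np.abs(lam))[:k]]
    return float(np.sum(lam ** 3) / 6.0)


def sparsify(A, p, rng):
    """Keep each undirected edge with probability p, reweighted by 1/p."""
    upper = np.triu(A, k=1)              # one entry per undirected edge
    mask = rng.random(upper.shape) < p   # independent coin flip per entry
    kept = upper * mask / p
    return kept + kept.T                 # restore symmetry


# Complete graph on 4 nodes: eigenvalues are 3, -1, -1, -1, so 4 triangles.
A = np.ones((4, 4)) - np.eye(4)
rng = np.random.default_rng(0)
print(exact_triangles(A))                         # 4
print(spectral_triangles(A))                      # approximately 4.0
print(spectral_triangles(A, k=1))                 # approximately 4.5
print(spectral_triangles(sparsify(A, 0.5, rng)))  # random estimate
```

Under this reweighting a triangle survives with probability p^3 and contributes 1/p^3 when it does, so the estimate on the sparsified matrix is unbiased when all eigenvalues are used. On large graphs one would use a sparse matrix and an iterative eigensolver for the top eigenvalues rather than a dense decomposition.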

Probabilistic declarative information extraction

Wang

Franklin

et al. 2010

BayesStore

et al. 2008

Several real-world applications need to effectively manage and reason about large amounts of data that are inherently uncertain. For instance, pervasive computing applications must constantly reason about volumes of noisy sensory readings for a variety of reasons, including motion prediction and human behavior modeling. Such probabilistic data analyses require sophisticated machine-learning tools that can effectively model the complex spatio-temporal correlation patterns present in uncertain sensory data. Unfortunately, to date, most existing approaches to probabilistic database systems have relied on somewhat simplistic models of uncertainty that can be easily mapped onto existing relational architectures: probabilistic information is typically associated with individual data tuples, with only limited or no support for effectively capturing and reasoning about complex data correlations. In this paper, we introduce BayesStore, a novel probabilistic data management architecture built on the principle of handling statistical models and probabilistic inference tools as first-class citizens of the database system. Adopting a machine-learning view, BayesStore employs concise statistical relational models to effectively encode the correlation patterns between uncertain data, and promotes probabilistic inference and statistical model manipulation as part of the standard DBMS operator repertoire to support efficient and sound query processing. We present BayesStore's uncertainty model, based on a novel first-order statistical model, and we redefine traditional query processing operators to manipulate the data and the probabilistic models of the database in an efficient manner. Finally, we validate our approach by demonstrating the value of exploiting data correlations during query processing, and by evaluating a number of optimizations that significantly accelerate query processing.