High quality, scalable and parallel community detection for large real graphs

Prat-Pérez, Arnau; Domínguez-Sal, David; Larriba-Pey, Josep Lluís

doi:10.1145/2566486.2568010

Cited by 119 publications

(85 citation statements)

References 25 publications


“…The algorithms used for this study were the Louvain method [34], WalkTrap [35], OSLOM [36], SCD [37], LPA [6], BigClam [30], Infomap [38] and SLPA.…”

Section: Methods (mentioning)

confidence: 99%

Parallel and distributed core label propagation with graph coloring

Attal

Malek

Zolghadri

2017

Concurrency and Computation


Label propagation is one of the fastest methods for community detection with near linear time complexity. It is a local method where each node interacts with its neighbors to change its own label. Unfortunately, it has two major drawbacks. The first is a bad propagation, sometimes leading to huge communities without meaning (the giant communities problem). The second is related to its instability. Trials of a label propagation algorithm rarely give the same result. We propose to use a more stable variant of label propagation with a core method attached in order to obtain a more deterministic algorithm. This implementation will be done in a parallel and distributed environment on Hadoop using the MapReduce framework in order to apply this method with graphs having millions of nodes and edges. The main contribution of this paper is to model a parallel and distributed algorithm to achieve this purpose. A case study of the algorithm proposed is described at the end of the article along with the comparison of our results with other well-known algorithms.

INTRODUCTION

Networks are powerful tools used to model real complex systems in many fields like biology (protein-protein interaction), anthropology, sports, the web, social networks, economics, fraud detection and risk clustering. Most of the networks that represent real complex systems show very specific characteristics with dense groups of nodes with many connections between nodes inside a group and few with the rest of the graph. These highly connected groups of nodes are called communities. Three main families can be distinguished in the field of community detection research: global, local and hybrid methods. Comparative analyses of these methods can be found in literature. [1][2][3][4] In this paper, we present a method to develop core label propagation using Hadoop, based on graph coloring.
In Section 2, we describe the graph coloring problem followed by Section 3 which defines key label propagation issues, their variants and the main parallel and distributed algorithms found in the literature. Section 4 presents the parallel and distributed algorithm we propose for community detection. We present and discuss the results of our experiments on large graphs in Section 5. These results are compared with those obtained through the application of the main algorithms found in the literature. Finally, general conclusions and several future research paths are presented in Section 6.

GRAPH COLORING PROBLEM

The graph coloring problem is one of graph partitioning into k independent sets of nodes. Considering a graph G = (V, E), where V is the set of nodes and E stands for the set of edges, the graph coloring problem consists in partitioning V into a minimum number of color classes D_1, …, D_k where two directly linked nodes cannot have the same color. Finding k independent sets of nodes conforming to this constraint is called the k-coloring graph problem. More formally, a k-coloring of G can be defined by mapping f : V → {1, 2, …, k} such that for every edge (u, v) ∈ E, f...
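The coloring constraint described in the excerpt above (no two directly linked nodes share a color, so each color class is an independent set) can be illustrated with a small sketch. This is a generic greedy heuristic, not the method of the cited paper, and it does not guarantee a minimum number of colors. The function names `greedy_coloring`, `is_proper_coloring`, and `color_classes` are hypothetical, and the sketch assumes an undirected graph without self-loops whose node identifiers can be sorted.

```python
from collections import defaultdict


def greedy_coloring(edges):
    """Give each node the smallest color not used by its already-colored neighbors.

    Assumption: `edges` is an iterable of (u, v) pairs of an undirected graph
    without self-loops. The result is a valid coloring, not necessarily minimal.
    """
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    colors = {}
    # Visit high-degree nodes first (a common heuristic); ties broken by node id.
    for node in sorted(adjacency, key=lambda n: (-len(adjacency[n]), n)):
        used = {colors[nb] for nb in adjacency[node] if nb in colors}
        color = 1
        while color in used:
            color += 1
        colors[node] = color
    return colors


def is_proper_coloring(edges, colors):
    """Check the constraint from the text: linked nodes never share a color."""
    return all(colors[u] != colors[v] for u, v in edges)


def color_classes(colors):
    """Group nodes by color; each group is an independent set."""
    classes = defaultdict(set)
    for node, color in colors.items():
        classes[color].add(node)
    return dict(classes)


if __name__ == "__main__":
    # A triangle (1, 2, 3) with one pendant node (4) attached to node 3.
    example_edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
    coloring = greedy_coloring(example_edges)
    print(coloring)
    print(color_classes(coloring))
```

Because a triangle needs three colors, any valid coloring of this example uses at least three color classes; the greedy pass happens to reach that bound here, which is not guaranteed on arbitrary graphs.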


“…This approach allows to compare overlapping clusters, but unlike GNMI we introduce in Section IV-C, it yields values that are incompatible with standard NMI [5] results. The Average F1 score is introduced in [7], [28] and a similar metric, NVD, is introduced in [9]. The Average F1 score belongs to the family of Cluster Matching Based Metrics and is described in Section IV-B.…”

Section: Related Work (mentioning)

confidence: 99%

Accuracy Evaluation of Overlapping and Multi-Resolution Clustering Algorithms on Large Datasets

Lutov

Khayati

Cudré-Mauroux

2019

2019 IEEE International Conference on Big Data and Smart Computing (BigComp)


Performance of clustering algorithms is evaluated with the help of accuracy metrics. There is a great diversity of clustering algorithms, which are key components of many data analysis and exploration systems. However, there exist only few metrics for the accuracy measurement of overlapping and multi-resolution clustering algorithms on large datasets. In this paper, we first discuss existing metrics, how they satisfy a set of formal constraints, and how they can be applied to specific cases. Then, we propose several optimizations and extensions of these metrics. More specifically, we introduce a new indexing technique to reduce both the runtime and the memory complexity of the Mean F1 score evaluation. Our technique can be applied on large datasets and it is faster on a single CPU than state-of-the-art implementations running on high-performance servers. In addition, we propose several extensions of the discussed metrics to improve their effectiveness and satisfaction to formal constraints without affecting their efficiency. All the metrics discussed in this paper are implemented in C++ and are available for free as open-source packages that can be used either as stand-alone tools or as part of a benchmarking system to compare various clustering algorithms.
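For orientation, the average (mean) F1 score mentioned in the abstract above is commonly computed by matching every cluster to its best-F1 counterpart in the other clustering and averaging in both directions. The sketch below follows that common definition as an assumption; it is a naive quadratic version and does not reproduce the indexing technique or the C++ implementation of the cited paper. The names `f1`, `best_match_mean`, and `average_f1` are hypothetical, and clusters are assumed to be non-empty Python sets that may overlap.

```python
def f1(cluster, reference):
    """F1 score of `cluster` against `reference`, both non-empty sets of node ids."""
    overlap = len(cluster & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def best_match_mean(source, target):
    """Average, over clusters in `source`, of the best F1 found in `target`."""
    return sum(max(f1(c, t) for t in target) for c in source) / len(source)


def average_f1(detected, ground_truth):
    """Symmetric average F1 between two (possibly overlapping) clusterings.

    Assumption: both arguments are non-empty lists of non-empty sets.
    Cost is O(len(detected) * len(ground_truth)) set intersections.
    """
    return 0.5 * (best_match_mean(detected, ground_truth)
                  + best_match_mean(ground_truth, detected))


if __name__ == "__main__":
    detected = [{1, 2, 3}, {4, 5}]
    ground_truth = [{1, 2, 3}, {4, 5, 6}]
    print(average_f1(detected, ground_truth))
```

Comparing every pair of clusters is what makes this version impractical on large datasets, which is the cost the abstract says its indexing technique is meant to reduce.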


“…However, mutual information-based measures are biased to a large numbers of clusters while GNMI does not have any bounded computational complexity in general. Therefore, we evaluate clustering accuracy with F1h [45], a modification of the popular average F1-score (F1a) [40], [47] providing indicative values in the range [0, 0.5], since the artificial clusters formed from all combinations of the input nodes yield F1a → 0.5 and F1h → 0. First, we evaluate accuracy for all the deterministic algorithms listed in Table II on synthetic networks, and then evaluate both accuracy and efficiency for all clustering algorithms on real-world networks.…”

Table data embedded in the excerpt (column headers not present in the source):

amazon       238    3,237  339     3,177  681  3,005  155    247    1,055   37   337
dblp         225    3,909  373     3,435  717  2,879  167    247    1,394   36   373
youtube      737    4,815  1,052   -      -    8,350  508    830    3,865   131  1,050
livejournal  5,038  -      10,939  -      -    -      4,496  4,899  11,037  761  -

- denotes that the algorithm was terminated for violating the execution constraints; * the memory consumption and execution time for SCP are reported for a clique size k = 3 since they grow exponentially with k on dense networks, though accuracy was evaluated varying k ∈ 3..7.

Section: Effectiveness and Efficiency Evaluation (mentioning)

confidence: 99%

DAOC: Stable Clustering of Large Networks

Lutov

Khayati

Cudré-Mauroux

2019

2019 IEEE International Conference on Big Data (Big Data)


Clustering is a crucial component of many data mining systems involving the analysis and exploration of various data. Data diversity calls for clustering algorithms to be accurate while providing stable (i.e., deterministic and robust) results on arbitrary input networks. Moreover, modern systems often operate with large datasets, which implicitly constrains the complexity of the clustering algorithm. Existing clustering techniques are only partially stable, however, as they guarantee either determinism or robustness. To address this issue, we introduce DAOC, a Deterministic and Agglomerative Overlapping Clustering algorithm. DAOC leverages a new technique called Overlap Decomposition to identify fine-grained clusters in a deterministic way capturing multiple optima. In addition, it leverages a novel consensus approach, Mutual Maximal Gain, to ensure robustness and further improve the stability of the results while still being capable of identifying micro-scale clusters. Our empirical results on both synthetic and real-world networks show that DAOC yields stable clusters while being on average 25% more accurate than state-of-the-art deterministic algorithms without requiring any tuning. Our approach has the ambition to greatly simplify and speed up data analysis tasks involving iterative processing (need for determinism) as well as data fluctuations (need for robustness) and to provide accurate and reproducible results.

