Label propagation is one of the fastest methods for community detection, with near-linear time complexity. It is a local method in which each node updates its own label according to the labels of its neighbors. Unfortunately, it has two major drawbacks. The first is poor propagation, which sometimes produces huge, meaningless communities (the giant communities problem). The second is instability: separate runs of a label propagation algorithm rarely give the same result. We propose to use a more stable variant of label propagation, coupled with a core method, in order to obtain a more deterministic algorithm. The algorithm is implemented in a parallel and distributed environment on Hadoop using the MapReduce framework, so that it can be applied to graphs with millions of nodes and edges. The main contribution of this paper is the design of a parallel and distributed algorithm for this purpose. A case study of the proposed algorithm is described at the end of the article, along with a comparison of our results with those of other well-known algorithms.
INTRODUCTION
Networks are powerful tools for modeling real complex systems in many fields, such as biology (protein-protein interaction), anthropology, sports, the web, social networks, economics, fraud detection and risk clustering. Most networks that represent real complex systems share a specific characteristic: they contain dense groups of nodes, with many connections between nodes inside a group and few connections to the rest of the graph. These highly connected groups of nodes are called communities. Three main families can be distinguished in the field of community detection research: global, local and hybrid methods. Comparative analyses of these methods can be found in the literature [1][2][3][4]. In this paper, we present a method for implementing core label propagation on Hadoop, based on graph coloring. Section 2 describes the graph coloring problem. Section 3 defines the key issues of label propagation, its variants, and the main parallel and distributed algorithms found in the literature. Section 4 presents the parallel and distributed algorithm we propose for community detection. Section 5 presents and discusses the results of our experiments on large graphs, and compares them with those obtained with the main algorithms found in the literature. Finally, Section 6 presents general conclusions and several directions for future research.
GRAPH COLORING PROBLEM
The graph coloring problem is one of graph partitioning into $k$ independent sets of nodes. Considering a graph $G = (V, E)$, where $V$ is the set of nodes and $E$ stands for the set of edges, the graph coloring problem consists in partitioning $V$ into a minimum number of color classes $D_1, \dots, D_k$ where two directly linked nodes cannot have the same color. Finding $k$ independent sets of nodes conforming to this constraint is called the $k$-coloring graph problem. More formally, a $k$-coloring of $G$ can be defined by a mapping $f : V \to \{1, 2, \dots, k\}$ such that for every edge $(u, v) \in E$, f...