Scalable distributed subgraph enumeration

Lai, Longbin; Qin, Lu; Lin, Xuemin; Zhang, Ying; Chang, Lijun; Yang, Shiyu

doi:10.14778/3021924.3021937

Cited by 86 publications

(98 citation statements)

References 24 publications


“…If E is still true after the local verification, we add (u, v) to f (Line 13). Then we create a new trie node N for v with N as its parentN (Line 14,15). After that, if f grows to an EC of Pi , then for each undetermined edge e of f (both end vertices are not in the local machine), we add N to I[e] (Line 17, 18).…”

Section: Algorithm 1: Expandembedtrie (mentioning)

confidence: 99%

“…Since join-based methods need to group the intermediate results based on keys so as to join them together, the performance was significantly dragged down when dealing with sparse graphs compared with RADS and PSgL. It is worth noting that PSgL was verified slower than TwinTwig and SEED in [13] [15]. This may be because the datasets used in TwinTwig and SEED are much denser than RoadNet, hence a huge number of embeddings will be generated.…”

Section: Exp-1: roadnet (mentioning)

confidence: 99%

“…It is observed in previous work [15,18] that when the data graph is large, the number of intermediate results can be huge, making the network communication cost a bottleneck and causing memory crash. On the other hand, systems that rely on replication of large parts of the data graph or heavy indexes are impractical for large data graphs and low-end computer clusters.…”

Section: Introduction (mentioning)

confidence: 99%


Fast and robust distributed subgraph enumeration

et al. 2019


We study the classic subgraph enumeration problem under distributed settings. Existing solutions either suffer from severe memory crisis or rely on large indexes, which makes them impractical for very large graphs. Most of them follow a synchronous model where the performance is often bottlenecked by the machine with the worst performance. Motivated by this, in this paper, we propose RADS, a Robust Asynchronous Distributed Subgraph enumeration system. RADS first identifies results that can be found using single-machine algorithms. This strategy not only improves the overall performance but also reduces network communication and memory cost. Moreover, RADS employs a novel region-grouped multi-round expand verify & filter framework which does not need to shuffle and exchange the intermediate results, nor does it need to replicate a large part of the data graph in each machine. This feature not only reduces network communication cost and memory usage, but also allows us to adopt simple strategies for memory control and load balancing, making it more robust. Several heuristics are also used in RADS to further improve the performance. Our experiments verified the superiority of RADS to state-of-the-art subgraph enumeration approaches.

“…In doing so, we also achieve two things. First, we compare our work to the recent SEED [36] work, which develops efficient optimizations for evaluating undirected subgraph queries in the distributed setting. Second, by implementing one of the optimizations, we demonstrate that our approach can take as input general relations instead of the binary edge(ai, aj) relations we used so far.…”

Section: Generality and Specializations (mentioning)

confidence: 99%

Distributed evaluation of subgraph queries using worst-case optimal low-memory dataflows

et al. 2018


We study the problem of finding and monitoring fixed-size subgraphs in a continually changing large-scale graph. We present the first approach that (i) performs worst-case optimal computation and communication, (ii) maintains a total memory footprint linear in the number of input edges, and (iii) scales down per-worker computation, communication, and memory requirements linearly as the number of workers increases, even on adversarially skewed inputs. Our approach is based on worst-case optimal join algorithms, recast as a data-parallel dataflow computation. We describe the general algorithm and modifications that make it robust to skewed data, prove theoretical bounds on its resource requirements in the massively parallel computing model, and implement and evaluate it on graphs containing as many as 64 billion edges. The underlying algorithm and ideas generalize from finding and monitoring subgraphs to the more general problem of computing and maintaining relational equi-joins over dynamic relations.

Edge-at-a-time Approaches

Perhaps the most common approach to finding instances of a query subgraph is to treat it as a relational query, and to execute a sequence of binary joins to determine the result. For example,
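The edge-at-a-time idea described above (a subgraph query evaluated as a sequence of binary joins over an edge relation) can be illustrated with a minimal single-machine sketch for the triangle query. This is not code from any of the cited papers; the function name `triangles_by_binary_joins`, the smaller-to-larger edge orientation, and the sample graph are assumptions made purely for illustration.

```python
from collections import defaultdict


def triangles_by_binary_joins(edges):
    """Enumerate triangles (a < b < c) with two binary joins on the edge relation.

    Illustrative sketch only: a hypothetical helper, not the method of any cited system.
    """
    # Orient each undirected edge from the smaller to the larger vertex,
    # so every triangle is produced exactly once.
    oriented = {(min(u, v), max(u, v)) for u, v in edges if u != v}

    # Index the edge relation by source vertex to serve as the join's lookup side.
    out = defaultdict(set)
    for a, b in oriented:
        out[a].add(b)

    # First binary join: edge(a, b) JOIN edge(b, c) yields 2-paths (a, b, c).
    paths = [(a, b, c) for a, b in oriented for c in out[b]]

    # Second binary join: paths(a, b, c) JOIN edge(a, c) closes the triangle.
    return sorted((a, b, c) for a, b, c in paths if c in out[a])


edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
print(triangles_by_binary_joins(edges))
```

Note that the intermediate `paths` list can be far larger than the final result, which is the kind of intermediate-result growth the citation statements on this page describe as a communication and memory bottleneck in the distributed setting.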


“…Below we refer to some interesting representative examples. Methods, such as TwinTwig [22], sTwig [6] and SEED [23] deal with a single, very large graph, stored in a distributed infrastructure, and rely on parallel computing algorithms and infrastructures to perform the sub-iso testing. Methods, like iGQ [24] and GraphCache [25], employ caching on top of any proposed FTV method to improve performance and study the architecture, system and algorithms for a graph cache for subgraph queries for FTV and SI methods.…”

Section: Background and Related Work (mentioning)

confidence: 99%