2021
DOI: 10.48550/arxiv.2111.08202
Preprint

Learn Locally, Correct Globally: A Distributed Algorithm for Training Graph Neural Networks

Abstract: Despite the recent success of Graph Neural Networks (GNNs), training GNNs on large graphs remains challenging. The limited resource capacities of existing servers, the dependency between nodes in a graph, and the privacy concerns arising from centralized storage and model learning have spurred the need for an effective distributed algorithm for GNN training. However, existing distributed GNN training methods impose either excessive communication costs or large memory overheads that hinder their scalability. […]
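The local-training-plus-server-correction idea summarized in the abstract can be illustrated with a short sketch: each machine trains on its own partition while ignoring cross-partition edges, the server averages the resulting parameters, and a small correction step is then run on a globally sampled mini-batch. This is a minimal sketch under those assumptions; the toy model, helper names (`local_train`, `average_models`, `global_correction`), and hyperparameters are illustrative, not the authors' implementation.

```python
# Minimal sketch of a Learn-Locally-Correct-Globally style training round
# (illustrative only; the toy model and helper names are assumptions).
import copy
import torch
import torch.nn as nn

class ToyGNN(nn.Module):
    """Toy 2-layer GNN: propagation is a dense normalized-adjacency product."""
    def __init__(self, in_dim: int, hid_dim: int, n_classes: int):
        super().__init__()
        self.lin1 = nn.Linear(in_dim, hid_dim)
        self.lin2 = nn.Linear(hid_dim, n_classes)

    def forward(self, a_hat: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(a_hat @ self.lin1(x))
        return a_hat @ self.lin2(h)

def local_train(model, a_loc, x_loc, y_loc, steps=5, lr=1e-2):
    """Local phase: a machine trains only on its partition's subgraph, so
    edges to nodes held by other machines are simply absent from a_loc."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(a_loc, x_loc), y_loc).backward()
        opt.step()
    return model.state_dict()

def average_models(states):
    """Server averages the locally trained parameters (FedAvg-style)."""
    return {k: torch.stack([s[k] for s in states]).mean(dim=0) for k in states[0]}

def global_correction(model, a_mini, x_mini, y_mini, steps=1, lr=1e-2):
    """Correction phase: a few updates on a small globally sampled subgraph
    that still contains cross-partition edges."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(a_mini, x_mini), y_mini).backward()
        opt.step()

def training_round(global_model, partitions, global_minibatch):
    """One round: broadcast, local training per partition, average, correct."""
    states = []
    for a_p, x_p, y_p in partitions:
        local = copy.deepcopy(global_model)      # broadcast current weights
        states.append(local_train(local, a_p, x_p, y_p))
    global_model.load_state_dict(average_models(states))
    global_correction(global_model, *global_minibatch)
    return global_model
```

The first citation statement below makes the related point that if the correction step only ever sees a single small mini-batch, it may not be enough to recover what a plain GCN loses to partitioning.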

Cited by 2 publications (6 citation statements) | References 17 publications
“…LLCG performs worst, particularly on the Reddit dataset, because in LLCG's global server correction only a mini-batch is trained, which is not sufficient to correct the plain GCN. This is also why the authors of LLCG report the performance of a more complex model that mixes GCN layers and GraphSAGE layers [22]. DGL achieves good performance on some datasets (e.g., OGB-products) with a uniform node sampling strategy and real-time embedding exchange.…”
Section: Results (mentioning)
confidence: 99%
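The "complex model with mixing GCN layers and GraphSAGE layers" mentioned above could look roughly like the following PyTorch Geometric sketch; the depth, widths, and layer ordering are assumptions for illustration rather than the configuration actually reported in [22].

```python
# Hypothetical mixed GCN/GraphSAGE model (illustrative only; layer choice and
# sizes are assumptions, not the LLCG authors' reported configuration).
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, SAGEConv

class MixedGNN(torch.nn.Module):
    def __init__(self, in_dim: int, hid_dim: int, n_classes: int):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hid_dim)       # GCN-style propagation
        self.conv2 = SAGEConv(hid_dim, n_classes)   # GraphSAGE-style aggregation

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        x = F.dropout(x, p=0.5, training=self.training)
        return self.conv2(x, edge_index)
```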
“…"Partition-based" generalizes the existing data parallelism techniques of classical distributed training on i.i.d data to graph data and enjoys minimal communication cost. However, directly partitioning a large graph into multiple subgraphs can result in severe information loss due to the ignorance of huge number of cross-subgraph edges and cause performance degeneration [1,14,22]. For these methods, the embedding of neighbors out of the current subgraph (second embedding set in Eq.…”
Section: Background and Problem Formulationmentioning
confidence: 99%
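The information loss described above, caused by dropping cross-subgraph edges, can be made concrete with a small helper that splits an edge list according to a node-to-partition assignment and reports which edges a purely partition-based scheme never sees; the function name and interface are invented for illustration.

```python
# Illustrative helper (not from any cited system): split an edge list by a
# node-to-partition assignment and report the cross-partition edges that are lost.
from collections import defaultdict

def split_edges_by_partition(edges, part_of):
    """edges: iterable of (u, v) pairs; part_of: dict mapping node -> partition id."""
    local_edges = defaultdict(list)   # partition id -> edges kept inside it
    dropped = []                      # cross-partition edges a naive scheme ignores
    for u, v in edges:
        if part_of[u] == part_of[v]:
            local_edges[part_of[u]].append((u, v))
        else:
            dropped.append((u, v))
    return dict(local_edges), dropped

# Example: a 4-node path graph split into two partitions loses the middle edge.
edges = [(0, 1), (1, 2), (2, 3)]
part_of = {0: 0, 1: 0, 2: 1, 3: 1}
kept, dropped = split_edges_by_partition(edges, part_of)
print(kept)     # {0: [(0, 1)], 1: [(2, 3)]}
print(dropped)  # [(1, 2)] -- the edge a partition-only method never sees
```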
“…In such a training strategy, a partition is a mini-batch, and we call it a partition-based mini-batch. PSGD-PA [82] is a straightforward implementation of the above idea with a Parameter Server. In GraphTheta [65], the partitions are obtained via a community detection algorithm.…”
Section: Partition-based Mini-batch Generation (mentioning)
confidence: 99%
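A synchronous parameter-server loop in which each worker's partition plays the role of one mini-batch might be organized roughly as below; the push/average/apply structure and all names are assumptions for illustration, not the actual PSGD-PA [82] or GraphTheta [65] implementations.

```python
# Hypothetical synchronous parameter-server loop for partition-based mini-batches.
# Each worker's partition is treated as one mini-batch; names are illustrative.
import torch
import torch.nn as nn

def worker_gradient(model, loss_fn, batch_inputs, batch_labels):
    """One worker: compute gradients of the shared model on its partition,
    which serves as the mini-batch (inputs could be (A_hat, X) for a GNN)."""
    model.zero_grad()
    loss_fn(model(*batch_inputs), batch_labels).backward()
    return [p.grad.detach().clone() for p in model.parameters()]

def server_apply(model, grads_per_worker, lr=1e-2):
    """Parameter server: average the gradients pushed by all workers and take
    one SGD step; workers would then pull the updated parameters."""
    with torch.no_grad():
        for i, p in enumerate(model.parameters()):
            p -= lr * torch.stack([g[i] for g in grads_per_worker]).mean(dim=0)

# Tiny demo with a linear model standing in for a GNN on two partitions.
model = nn.Linear(4, 3)
loss_fn = nn.CrossEntropyLoss()
parts = [((torch.randn(8, 4),), torch.randint(0, 3, (8,))) for _ in range(2)]
grads = [worker_gradient(model, loss_fn, xb, yb) for xb, yb in parts]
server_apply(model, grads)
```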