GNNPipe: Scaling Deep GNN Training with Pipelined Model Parallelism

Chen, Jingji; Chen, Zhuoming; Qian, Xuehai

Computer Science > Distributed, Parallel, and Cluster Computing

arXiv:2308.10087 (cs)

[Submitted on 19 Aug 2023 (v1), last revised 24 Sep 2023 (this version, v2)]

Abstract: Communication is a key bottleneck for distributed graph neural network (GNN) training. This paper proposes GNNPipe, a new approach that scales distributed full-graph deep GNN training. Being the first to use layer-level model parallelism for GNN training, GNNPipe partitions GNN layers among GPUs: each device performs the computation for a disjoint subset of consecutive GNN layers on the whole graph. Compared to graph parallelism, in which each GPU handles a graph partition, GNNPipe reduces the communication volume by a factor of the number of GNN layers. GNNPipe overcomes the unique challenges of pipelined layer-level model parallelism on the whole graph by partitioning the graph into dependent chunks, allowing the use of historical vertex embeddings, and applying specific training techniques to ensure convergence. We also propose a hybrid approach that combines GNNPipe with graph parallelism to handle large graphs, achieve better computing resource utilization, and ensure model convergence. We build a general GNN training system supporting all three parallelism settings. Extensive experiments show that our method reduces the per-epoch training time by up to 2.45x (on average 1.58x) and reduces the communication volume and overhead by up to 22.89x and 27.21x (on average 8.69x and 11.60x), respectively, while achieving model accuracy and convergence speed comparable to graph parallelism.
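
The layer-to-GPU assignment and chunk pipelining described in the abstract can be pictured with a small sketch. This is not the authors' implementation: the function names `partition_layers` and `forward_schedule`, the near-even split of layers, and the idealized one-step-per-chunk timing are all assumptions made for illustration. The sketch also ignores the graph dependencies between chunks, which the abstract says are handled with historical vertex embeddings.

```python
from typing import List, Tuple


def partition_layers(num_layers: int, num_gpus: int) -> List[List[int]]:
    """Split layer indices 0..num_layers-1 into consecutive, disjoint groups.

    Hypothetical helper: a near-even split is assumed here; the abstract
    only states that each GPU owns a disjoint subset of consecutive layers.
    """
    if not 1 <= num_gpus <= num_layers:
        raise ValueError("need 1 <= num_gpus <= num_layers")
    base, extra = divmod(num_layers, num_gpus)
    stages: List[List[int]] = []
    start = 0
    for gpu in range(num_gpus):
        size = base + (1 if gpu < extra else 0)  # earlier stages absorb the remainder
        stages.append(list(range(start, start + size)))
        start += size
    return stages


def forward_schedule(num_stages: int, num_chunks: int) -> List[List[Tuple[int, int]]]:
    """Return, for each time step, the (stage, chunk) pairs that run concurrently.

    Idealized assumption: every (stage, chunk) unit takes one time step, and
    stage s can start chunk c only after stage s-1 has finished chunk c,
    so the pair (s, c) runs at step s + c.
    """
    steps: List[List[Tuple[int, int]]] = [[] for _ in range(num_stages + num_chunks - 1)]
    for stage in range(num_stages):
        for chunk in range(num_chunks):
            steps[stage + chunk].append((stage, chunk))
    return steps
```

For example, `partition_layers(8, 4)` returns `[[0, 1], [2, 3], [4, 5], [6, 7]]`, and `forward_schedule(3, 4)` spans 6 time steps, with all three stages busy only in the middle steps. Under this model, embeddings cross a GPU boundary only between consecutive stages rather than at every layer, which is the intuition behind the communication reduction claimed in the abstract.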

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2308.10087 [cs.DC]
	(or arXiv:2308.10087v2 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2308.10087

Submission history

From: Jingji Chen
[v1] Sat, 19 Aug 2023 18:44:14 UTC (1,040 KB)
[v2] Sun, 24 Sep 2023 17:04:05 UTC (1,197 KB)
