Heterogeneous parallel algorithm design and performance optimization for WENO on the Sunway Taihulight supercomputer

Self Cite

A moisture advection scheme is an essential module of a numerical weather/climate model, representing the horizontal transport of water vapor. The Piecewise Rational Method (PRM) scalar advection scheme in the Global/Regional Assimilation and Prediction System (GRAPES) solves the moisture flux advection equation. Computing the scalar advection involves boundary exchange and has high bandwidth requirements, which makes it complicated and time-consuming in GRAPES. Recently, Graphics Processing Units (GPUs) have been widely used to solve scientific and engineering computing problems owing to advancements in GPU hardware and related programming models such as CUDA/OpenCL and Open Accelerator (OpenACC). Herein, we present an accelerated PRM scalar advection scheme with Message Passing Interface (MPI) and OpenACC to fully exploit GPUs' power over a cluster with multiple Central Processing Units (CPUs) and GPUs, together with optimizations such as minimizing data transfer, memory coalescing, exposing more parallelism, and overlapping computation with data transfers. Results show a speedup of about 3.5 times for the entire model running at medium resolution with double precision when comparing the scheme's elapsed time on a node with two GPUs (NVIDIA P100) and two 16-core CPUs (Intel Gold 6142). Further, experiments with a higher-resolution model on multiple GPUs show excellent scalability.

Section: Algorithm (mentioning)

confidence: 99%

“…Several examples that partially adapted GPUs in weather and climate prediction codes showed performance gains [7][8][9][10][11][12][13][14][15][16][17]. Especially, GPU acceleration of scalar or tracer advection modules using Compute Unified Device Architecture (CUDA) C/Fortran achieves an approximately three-fold speedup [12,18,19].…”

Section: Introduction (mentioning)

confidence: 99%

An MPI+OpenACC-based PRM scalar advection scheme in the GRAPES model over a cluster with multiple CPUs and GPUs

Xiao

Lü

Huang

et al. 2022

Self Cite

“…Numerous approaches have emerged in various fields trying to solve this problem. Among them are two most effective and common solutions: One is to restructure the model at the software level by super individual or other methods; another is to speed up large-scale computing by distributed parallel computing [91,92] or using new computation tools, such as the Quantum tool [93].…”

Section: Large-scale MAMS (mentioning)

confidence: 99%

Multi-agent modeling and simulation in the AI age

Fan

Chen

Shi

et al. 2021

With the rapid development of artificial intelligence (AI) technology and its successful application in various fields, modeling and simulation technology, especially multi-agent modeling and simulation (MAMS), of complex systems has rapidly advanced. In this study, we first describe the concept, technical advantages, research steps, and research status of MAMS. Then we review the development status of the hybrid modeling and simulation combining multi-agent and system dynamics, the modeling and simulation of multi-agent reinforcement learning, and the modeling and simulation of large-scale multi-agent systems. Lastly, we introduce existing MAMS platforms and their comparative studies. This work summarizes the current research situation of MAMS, thus helping scholars understand the systematic technology development of MAMS in the AI era. It also paves the way for further research on MAMS technology.

“…The past few decades have witnessed an explosion of data in both the number of observations and parameters, resulting in significant interest in distributed algorithms for solving large-scale machine learning problems [1][2][3][4][5][6][7]. However, efficient implementations of the distributed optimization algorithms for machine learning applications are challenging.…”

Section: Introduction (mentioning)

confidence: 99%

SignGD with error feedback meets lazily aggregated technique: Communication-efficient algorithms for distributed learning

Deng

Sun

Liu

et al. 2022

The proliferation of massive datasets has led to significant interest in distributed algorithms for solving large-scale machine learning problems. However, the communication overhead is a major bottleneck that hampers the scalability of distributed machine learning systems. In this paper, we design two communication-efficient algorithms for distributed learning tasks. The first one is named EF-SIGNGD, in which we use the 1-bit (sign-based) gradient quantization method to save the communication bits. Moreover, the error feedback technique, i.e., incorporating the error made by the compression operator into the next step, is employed for the convergence guarantee. The second algorithm is called LE-SIGNGD, in which we introduce a well-designed lazy gradient aggregation rule to EF-SIGNGD that can detect the gradients with small changes and reuse the outdated information. LE-SIGNGD saves communication costs both in transmitted bits and communication rounds. Furthermore, we show that LE-SIGNGD is convergent under some mild assumptions. The effectiveness of the two proposed algorithms is demonstrated through experiments on both real and synthetic data.
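The error-feedback idea in the abstract above (compress the update to signs, then carry the compression error into the next step) can be illustrated with a minimal single-worker sketch. This is not the authors' EF-SIGNGD implementation: the abstract does not give the exact compressor or update rule, so the scaled-sign compressor, the function names `sign_compress` and `ef_sign_step`, the step size, and the quadratic test objective are all assumptions chosen for illustration. The lazy aggregation rule of LE-SIGNGD is not covered.

```python
import math


def sign_compress(v):
    """Scaled sign compression (an assumed compressor, not taken from the paper).

    Each entry is reduced to its sign (1 bit); a single scalar, the mean
    absolute value, is kept so the compressed vector has a sensible magnitude.
    """
    scale = sum(abs(x) for x in v) / len(v)
    return [scale * (1.0 if x >= 0 else -1.0) for x in v]


def ef_sign_step(grad, error, lr):
    """One error-feedback step on a single worker.

    p         = lr * grad + error   (add back what compression lost last time)
    delta     = sign_compress(p)    (what would actually be transmitted)
    new_error = p - delta           (what compression lost this time)
    """
    p = [lr * g + e for g, e in zip(grad, error)]
    delta = sign_compress(p)
    new_error = [pi - di for pi, di in zip(p, delta)]
    return delta, new_error


def norm(v):
    return math.sqrt(sum(x * x for x in v))


# Hypothetical toy problem: minimize f(x) = 0.5 * ||x||^2, whose gradient is x.
x = [1.0, -2.0, 3.0]
initial_norm = norm(x)
error = [0.0] * len(x)
for _ in range(200):
    grad = list(x)  # gradient of the toy objective at x
    delta, error = ef_sign_step(grad, error, lr=0.1)
    x = [xi - di for xi, di in zip(x, delta)]
final_norm = norm(x)
```

In a distributed setting each worker would keep its own `error` vector and send only the signs plus one scale factor per step; the sketch keeps everything on one worker so that the bookkeeping of the residual stays visible.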