We describe our experience porting the Regensburg implementation of the DD-αAMG solver from QPACE 2 to QPACE 3. We first review how the code was ported from the first-generation Intel Xeon Phi processor (Knights Corner) to its successor (Knights Landing). We then describe the modifications to the communication library necessitated by the switch from InfiniBand to Omni-Path. Finally, we present the performance of the code on a single processor as well as the scaling on many nodes, where in both cases the speedup factor is close to the theoretical expectations.

1. Introduction

The lattice QCD (LQCD) community has traditionally been an early adopter of new computing and network architectures. This typically requires major efforts in porting simulation code or even communication libraries. The Regensburg lattice group (RQCD) has been involved in such efforts, as well as in supercomputer development, for more than a decade. While the first computer in the QPACE series [1,2] was based on IBM's Cell processor and an FPGA-based custom interconnect, the subsequent machines use Intel's Xeon Phi series with standard interconnects (see Sec. 2.1). To satisfy the increasing demands of the RQCD physics program we use a state-of-the-art method, DD-αAMG [3], to solve the discretized form of the Dirac equation. The high-performance implementation of this solver on QPACE 2 is described in [4-7]. The present contribution focuses on the software efforts we made to run this implementation efficiently on QPACE 3.

This paper is structured as follows. In Sec. 2 we give an overview of QPACE 3 and highlight the differences to QPACE 2 in terms of processor and network. We discuss the network technology in some detail because it has changed rather drastically. In Sec. 3 we describe how our solver and our communication library were adapted to the new technologies. In Sec. 4 we present single-node and multi-node benchmarks of the solver on QPACE 3 and compare the results with numbers obtained on QPACE 2. In Sec. 5 we conclude and give an outlook on future work.

2. QPACE 3

2.1 Overview

While QPACE 2 [8] is based on the Knights Corner (KNC) version of the Intel Xeon Phi processor series and an FDR InfiniBand network, its successor QPACE 3 utilizes the current Xeon Phi processor, Knights Landing (KNL), together with an Omni-Path network.
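To illustrate one reason the processor port is comparatively smooth: many of the 512-bit intrinsics of the KNC's IMCI instruction set carry over to the KNL's AVX-512. The sketch below is our own minimal example, not code from the solver; the kernel name and the assumptions of 64-byte-aligned inputs and a length divisible by 16 are ours.

#include <immintrin.h>

/* Minimal sketch (not the actual QPACE solver code): a fused
 * multiply-add kernel over single-precision arrays.  The _mm512_*
 * intrinsics used here exist both in the KNC's IMCI instruction set
 * and in the KNL's AVX-512, so such kernels recompile largely
 * unchanged.  Assumed: 64-byte-aligned pointers, n a multiple of 16. */
void fmadd_kernel(float *restrict y, const float *restrict a,
                  const float *restrict x, long n)
{
    for (long i = 0; i < n; i += 16) {
        __m512 va = _mm512_load_ps(a + i);
        __m512 vx = _mm512_load_ps(x + i);
        __m512 vy = _mm512_load_ps(y + i);
        vy = _mm512_fmadd_ps(va, vx, vy);   /* vy = va*vx + vy */
        _mm512_store_ps(y + i, vy);
    }
}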
On many parallel machines, the time LQCD applications spend in communication is a significant contribution to the total wall-clock time, especially in the strong-scaling limit. We present a novel high-performance communication library that can be used as a de facto drop-in replacement for MPI in existing software. Its lightweight nature, which avoids some of the unnecessary overhead introduced by MPI, allows us to improve the communication performance of applications without any algorithmic or complicated implementation changes. As a first real-world benchmark, we make use of the pMR library in the coarse-grid solve of the Regensburg implementation of the DD-αAMG algorithm. On realistic lattices, we see an improvement by a factor of 2 in pure communication time and savings in total execution time of up to 20%.
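The abstract does not spell out the pMR interface, so as an illustration we show the kind of non-blocking MPI point-to-point exchange that such a drop-in library replaces. This is plain MPI, not the pMR API; the neighbor ranks, message size, and tags are assumptions of this sketch.

#include <mpi.h>

/* Illustrative nearest-neighbor halo exchange in one dimension, of the
 * kind pMR is described as replacing.  Standard MPI only; "left" and
 * "right" are the ranks of the two neighbors on a ring. */
void halo_exchange(double *send_l, double *send_r,
                   double *recv_l, double *recv_r,
                   int n, int left, int right, MPI_Comm comm)
{
    MPI_Request req[4];
    /* Receives from the left pair with sends to the right (tag 0),
     * and vice versa (tag 1). */
    MPI_Irecv(recv_l, n, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Irecv(recv_r, n, MPI_DOUBLE, right, 1, comm, &req[1]);
    MPI_Isend(send_r, n, MPI_DOUBLE, right, 0, comm, &req[2]);
    MPI_Isend(send_l, n, MPI_DOUBLE, left,  1, comm, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
}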
We present details of our implementation of the Wuppertal adaptive algebraic multigrid code DD-αAMG on SIMD architectures, with particular emphasis on the first-generation Intel Xeon Phi processor (Knights Corner, KNC) used in QPACE 2. As a smoother, the algorithm uses a domain-decomposition-based solver previously developed for the KNC in Regensburg. We optimized the remaining parts of the multigrid code and conclude that it is a very good target for SIMD architectures. Some of the remaining bottlenecks can be eliminated by vectorizing over multiple test vectors in the setup, which is discussed in the contribution of Daniel Richtmann.
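A minimal sketch of the test-vector idea (our own illustration under an assumed data layout, not the actual DD-αAMG data structures): storing several test vectors interleaved at each site component makes the innermost loop run contiguously over vectors, so it maps directly onto SIMD lanes.

/* Assumed layout v[site][component][vec]: NV test vectors interleaved,
 * with NV chosen to match the SIMD width.  The innermost loop is
 * contiguous in memory and vectorizes cleanly. */
#define NV 16  /* number of test vectors (assumption of this sketch) */

void scale_add(float *restrict out, const float *restrict in,
               const float *restrict coeff, long nsite, int ncomp)
{
    for (long s = 0; s < nsite; ++s)
        for (int c = 0; c < ncomp; ++c)
            for (int v = 0; v < NV; ++v)   /* SIMD lanes run over test vectors */
                out[(s*ncomp + c)*NV + v] +=
                    coeff[s] * in[(s*ncomp + c)*NV + v];
}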
Optimization of applications for supercomputers of the highest performance class requires parallelization at multiple levels using different techniques. In this contribution we focus on the parallelization of particle-physics simulations through vector instructions. With the advent of the Scalable Vector Extension (SVE) ISA, future ARM-based processors are expected to provide a significant amount of parallelism at this level.
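As an illustration of what SVE-style, vector-length-agnostic code looks like (a generic example of ours, not taken from a particular LQCD code), consider an axpy kernel written with the ACLE SVE intrinsics. The same binary runs on any hardware vector length, with a predicate masking the loop remainder.

#include <arm_sve.h>
#include <stdint.h>

/* Vector-length-agnostic y += a*x using SVE intrinsics.  svcntd()
 * returns the number of doubles per vector at run time; the predicate
 * pg deactivates the lanes beyond n in the final iteration. */
void axpy_sve(double *restrict y, const double *restrict x,
              double a, int64_t n)
{
    for (int64_t i = 0; i < n; i += (int64_t)svcntd()) {
        svbool_t    pg = svwhilelt_b64_s64(i, n);
        svfloat64_t vx = svld1_f64(pg, x + i);
        svfloat64_t vy = svld1_f64(pg, y + i);
        vy = svmla_n_f64_x(pg, vy, vx, a);  /* vy += vx * a */
        svst1_f64(pg, y + i, vy);
    }
}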
Cancer progression can be described by continuous-time Markov chains whose state space grows exponentially in the number of somatic mutations. The age of a tumor at diagnosis is typically unknown. Therefore, the quantity of interest is the time-marginal distribution over all possible genotypes of tumors, defined as the transient distribution integrated over an exponentially distributed observation time. It can be obtained as the solution of a large linear system. However, the sheer size of this system renders classical solvers infeasible. We consider Markov chains whose transition rates are separable functions, allowing for an efficient low-rank tensor representation of the linear system’s operator. Thus we can reduce the computational complexity from exponential to linear. We derive a convergent iterative method using low-rank formats whose result satisfies the normalization constraint of a distribution. We also perform numerical experiments illustrating that the marginal distribution is well approximated with low rank.
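The linear system mentioned above can be made explicit by a standard computation (the notation, Q for the transition-rate matrix, λ for the rate of the exponential observation time, and p_0 for the initial distribution, is ours and not necessarily the paper's):

% Marginal distribution as a linear system.  With transient
% distribution p(t) = e^{tQ} p_0 and observation time t ~ Exp(lambda):
\[
  q \;=\; \int_0^\infty \lambda e^{-\lambda t}\, e^{tQ} p_0 \,\mathrm{d}t
    \;=\; \lambda \left(\lambda I - Q\right)^{-1} p_0,
  \qquad \text{i.e.} \qquad
  \Bigl(I - \tfrac{1}{\lambda} Q\Bigr) q \;=\; p_0,
\]
% where the integral converges because the eigenvalues of Q have
% non-positive real parts.  The dimension of this system grows as 2^n
% in the number n of mutations, which motivates the low-rank solver.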