Abstract: This paper presents an in-depth analysis of the impact of system noise on large-scale parallel application performance in realistic settings. Our analytical model shows that not only collective operations but also point-to-point communications influence the application's sensitivity to noise. We present a simulation toolchain that injects noise delays from traces gathered on common large-scale architectures into a LogGPS simulation, allowing new insights into the scaling of applications in noisy environments. We investigate collective operations with up to 1 million processes and three applications (Sweep3D, AMG, and POP) with up to 32,000 processes. We show that the scale at which noise becomes a bottleneck is system-specific and depends on the structure of the noise. Simulations with different network speeds show that a 10x faster network does not improve application scalability. We quantify noise and conclude that our tools can be utilized to tune the noise signatures of a specific system.

I. MOTIVATION AND BACKGROUND

The performance impact of operating system and architectural overheads (system noise) at massive scale is of increasing concern. Even small local delays on compute nodes, which can be caused by interrupts, operating system daemons, or even cache or page misses, can affect global application performance significantly [1]. Such local delays often cause less than 1% overhead per process, but severe performance losses can occur if noise is propagated (amplified) through communication or global synchronization. Previous analyses generally assume that the performance impact of system noise grows at scale, and Tsafrir et al. [2] even suggest that the impact of very low frequency noise scales linearly with the system size.

A. Related Work

Petrini, Kerbyson, and Pakin [1] report that the parallel performance of SAGE on a fixed number of ASCI Q nodes was highest when SAGE used only three of the four CPUs per node. It turned out that "resonance" between the application's collective communication and the misconfigured system caused delays during each iteration. Jones, Brenner, and Fier [3] observed similar effects with collective communication and also report that, under certain circumstances, it is beneficial to leave one CPU idle. A theoretical analysis of the influence of noise on collective communication [4] suggests that the impact of noise depends on the type of distribution and its parameters and can, in the worst case (an exponential distribution), scale linearly with the number of processes. Ferreira, Bridges, and Brightwell use noise-injection techniques to assess the impact of noise on several applications [5]. Beckman et al. [6] analyzed the performance on BlueGene/L, concluding that most sources of noise can be avoided in very specialized systems.

Previous work was either limited to experimental analysis on specific architectures with injection of artificially generated noise (fixed frequency), or to purely theoretical analyses that assume a particular collective pattern [4]. These previous...
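The noise-amplification mechanism this work studies can be illustrated independently of the authors' toolchain. Below is a minimal sketch, assuming a dissemination-style allreduce under simplified LogP-like costs; the parameters L, o, noise_prob, and noise_delay are illustrative assumptions, not the paper's measured traces. Each process occasionally suffers a local delay, and because every round synchronizes pairs of processes, the largest delay propagates to the whole collective.

```python
# Minimal sketch (not the authors' toolchain): noise injected into a
# dissemination allreduce under toy LogP-like costs. L = latency,
# o = per-message overhead; both values are illustrative assumptions.
import math
import random

def simulate_allreduce(P, L=2.0, o=0.5, noise_prob=0.0, noise_delay=100.0):
    """Return the completion time of a dissemination allreduce on P
    processes; each round, each process may suffer a random delay."""
    t = [0.0] * P                        # local clock of each process
    for r in range(math.ceil(math.log2(P))):
        dist = 1 << r
        # local noise: with probability noise_prob a process is delayed
        for p in range(P):
            if random.random() < noise_prob:
                t[p] += noise_delay
        # round r: every process sends to (p + dist) mod P and receives
        # from (p - dist) mod P; it proceeds once both have happened
        nt = []
        for p in range(P):
            sender = (p - dist) % P
            arrive = t[sender] + o + L   # message reaches process p
            nt.append(max(t[p] + o, arrive) + o)
        t = nt
    return max(t)

random.seed(0)
for P in (64, 1024, 16384):
    quiet = simulate_allreduce(P)
    noisy = simulate_allreduce(P, noise_prob=0.01, noise_delay=100.0)
    print(f"P={P:6d}  quiet={quiet:8.1f}  noisy={noisy:8.1f}")
```

Even with a 1% per-round delay probability, the max() at each synchronization point lets rare local delays dominate the completion time as P grows, which is the amplification effect the paper quantifies with real traces.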
Multistage interconnection networks based on central switches are ubiquitous in high-performance computing. Applications and communication libraries typically use such networks without considering the switch's actual internal characteristics. However, the performance applications achieve on these networks, particularly with respect to bisection bandwidth, does depend on the communication paths taken through the switch. In this paper we discuss the limitations of the hardware (capacity-based) definition of bisection bandwidth and introduce a new metric: effective bisection bandwidth. We assess the effective bisection bandwidth of several large-scale production clusters by simulating artificial communication patterns on them. Networks with full bisection bandwidth typically provided effective bisection bandwidth in the range of 55-60%. Simulations with application-based patterns showed that the difference between effective and rated bisection bandwidth can impact overall application performance by up to 12%.
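The metric can be pictured with a small Monte Carlo sketch, which is an illustration rather than the paper's simulator: sample random permutation traffic patterns on an assumed two-level fat-tree with a static destination-mod-k routing rule, charge each flow the reciprocal of the load on the most congested link along its path, and average the achieved fractions. The topology size K and the routing rule are assumptions made for the sketch.

```python
# Illustrative estimate of effective bandwidth under static routing
# (assumed two-level fat-tree, destination-mod-K path selection).
import random
from collections import defaultdict

K = 8                       # K leaf switches, K spines, K hosts per leaf
HOSTS = K * K

def links_for(src, dst):
    """Inter-switch links used by a flow under d-mod-K static routing."""
    sleaf, dleaf = src // K, dst // K
    if sleaf == dleaf:
        return []                       # stays inside one leaf switch
    spine = dst % K                     # static choice: destination mod K
    return [("up", sleaf, spine), ("down", spine, dleaf)]

def effective_bw(samples=200):
    total = 0.0
    for _ in range(samples):
        perm = list(range(HOSTS))
        random.shuffle(perm)            # random permutation pattern
        load = defaultdict(int)
        paths = []
        for s in range(HOSTS):
            path = links_for(s, perm[s])
            paths.append(path)
            for link in path:
                load[link] += 1
        # each flow runs at 1/(max load along its path) of link speed
        rates = [1.0 / max((load[l] for l in p), default=1) for p in paths]
        total += sum(rates) / HOSTS
    return total / samples

random.seed(1)
print(f"effective bandwidth fraction ~ {effective_bw():.2f}")
```

Because static routing cannot adapt to the pattern, several flows of a random permutation can collide on the same inter-switch link, so the averaged fraction falls below the full capacity-based bisection bandwidth; this gap is what the effective-bisection-bandwidth metric captures.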
Traditional database operators such as joins are relevant not only in the context of database engines but also as a building block in many computational and machine learning algorithms. With the advent of big data, there is an increasing demand for efficient join algorithms that can scale with the input data size and the available hardware resources.

In this paper, we explore the implementation of distributed join algorithms in systems with several thousand cores connected by a low-latency network, as used in high-performance computing systems or data centers. We compare radix hash join to sort-merge join algorithms and discuss their implementation at this scale. We explain how to use MPI to implement joins, show the impact and advantages of RDMA, discuss the importance of network scheduling, and study the relative performance of sorting versus hashing. The experimental results show that the algorithms we present scale well with the number of cores, reaching a throughput of 48.7 billion input tuples per second on 4,096 cores.
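The overall shape of an MPI-based distributed hash join can be sketched in a few lines. The sketch below is a simplification under stated assumptions (toy integer keys, key-mod-size as the hash, pickled-object all-to-all via mpi4py) and is not the paper's RDMA implementation: each rank hash-partitions both relations, exchanges partitions with an all-to-all, then builds and probes a local hash table.

```python
# Simplified distributed hash join with MPI (assumed sketch, not the
# paper's code). Requires mpi4py; run e.g. `mpiexec -n 4 python join.py`.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# toy local inputs: (key, payload) tuples spread across ranks
R = [(k, f"r{k}") for k in range(rank, 1000, size)]
S = [(k, f"s{k}") for k in range(rank * 2, 2000, size * 2)]

def partition(tuples):
    """Split tuples into `size` buckets by a hash of the key."""
    buckets = [[] for _ in range(size)]
    for k, v in tuples:
        buckets[k % size].append((k, v))   # toy hash: key mod size
    return buckets

# network phase: every rank sends bucket i to rank i
R_local = [t for part in comm.alltoall(partition(R)) for t in part]
S_local = [t for part in comm.alltoall(partition(S)) for t in part]

# local phase: build a hash table on R's partition, probe with S's
table = {}
for k, v in R_local:
    table.setdefault(k, []).append(v)
matches = [(k, rv, sv) for k, sv in S_local for rv in table.get(k, [])]

total = comm.reduce(len(matches), op=MPI.SUM, root=0)
if rank == 0:
    print(f"join produced {total} result tuples on {size} ranks")
```

At the scales the paper targets, the all-to-all exchange dominates, which is why the authors emphasize RDMA and network scheduling; a radix join refines the partitioning phase into multiple passes to keep each pass cache- and TLB-friendly.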
Although several pharmacogenetic (PGx) predispositions affecting drug efficacy and safety are well established, drug selection and dosing, as well as clinical trials, are often performed in a non-pharmacogenetically-stratified manner, ultimately burdening healthcare systems. Pre-emptive PGx testing offers a solution and is often performed using microarrays or targeted gene panels that test for common/known PGx variants. As an added value, however, whole-genome sequencing (WGS) can detect not only disease-causing but also pharmacogenetically relevant variants in a single assay. Here, we present our WGS-based pipeline that extends the genetic testing of Mendelian diseases with PGx profiling, enabling the detection of rare/novel PGx variants as well. From our in-house WGS (PCR-free, 60×, PE150) data of 547 individuals, we extracted PGx variants with drug-dosing recommendations from the Dutch Pharmacogenetics Working Group (DPWG). Furthermore, we explored the landscape of DPWG pharmacogenes in gnomAD and in our in-house cohort, and compared bioinformatic tools for WGS-based structural variant detection in CYP2D6. We show that although common/known PGx variants comprise the vast majority of detected DPWG pharmacogene alleles, PGx testing should move towards WGS-based approaches for better precision medicine. Indeed, WGS-based PGx profiling is not only feasible and future-oriented but also the most comprehensive all-in-one approach, without generating significant additional costs.
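The extraction step described can be pictured as region-filtering a WGS VCF against pharmacogene coordinates. The sketch below is a hypothetical simplification: the gene coordinates in PHARMACOGENES and the pgx_variants helper are placeholders (real coordinates must come from an annotation source), and a production pipeline would use dedicated variant annotation and star-allele calling tools rather than plain-text parsing.

```python
# Hypothetical sketch: pull variants falling inside pharmacogene regions
# out of a bgzipped WGS VCF. Coordinates below are placeholders, not a
# validated DPWG pharmacogene annotation.
import gzip

PHARMACOGENES = {
    "CYP2D6": ("chr22", 42_126_499, 42_130_881),     # placeholder region
    "CYP2C19": ("chr10", 94_762_681, 94_855_547),    # placeholder region
}

def pgx_variants(vcf_path):
    """Yield (gene, chrom, pos, ref, alt) for records inside pharmacogenes."""
    with gzip.open(vcf_path, "rt") as fh:
        for line in fh:
            if line.startswith("#"):
                continue                               # skip VCF header
            chrom, pos, _id, ref, alt = line.split("\t")[:5]
            pos = int(pos)
            for gene, (gchrom, start, end) in PHARMACOGENES.items():
                if chrom == gchrom and start <= pos <= end:
                    yield gene, chrom, pos, ref, alt

# usage: for rec in pgx_variants("sample.vcf.gz"): print(rec)
```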