SC20: International Conference for High Performance Computing, Networking, Storage and Analysis 2020
DOI: 10.1109/sc41405.2020.00062
Fast Stencil-Code Computation on a Wafer-Scale Processor

Cited by 46 publications (31 citation statements)
References 14 publications
“…Recently, researchers at Argonne National Laboratory developed an AI-driven simulation framework for solving the same MD problem, yielding a 50x speedup in time to solution over the traditional HPC method [616]. And some work suggests these approaches need not be mutually exclusive: it has been shown in the context of computational fluid dynamics that traditional HPC workloads can be run alongside AI training to provide accelerated data-feed paths [617]. Co-locating workloads in this way may be necessary for petascale (approaching exascale) scientific simulations. Compute aside, supercomputers or large clusters (distributed compute nodes) are the only way to host some of the largest currently available models: as their memory requirements grow into trillions of parameters, partitioning the model is the only way.…”
Section: Accelerated Computing (mentioning)
confidence: 99%
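A rough back-of-the-envelope sketch (not taken from the cited works) of why trillion-parameter models force partitioning across nodes: assuming fp16 (2-byte) parameters and a hypothetical 80 GiB per-node memory budget, the minimum node count follows directly from memory arithmetic.

```python
# Illustrative only: assumes 2-byte (fp16) parameters and a hypothetical
# 80 GiB per-node memory budget; real deployments also need memory for
# activations, optimizer state, and communication buffers.
def min_nodes(num_params, bytes_per_param=2, node_memory_gib=80):
    """Smallest node count whose combined memory can hold the raw parameters."""
    total_bytes = num_params * bytes_per_param
    node_bytes = node_memory_gib * 2**30
    return -(-total_bytes // node_bytes)  # ceiling division

# One trillion fp16 parameters is ~2 TB of weights alone, so the model
# cannot fit on a single accelerator and must be partitioned.
print(min_nodes(1_000_000_000_000))  # -> 24 under these assumptions
```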
“…5, for some source-destination pairs there may be several alternative shortest path vectors, each of which specifies its own subset of the reserve shortest paths. The choice of the most promising vector (from the point of view of providing the largest number of reserve paths when moving along the shortest path) is determined on the basis of Lemma 4 derived from (1), but the first one has a lower denominator, Q.E.D. This implies a possible routing strategy: when routing a packet, choose the coordinate of the shortest path vector whose decrease by 1 preserves the greatest number of reserve paths when moving along the shortest path.…”
Section: Lemma 2 For Any Pair Source-Destination With Numbers (mentioning)
confidence: 99%
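The quoted strategy can be illustrated with a small sketch (an interpretation, not code from the cited paper): in a mesh, the number of shortest paths for a hop-offset vector (d_1, ..., d_n) is the multinomial coefficient (d_1 + ... + d_n)! / (d_1! · ... · d_n!), so the coordinate whose decrement preserves the most reserve paths can be found by direct counting.

```python
from math import factorial, prod

def minimal_paths(offsets):
    """Number of distinct shortest paths for a hop-offset vector in a mesh:
    the multinomial coefficient (sum d_i)! / (d_1! * ... * d_n!)."""
    total = sum(offsets)
    return factorial(total) // prod(factorial(d) for d in offsets)

def best_coordinate(offsets):
    """Coordinate whose decrease by 1 leaves the most shortest (reserve) paths.
    Illustrative reading of the routing strategy quoted above."""
    best_i, best_count = None, -1
    for i, d in enumerate(offsets):
        if d == 0:
            continue
        reduced = list(offsets)
        reduced[i] -= 1
        count = minimal_paths(reduced)
        if count > best_count:
            best_i, best_count = i, count
    return best_i, best_count

# Example: for offsets (3, 1), reducing the larger coordinate keeps 3 of the
# 4 shortest paths, versus only 1 if the smaller coordinate is reduced.
print(best_coordinate([3, 1]))  # -> (0, 3)
```

Decrementing the largest component shrinks the multinomial count the least (the remaining count equals the original times d_i divided by the total hop count), which is consistent with the "lower denominator" comparison in the quoted lemma.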
“…The development of multiprocessor systems-on-chip (MPSoCs) has become a ubiquitous trend, and modern chips can now accommodate tens, hundreds or even thousands of processor cores. For example, the latest versions of the WSE2 chip from Cerebras can contain up to 850,000 computing cores [1,2], and the project from Esperanto Technologies promises 1088 energy-efficient ET-Minion 64-bit RISC-V cores, each with a vector/tensor unit, in the ET-SoC-1 chip [3]. The operation of such large MPSoCs is not possible without a high-performance communication subsystem, a role currently performed by the network-on-chip (NoC).…”
Section: Introduction (mentioning)
confidence: 99%
“…For example, matrix dimension sizes span from single digits to millions, while matrix sparsity spans from ∼10⁻⁵% dense to fully dense [9]. The vast range of workloads has led to many accelerator architecture proposals, as accelerators achieve higher throughput than CPUs and higher energy efficiency than GPUs [21], [46]. [48] performs well for workloads of high unstructured sparsity, but not for dense computations, due to its sparse controller overhead. Large datacenters require flexibility, as they must have the compute and memory resources to perform all current and future workloads efficiently.…”
Section: Introduction (mentioning)
confidence: 99%