Abstract. In this article, we introduce a cache-oblivious method for sparse matrix-vector multiplication. Our method attempts to permute the rows and columns of the input matrix using a recursive hypergraph-based sparse matrix partitioning scheme so that the resulting matrix induces cache-friendly behavior during sparse matrix-vector multiplication. Matrices are assumed to be stored in row-major format, by means of the compressed row storage (CRS) or its variants incremental CRS and zig-zag CRS. The zig-zag CRS data structure is shown to fit well with the hypergraph metric used in partitioning sparse matrices for the purpose of parallel computation. The separated block-diagonal (SBD) form is shown to be the appropriate matrix structure for cache enhancement. We have implemented a run-time cache simulation library enabling us to analyze cache behavior for arbitrary matrices and arbitrary cache properties during matrix-vector multiplication within a k-way set-associative idealized cache model. The results of these simulations are then verified by actual experiments run on various cache architectures. In all these experiments, we use the Mondriaan sparse matrix partitioner in one-dimensional mode. The savings in computation time achieved by our matrix reorderings reach up to 50 percent, in the case of a large link matrix.
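To make the storage schemes mentioned above concrete, the following is a minimal sketch, not the authors' implementation, of an SpMV kernel over CRS arrays together with a zig-zag variant that alternates the traversal direction of successive rows; the function and array names are illustrative only.

```c
#include <stddef.h>

/* y = A*x with A in compressed row storage (CRS):
 * row_start has m+1 entries; col_idx/val hold the nonzeros row by row. */
void spmv_crs(size_t m, const size_t *row_start,
              const size_t *col_idx, const double *val,
              const double *x, double *y)
{
    for (size_t i = 0; i < m; ++i) {
        double sum = 0.0;
        for (size_t k = row_start[i]; k < row_start[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}

/* Zig-zag variant: odd-numbered rows are traversed right-to-left, so the
 * x-elements touched last in one row are touched first in the next row,
 * which is the cache-friendly access pattern zig-zag CRS aims for.
 * (Here the reversal is done on the fly; an actual zig-zag CRS data
 * structure would store the nonzeros of odd rows in reversed order.) */
void spmv_crs_zigzag(size_t m, const size_t *row_start,
                     const size_t *col_idx, const double *val,
                     const double *x, double *y)
{
    for (size_t i = 0; i < m; ++i) {
        double sum = 0.0;
        if (i % 2 == 0) {
            for (size_t k = row_start[i]; k < row_start[i + 1]; ++k)
                sum += val[k] * x[col_idx[k]];
        } else {
            for (size_t k = row_start[i + 1]; k-- > row_start[i]; )
                sum += val[k] * x[col_idx[k]];
        }
        y[i] = sum;
    }
}
```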
The sparse matrix-vector multiplication is an important computational kernel, but it is hard to execute efficiently even in the sequential case. Its problems, namely low arithmetic intensity, inefficient cache use, and limited memory bandwidth, are magnified as the core count of shared-memory parallel architectures increases. Existing techniques are discussed in detail and categorised chiefly by their distribution types. Based on this categorisation, new parallelisation techniques are proposed. The theoretical scalability and memory usage of the various strategies are analysed, and experiments on multiple NUMA architectures confirm the validity of the results. One of the newly proposed methods attains the best average results in these experiments, reaching a parallel efficiency of 90 percent in one of them.
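As a point of reference for the distribution types mentioned above, the following is a minimal sketch, not one of the paper's proposed methods, of the simplest shared-memory parallelisation: a one-dimensional row-wise distribution of a CRS matrix using OpenMP, reusing the illustrative CRS arrays from the previous sketch.

```c
#include <stddef.h>

/* One-dimensional (row-wise) parallel SpMV: each thread handles a block of
 * rows, so y is written without conflicts while x is read by all threads. */
void spmv_crs_rowwise(size_t m, const size_t *row_start,
                      const size_t *col_idx, const double *val,
                      const double *x, double *y)
{
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < m; ++i) {
        double sum = 0.0;
        for (size_t k = row_start[i]; k < row_start[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}
```

Such a 1D scheme distributes the rows of the matrix and the output vector, but leaves the input vector shared, which is one of the sources of cache and bandwidth inefficiency that more refined distributions address.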
The bulk synchronous parallel (BSP) model, as well as parallel programming interfaces based on BSP, classically target distributed-memory parallel architectures. In earlier work, Yzelman and Bisseling designed the MulticoreBSP for Java library specifically for shared-memory architectures. In the present article, we further investigate this concept and introduce the new high-performance MulticoreBSP for C library. Among other features, this library supports nested BSP runs. We show that existing BSP software performs well regardless of whether it runs on distributed-memory or shared-memory architectures, and show that applications in MulticoreBSP can attain high-performance results. The paper details BSP implementations of the Fast Fourier Transform and the sparse matrix-vector multiplication, both of which outperform state-of-the-art implementations written in other shared-memory parallel programming interfaces. We furthermore study the applicability of BSP when working on highly non-uniform memory access architectures.

The bulk synchronous parallel (BSP) model [19], introduced by Valiant, describes a powerful abstraction of parallel computers. It enables the design of theoretically optimal parallel algorithms, and has inspired many interfaces for parallel programming. The BSP model consists of three parts: (1) an abstraction of a parallel computer, (2) an abstraction of a parallel algorithm, and (3) a cost model. A BSP computer has p homogeneous processors, each with access to local memory. They cannot access remote memory, but may communicate through a black-box network interconnect. Preparing the network for all-to-all communication while synchronising the p processors at the start and end of communication costs l units; sending a data word during the all-to-all communication costs g units. Measuring l and g in seconds does not directly relate to any work done; instead, if the speed r of each processor is measured in floating-point operations per second (flop/s), we express l and g in flops as well. The four parameters (p, r, l, g) completely define a BSP computer.

A BSP algorithm runs on a BSP computer and adheres to the Single Program, Multiple Data (SPMD) paradigm. Each BSP process consists of alternating computation and communication phases. During computation, each process executes sequential code and cannot communicate with other BSP processes; during communication, all processes are involved in an all-to-all data interchange and cannot perform any computations. BSP synchronises all processors in-between phases. We define one superstep as one computation phase combined with the communication phase that directly follows it.

This definition of a BSP computer and a BSP algorithm immediately leads to the BSP cost model. If the algorithm consists of T supersteps, and if process s performs $w_i^{(s)}$ work and sends or receives at most $h_i^{(s)}$ data words in superstep i, then the total BSP cost of the algorithm is $\sum_{i=0}^{T-1}\left(\max_s w_i^{(s)} + g\,\max_s h_i^{(s)} + l\right)$ flops.
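The following is a minimal sketch of the SPMD superstep structure against a BSPlib-style interface such as the one MulticoreBSP for C provides; the header name and the partial-sum example are assumptions for illustration, not code from the paper.

```c
#include <stdio.h>
#include <bsp.h>   /* assumed BSPlib-style header; the actual name may differ */

/* Each process computes a local value (computation phase), then all values
 * are exchanged in one communication phase that ends with bsp_sync(). */
static void spmd(void)
{
    bsp_begin(bsp_nprocs());
    const unsigned p = bsp_nprocs();
    const unsigned s = bsp_pid();

    double partial[64];                     /* assume p <= 64 for this sketch */
    bsp_push_reg(partial, p * sizeof(double));
    bsp_sync();                             /* superstep 0: registration */

    double local = (double)(s + 1);         /* stand-in for real local work */
    for (unsigned t = 0; t < p; ++t)        /* superstep 1: all-to-all exchange */
        bsp_put(t, &local, partial, s * sizeof(double), sizeof(double));
    bsp_sync();

    double total = 0.0;                     /* superstep 2: local reduction */
    for (unsigned t = 0; t < p; ++t)
        total += partial[t];
    if (s == 0)
        printf("sum over %u processes: %f\n", p, total);

    bsp_pop_reg(partial);
    bsp_end();
}

int main(int argc, char **argv)
{
    bsp_init(spmd, argc, argv);
    spmd();
    return 0;
}
```

In terms of the cost model above, supersteps 1 and 2 of this sketch cost roughly $g\,p + l$ and $p + l$ flops, respectively, since every process sends and receives p words in the exchange and then performs p additions.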
Abstract. In earlier work, we presented a one-dimensional cache-oblivious sparse matrix-vector (SpMV) multiplication scheme which has its roots in one-dimensional sparse matrix partitioning. Partitioning is often used in distributed-memory parallel computing for the SpMV multiplication, an important kernel in many applications. A logical extension is to move towards using a two-dimensional partitioning. In this paper, we present our research in this direction, extending the one-dimensional method for cache-oblivious SpMV multiplication to two dimensions, while still allowing only row and column permutations on the sparse input matrix. This extension requires a generalisation of the compressed row storage data structure to a block-based data structure, for which several variants are investigated. Experiments performed on three different architectures show further improvements of the two-dimensional method compared to the one-dimensional method, especially in those cases where the one-dimensional method already provided significant gains. The largest gain obtained by our new reordering is over a factor of 3 in SpMV speed, compared to the natural matrix ordering.

Our goal is to obtain the best performance without knowledge of the cache parameters, such as cache size, line size, et cetera. This approach was introduced by Frigo et al. and is called cache-oblivious [5]. The advantage of such methods is that they work irrespective of hardware details, which may be intricate and can vary tremendously from machine to machine. Cache-oblivious approaches are often based on recursion, as this enables subsequent decreases of the problem size until the problem fits in cache. In the case of SpMV multiplication, permuting the rows and columns of the matrix A, while permuting the vectors x and y accordingly, can be done in a cache-oblivious way to improve cache use. A further improvement is changing the order of access to the individual nonzero elements of A. Both of these methods are explored in this work.

The organisation of this paper is as follows: we first proceed with briefly explaining the 1D method in Section 1.1 and presenting related work in Section 1.2, and immediately follow up with the extension to 2D in Section 2. These methods are subjected to numerical experiments in Section 3. We draw our conclusions in Section 4.

1.1. The one-dimensional scheme

The sparsity structure of an $m \times n$ matrix A can be modelled by a hypergraph $\mathcal{H} = (\mathcal{V}, \mathcal{N})$ using the row-net model, which will briefly be described here; for a broader introduction, see Çatalyürek et al. [2]. The columns of A are modelled by the vertices in $\mathcal{V}$, and the rows by the nets (or hyperedges) in $\mathcal{N}$, where a net is a subset of the vertices. Each net contains precisely those vertices (i.e., columns) that have a nonzero in the corresponding row of A. A partitioning of a matrix into p parts is a partitioning of $\mathcal{V}$ into nonempty subsets $\mathcal{V}_0, \ldots, \mathcal{V}_{p-1}$, with each pair of subsets disjoint and $\bigcup_j \mathcal{V}_j = \mathcal{V}$. Given such a partitioning, the connectivity $\lambda_i$ of a net $n_i \in \mathcal{N}$ ...
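To make the row-net model concrete, here is a small sketch, with illustrative names and not code from the paper, that computes the connectivity $\lambda_i$ of every row-net of a CRS matrix, given an assignment of each column (vertex) to one of p parts.

```c
#include <stdbool.h>
#include <stdlib.h>

/* For each row i of an m x n CRS matrix, count the number of distinct parts
 * that the columns with a nonzero in row i are assigned to; this count is
 * the connectivity lambda_i of row-net n_i in the row-net hypergraph model. */
void rownet_connectivity(size_t m, const size_t *row_start,
                         const size_t *col_idx,
                         const unsigned *part_of_col, /* length n, values 0..p-1 */
                         unsigned p, unsigned *lambda) /* lambda has length m */
{
    bool *seen = calloc(p, sizeof(bool));
    for (size_t i = 0; i < m; ++i) {
        unsigned count = 0;
        for (size_t k = row_start[i]; k < row_start[i + 1]; ++k) {
            unsigned q = part_of_col[col_idx[k]];
            if (!seen[q]) { seen[q] = true; ++count; }
        }
        lambda[i] = count;
        /* reset only the marks touched in this row */
        for (size_t k = row_start[i]; k < row_start[i + 1]; ++k)
            seen[part_of_col[col_idx[k]]] = false;
    }
    free(seen);
}
```

In hypergraph partitioning for parallel SpMV, the quantity typically minimised is $\sum_i (\lambda_i - 1)$, the connectivity-1 metric, which measures the communication volume induced by the column partitioning.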
Computations on tensors, treated as multidimensional arrays, revolve around generalized basic linear algebra subroutines (BLAS). We propose a novel data structure in which tensors are blocked and the blocks are stored in Morton order. This is proposed not only for efficiency, but also to obtain performance that does not depend on the mode for which a generalized BLAS call is invoked; we coin the term mode-oblivious to describe data structures and algorithms that induce such behavior. Experiments on one of the most bandwidth-bound generalized BLAS kernels, the tensor-vector multiplication, not only demonstrate superior performance over two state-of-the-art variants by up to 18%, but additionally show that the proposed data structure induces a 71% lower sample standard deviation for tensor-vector multiplication across d modes, where d varies from 2 to 10. Finally, we show that our data structure naturally extends to other tensor kernels and demonstrate up to 38% higher performance for the higher-order power method.
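As an illustration of the storage order, here is a small sketch, not the paper's implementation, of a Morton index for two block coordinates, obtained by interleaving their bits; iterating blocks by increasing Morton index gives the recursive, locality-preserving traversal the abstract refers to.

```c
#include <stdint.h>

/* Spread the lower 32 bits of x so that one zero bit sits between
 * consecutive bits (binary ...abc becomes ...0a0b0c). */
static uint64_t spread_bits(uint64_t x)
{
    x &= 0xFFFFFFFFULL;
    x = (x | (x << 16)) & 0x0000FFFF0000FFFFULL;
    x = (x | (x << 8))  & 0x00FF00FF00FF00FFULL;
    x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0FULL;
    x = (x | (x << 2))  & 0x3333333333333333ULL;
    x = (x | (x << 1))  & 0x5555555555555555ULL;
    return x;
}

/* Morton (Z-order) index of block coordinates (i, j): the bits of i and j
 * are interleaved, so blocks that are close in 2D stay close in memory. */
uint64_t morton2d(uint32_t i, uint32_t j)
{
    return (spread_bits(i) << 1) | spread_bits(j);
}
```

For a d-dimensional tensor the same idea applies, interleaving the bits of d block coordinates, which is what makes the resulting layout behave symmetrically across modes.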