ParSy: Inspection and Transformation of Sparse Matrix Computations for Parallelism

Cheshmi, Kazem; Kamil, Shoaib; Strout, Michelle Mills; Dehnavi, Maryam Mehri

doi:10.1109/sc.2018.00065

Cited by 19 publications

(19 citation statements)

References 46 publications


“…Wavefronts of the joint DAG can be aggregated to reduce the number of synchronizations. DAG partitioners such as Load-Balanced Level Coarsening (LBC) [8] and DAGP [20] apply aggregation; however, when they are applied to the joint DAG, load imbalance might still occur because they aggregate iterations from consecutive wavefronts. Also, by aggregating iterations from wavefronts in the joint DAG, DAG partitioning methods potentially improve the temporal locality between the two kernels, but this can disturb spatial locality within each kernel.…”

Section: Unfused (mentioning)

confidence: 99%
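To make "wavefronts" concrete, here is a minimal Python sketch. It assumes (the quote does not say so) that the kernel with loop-carried dependencies is a sparse lower-triangular solve stored in CSR, where row i depends on every row j < i with a stored nonzero L[i][j]. The function `wavefronts` is hypothetical: it groups rows into level sets, so rows in one wavefront can run in parallel and a barrier separates consecutive wavefronts. A schedule with k wavefronts therefore needs k - 1 synchronizations, which is the count that aggregation methods such as LBC and DAGP try to reduce.

```python
from typing import List, Sequence


def wavefronts(row_ptr: Sequence[int], col_idx: Sequence[int]) -> List[List[int]]:
    """Group the rows of a lower-triangular CSR matrix into wavefronts (level sets).

    Row i depends on row j (j < i) whenever L[i][j] is a stored nonzero, so
    rows that share a level have no dependencies among themselves.
    """
    n = len(row_ptr) - 1
    level = [0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j < i:
                # Row i must run in a later wavefront than row j.
                level[i] = max(level[i], level[j] + 1)
    num_levels = max(level) + 1 if n > 0 else 0
    fronts: List[List[int]] = [[] for _ in range(num_levels)]
    for i, lv in enumerate(level):
        fronts[lv].append(i)
    return fronts


# 4x4 lower-triangular pattern: row 2 needs row 0; row 3 needs rows 1 and 2.
row_ptr = [0, 1, 2, 4, 7]
col_idx = [0, 1, 0, 2, 1, 2, 3]
fronts = wavefronts(row_ptr, col_idx)
print(fronts)                                # [[0, 1], [2], [3]]
print("synchronizations:", len(fronts) - 1)  # synchronizations: 2
```

Because every dependency points to a lower row index, a single forward pass sees each level[j] in its final form before it is used.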

“…$c(v_i)$ is the computational load of a vertex and is defined as the total number of nonzeros touched to complete its computation. Because sparse matrix computations are generally memory bandwidth-bound, $c(v_i)$ is a good metric to evaluate load balance in the algorithm [8]. $F$ is stored in the compressed sparse row (CSR) format, and $F_i$ is used to extract the set of vertices in $G_1$ that $v_i \in V_2$ depends on.…”

Section: Inputs and Output To Msp (mentioning)

confidence: 99%
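The two ingredients of this quote can be sketched in Python under stated assumptions: each vertex $v_i$ is taken to correspond to row i of a CSR matrix, so its load is approximated by that row's stored nonzeros (the quoted "total number of nonzeros touched" may count more, for example entries of other operands), and $F$ is a CSR pattern whose row i lists the $G_1$ vertices that $v_i \in V_2$ depends on. All helper names are hypothetical.

```python
from typing import List, Sequence


def deps_of(f_row_ptr: Sequence[int], f_col_idx: Sequence[int], i: int) -> List[int]:
    """F_i: the G1 vertices that G2 vertex i depends on (row i of CSR matrix F)."""
    return list(f_col_idx[f_row_ptr[i]:f_row_ptr[i + 1]])


def vertex_cost(row_ptr: Sequence[int], i: int) -> int:
    """Assumed approximation of c(v_i): number of stored nonzeros in row i."""
    return row_ptr[i + 1] - row_ptr[i]


def partition_costs(row_ptr: Sequence[int], parts: Sequence[Sequence[int]]) -> List[int]:
    """Total cost of each partition; the spread between them indicates load imbalance."""
    return [sum(vertex_cost(row_ptr, i) for i in part) for part in parts]


# F for three G2 vertices: v_0 needs G1 vertices {0, 1}, v_1 needs none, v_2 needs {1}.
f_row_ptr = [0, 2, 2, 3]
f_col_idx = [0, 1, 1]
print([deps_of(f_row_ptr, f_col_idx, i) for i in range(3)])  # [[0, 1], [], [1]]

row_ptr = [0, 1, 2, 4, 7]  # per-row nonzero counts: 1, 1, 2, 3
print(partition_costs(row_ptr, [[0, 1, 2], [3]]))              # [4, 3]
```

Storing $F$ in CSR makes extracting $F_i$ a constant-time slice plus a copy proportional to the number of dependencies.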

“…The fused partitioning shown in Figure 2e has two s-partitions ($b=2$). The first s-partition has three w-partitions ($m_1=3$) shown with $V_{s_1} = \{[1, 2, 3, 4]; [5, 6, 5, 6]; [7, 8, 9, 9]\}$; the underscored vertices belong to $G_1$.…”

Section: Inputs and Output To Msp (mentioning)

confidence: 99%
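One way to read this two-level structure is sketched below in Python. It assumes (the quote does not state it) that s-partitions execute one after another with a synchronization between them, while the w-partitions inside one s-partition are independent and may run concurrently. Vertices are tagged with their graph (1 for $G_1$, 2 for $G_2$), which is the distinction the underscoring marks in the quoted example. The function names and the sample schedule are hypothetical and are not taken from Figure 2e. Because of CPython's GIL, the threads give no real speedup for pure-Python work; the sketch only shows the control structure.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

Vertex = Tuple[int, int]  # (graph id: 1 for G1, 2 for G2, vertex index)


def _run_w_partition(w_part: Sequence[Vertex], compute: Callable[[Vertex], None]) -> None:
    # Vertices inside one w-partition run sequentially, in the given order.
    for v in w_part:
        compute(v)


def run_fused(s_partitions: Sequence[Sequence[Sequence[Vertex]]],
              compute: Callable[[Vertex], None]) -> int:
    """Execute s-partitions in order; returns the number of synchronization points."""
    barriers = 0
    with ThreadPoolExecutor() as pool:
        for s_part in s_partitions:
            # Assumed independent: the w-partitions of one s-partition run concurrently.
            futures = [pool.submit(_run_w_partition, w, compute) for w in s_part]
            for fut in futures:
                fut.result()  # waiting on every future acts as the barrier
            barriers += 1
    return barriers


# Hypothetical fused partitioning with b = 2 s-partitions; the first has m_1 = 3 w-partitions.
schedule = [
    [[(1, 0), (2, 0)], [(1, 1), (2, 1)], [(1, 2)]],
    [[(2, 2), (2, 3)]],
]
done: List[Vertex] = []
print(run_fused(schedule, done.append))  # 2
```

Mixing $G_1$ and $G_2$ vertices inside one w-partition is what lets a fused schedule reuse data between the two kernels, provided each $G_2$ vertex is placed after the $G_1$ vertices it depends on.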


Composing Loop-carried Dependence with Other Loops

Cheshmi, Strout, Dehnavi (2021). Preprint (self-citation).


Sparse fusion is a compile-time loop transformation and runtime scheduling implemented as a domain-specific code generator. Sparse fusion generates efficient parallel code for the combination of two sparse matrix kernels where at least one of the kernels has loop-carried dependencies. Existing implementations optimize sparse kernels individually. When optimized separately, the irregular dependence patterns of sparse kernels create synchronization overheads and load imbalance, and their irregular memory access patterns result in inefficient cache usage, which reduces parallel efficiency. Sparse fusion uses a novel inspection strategy with code transformations to generate parallel fused code for sparse kernel combinations that is optimized for data locality and load balance. Code generated by Sparse fusion outperforms the existing implementations ParSy and MKL by 1.6× and 5.1× on average, respectively, and outperforms the LBC and DAGP coarsening strategies applied to a fused data dependence graph by 5.1× and 7.2× on average, respectively, for various kernel combinations.
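For readers unfamiliar with such kernel pairs, the Python sketch below shows an unfused baseline, assuming a combination of a sparse lower-triangular solve (which has a loop-carried dependence) followed by a sparse matrix-vector multiply (which does not). This pairing is an illustrative assumption, since the abstract only mentions "various kernel combinations", and the code is the separately executed baseline, not Sparse fusion itself. It also assumes every row of L stores its diagonal entry.

```python
from typing import List, Sequence


def sptrsv_csr(row_ptr: Sequence[int], col_idx: Sequence[int],
               vals: Sequence[float], b: Sequence[float]) -> List[float]:
    """Solve L x = b for lower-triangular L in CSR (diagonal assumed stored)."""
    n = len(row_ptr) - 1
    x = [0.0] * n
    for i in range(n):  # loop-carried: x[i] reads x[j] from earlier iterations
        s = b[i]
        diag = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j < i:
                s -= vals[k] * x[j]
            elif j == i:
                diag = vals[k]
        x[i] = s / diag
    return x


def spmv_csr(row_ptr: Sequence[int], col_idx: Sequence[int],
             vals: Sequence[float], x: Sequence[float]) -> List[float]:
    """y = A x; rows are independent, so there is no loop-carried dependence."""
    n = len(row_ptr) - 1
    return [sum(vals[k] * x[col_idx[k]] for k in range(row_ptr[i], row_ptr[i + 1]))
            for i in range(n)]


# L = [[2, 0], [1, 4]] in CSR.
row_ptr, col_idx, vals = [0, 1, 3], [0, 0, 1], [2.0, 1.0, 4.0]
x = sptrsv_csr(row_ptr, col_idx, vals, [2.0, 9.0])  # kernel 1
y = spmv_csr(row_ptr, col_idx, vals, x)             # kernel 2, run after kernel 1 finishes
print(x, y)  # [1.0, 2.0] [2.0, 9.0]
```

Row i of the second kernel only needs the entries of x named in its own column indices, so it could start before the whole solve completes; exploiting that kind of cross-kernel dependence is what a fused schedule can do and this baseline cannot.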


“…Several sparse tensor compilers target linear algebra operations (e.g., taco [Chou et al. 2018; Kjolstad et al. 2017], ParSy [Cheshmi et al. 2017, 2018]). They focus on constructing efficient iteration spaces between different sparse matrices under linear algebra operations.…”

Section: Related Work, 8.1 Array Compilers (mentioning)

confidence: 99%
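As a small illustration of what an iteration space over two sparse operands means, the Python sketch below co-iterates two sparse vectors stored as sorted index/value arrays and computes their dot product by visiting only the intersection of their nonzero indices. It is a hand-written example of the general idea, not code produced by taco or ParSy, and the function name is hypothetical. An element-wise addition would instead have to iterate over the union of the two index sets.

```python
from typing import Sequence


def sparse_dot(a_idx: Sequence[int], a_val: Sequence[float],
               b_idx: Sequence[int], b_val: Sequence[float]) -> float:
    """Dot product of two sparse vectors given as sorted (index, value) arrays."""
    i = j = 0
    total = 0.0
    # A product is nonzero only where both operands are nonzero, so a
    # two-pointer merge over the sorted index arrays covers the iteration space.
    while i < len(a_idx) and j < len(b_idx):
        if a_idx[i] == b_idx[j]:
            total += a_val[i] * b_val[j]
            i += 1
            j += 1
        elif a_idx[i] < b_idx[j]:
            i += 1
        else:
            j += 1
    return total


print(sparse_dot([0, 2, 5], [1.0, 2.0, 3.0], [2, 3, 5], [4.0, 5.0, 6.0]))  # 26.0
```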

Taichi

Hu, Anderson, et al. (2019). ACM Trans. Graph.

160


3D visual computing data are often spatially sparse. To exploit such sparsity, people have developed hierarchical sparse data structures, such as multi-level sparse voxel grids, particles, and 3D hash tables. However, developing and using these high-performance sparse data structures is challenging, due to their intrinsic complexity and overhead. We propose Taichi , a new data-oriented programming language for efficiently authoring, accessing, and maintaining such data structures. The language offers a high-level, data structure-agnostic interface for writing computation code. The user independently specifies the data structure. We provide several elementary components with different sparsity properties that can be arbitrarily composed to create a wide range of multi-level sparse data structures. This decoupling of data structures from computation makes it easy to experiment with different data structures without changing computation code, and allows users to write computation as if they are working with a dense array. Our compiler then uses the semantics of the data structure and index analysis to automatically optimize for locality, remove redundant operations for coherent accesses, maintain sparsity and memory allocations, and generate efficient parallel and vectorized instructions for CPUs and GPUs. Our approach yields competitive performance on common computational kernels such as stencil applications, neighbor lookups, and particle scattering. We demonstrate our language by implementing simulation, rendering, and vision tasks including a material point method simulation, finite element analysis, a multigrid Poisson solver for pressure projection, volumetric path tracing, and 3D convolution on sparse grids. Our computation-data structure decoupling allows us to quickly experiment with different data arrangements, and to develop high-performance data structures tailored for specific computational tasks. 
With 1/10th as many lines of code, we achieve 4.55× higher performance on average, compared to hand-optimized reference implementations.
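The decoupling of data structures from computation described in this abstract can be illustrated, by loose analogy only, in plain Python: the neighbor-sum computation below is written once against a small get/set interface and runs unchanged on a dense array or on a hash-based sparse layout. The class and function names are hypothetical; this is not Taichi's API, and it omits everything that makes Taichi fast (compilation, locality optimization, vectorization).

```python
from typing import Dict, List, Protocol, Tuple


class Field(Protocol):
    def get(self, i: int, j: int) -> float: ...
    def set(self, i: int, j: int, v: float) -> None: ...


class DenseField:
    """n x n dense storage; reads outside the grid return 0.0."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.data: List[List[float]] = [[0.0] * n for _ in range(n)]

    def get(self, i: int, j: int) -> float:
        if 0 <= i < self.n and 0 <= j < self.n:
            return self.data[i][j]
        return 0.0

    def set(self, i: int, j: int, v: float) -> None:
        self.data[i][j] = v


class HashField:
    """Sparse storage: only cells that were written occupy memory."""

    def __init__(self) -> None:
        self.data: Dict[Tuple[int, int], float] = {}

    def get(self, i: int, j: int) -> float:
        return self.data.get((i, j), 0.0)

    def set(self, i: int, j: int, v: float) -> None:
        self.data[(i, j)] = v


def neighbor_sum(f: Field, i: int, j: int) -> float:
    """Computation code: it never looks at how f stores its values."""
    return f.get(i - 1, j) + f.get(i + 1, j) + f.get(i, j - 1) + f.get(i, j + 1)


for field in (DenseField(4), HashField()):
    field.set(1, 2, 3.0)
    field.set(2, 1, 4.0)
    print(type(field).__name__, neighbor_sum(field, 1, 1))  # 7.0 for both layouts
```

Swapping the storage class changes memory use and access cost but not the computation code, which is the experiment-friendly property the abstract emphasizes.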


“…Cheshmi et al. [6,7] developed Sympiler, an inspector-executor (I/E) compiler that optimizes sparse computations by exploiting properties of the sparsity structure. This leads to executor code that leverages the sparsity pattern within the computation.…”

Section: Related Work (mentioning)

confidence: 99%
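The inspector/executor split can be sketched in a few lines of Python. In this toy, which is far simpler than what Sympiler does and is not its method, the inspector reads a fixed CSR sparsity pattern once and generates a specialized SpMV executor with every column index baked in as a constant; the executor can then be reused for any numerical values that share that pattern. The function name `inspect_spmv` is hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def inspect_spmv(row_ptr: Sequence[int], col_idx: Sequence[int]
                 ) -> Callable[[Sequence[float], Sequence[float]], List[float]]:
    """Inspector: read the sparsity pattern once and generate a specialized SpMV.

    The generated executor never consults col_idx; every column index is a
    literal in its source code.
    """
    n = len(row_ptr) - 1
    lines = ["def executor(vals, x):", f"    y = [0.0] * {n}"]
    for i in range(n):
        terms = [f"vals[{k}] * x[{col_idx[k]}]" for k in range(row_ptr[i], row_ptr[i + 1])]
        if terms:
            lines.append(f"    y[{i}] = " + " + ".join(terms))
    lines.append("    return y")
    namespace: Dict[str, object] = {}
    exec("\n".join(lines), namespace)
    return namespace["executor"]  # type: ignore[return-value]


executor = inspect_spmv([0, 1, 3], [0, 0, 1])  # inspection runs once per pattern
print(executor([2.0, 1.0, 4.0], [1.0, 2.0]))   # [2.0, 9.0]
print(executor([1.0, 1.0, 1.0], [1.0, 2.0]))   # [1.0, 3.0], same pattern, new values
```

The cost of inspection is paid once and amortized over every executor call, which is why I/E approaches pay off when the same sparsity pattern is reused many times.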

Generating piecewise-regular code from irregular structures

Augustine, Sarma, Pouchet, et al. (2019). Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation.


Irregular data structures, as exemplified by sparse matrices, have proved to be essential in modern computing. Numerous sparse formats have been investigated to improve the overall performance of Sparse Matrix-Vector multiply (SpMV). In this work, we instead take a fundamentally different approach: automatically building sets of regular subcomputations by mining for regular sub-regions in the irregular data structure. Our approach leads to code that is specialized to the sparsity structure of the input matrix but no longer needs any indirection array, thereby improving SIMD vectorizability. We particularly focus on small sparse structures (below 10M nonzeros) and demonstrate substantial performance improvements and compaction capabilities compared to a classical CSR implementation and Intel MKL IE's SpMV implementation, evaluating on 200+ different matrices from the SuiteSparse repository. CCS Concepts • Software and its engineering → Source code generation; • Theory of computation → Data compression; Program analysis.
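A much simpler cousin of this idea can be sketched in Python; it is an illustrative assumption, not the paper's algorithm, which mines more general regular sub-regions and generates specialized code. Here each CSR row is split into maximal runs of consecutive column indices. Inside a run the column index is affine (first + t), so one index per run replaces one index per nonzero, and the inner loop reads both `vals` and `x` contiguously, which is the access shape SIMD units favor. The names `find_runs` and `spmv_runs` are hypothetical.

```python
from typing import List, Sequence, Tuple

Run = Tuple[int, int, int, int]  # (row, first column, offset into vals, length)


def find_runs(row_ptr: Sequence[int], col_idx: Sequence[int]) -> List[Run]:
    """Split each CSR row into maximal runs of consecutive column indices."""
    runs: List[Run] = []
    for row in range(len(row_ptr) - 1):
        k, end = row_ptr[row], row_ptr[row + 1]
        while k < end:
            first, length = col_idx[k], 1
            while k + length < end and col_idx[k + length] == first + length:
                length += 1
            runs.append((row, first, k, length))
            k += length
    return runs


def spmv_runs(n_rows: int, runs: Sequence[Run], vals: Sequence[float],
              x: Sequence[float]) -> List[float]:
    """y = A x using runs: within a run, vals and x are read contiguously."""
    y = [0.0] * n_rows
    for row, first, off, length in runs:
        y[row] += sum(vals[off + t] * x[first + t] for t in range(length))
    return y


row_ptr = [0, 3, 6]
col_idx = [0, 1, 2, 0, 2, 3]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
runs = find_runs(row_ptr, col_idx)
print(runs)                                            # [(0, 0, 0, 3), (1, 0, 3, 1), (1, 2, 4, 2)]
print(spmv_runs(2, runs, vals, [1.0, 2.0, 3.0, 4.0]))  # [14.0, 43.0]
```

Matrices whose rows contain long runs benefit most; a row with scattered columns degenerates to runs of length one, which is no better than plain CSR.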
