Sparse matrix-vector and matrix-transpose-vector multiplication (SpMMᵀV), repeatedly performed as z ← Aᵀx and y ← Az (or y ← Aw) for the same sparse matrix A, is a kernel operation widely used in various iterative solvers. One important optimization for serial SpMMᵀV is reusing A-matrix nonzeros, which halves the memory bandwidth requirement. However, thread-level parallelization of SpMMᵀV that reuses A-matrix nonzeros necessitates concurrent writes to the same output-vector entries. These concurrent writes can be handled in two ways: via atomic updates or via thread-local temporary output vectors that subsequently undergo a reduction operation, neither of which is efficient or scalable on processors with many cores and complicated cache-coherency protocols. In this work, we identify five quality criteria for efficient and scalable thread-level parallelization of SpMMᵀV that utilizes one-dimensional (1D) matrix partitioning. We also propose two locality-aware 1D partitioning methods, which achieve reusing A-matrix nonzeros and intermediate z-vector entries; exploiting locality in accessing x-, y-, and z-vector entries; and reducing the number of concurrent writes to the same output-vector entries. These two methods utilize rowwise and columnwise singly bordered block-diagonal (SB) forms of A. We evaluate the validity of our methods on a wide range of sparse matrices. Experiments on the 60-core cache-coherent Intel Xeon Phi processor show the validity of the identified quality criteria and of the proposed methods in practice. The results also show that the performance improvement from reusing A-matrix nonzeros compensates for the overhead of concurrent writes through the proposed SB-based methods.
Index Terms: Cache locality, sparse matrix, sparse matrix-vector multiplication, matrix reordering, singly bordered block-diagonal form, Intel Many Integrated Core Architecture (Intel MIC), Intel Xeon Phi
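The nonzero-reuse optimization described above can be made concrete with a short sketch. The following C++ fragment computes z ← Aᵀx and y ← Aw in a single sweep over a CSR representation of A, loading each nonzero once and using it for both products; the struct and function names are illustrative assumptions, not the article's implementation. In a threaded setting, the scatter into z is exactly the source of the concurrent writes discussed above.

```cpp
// Minimal serial sketch of SpMM^T V with A-matrix nonzero reuse, assuming the
// y <- Aw variant (w independent of z) and a CSR representation of A.
#include <vector>

struct CsrMatrix {
    int nrows;
    std::vector<int>    rowptr;  // size nrows + 1
    std::vector<int>    colind;  // column index of each nonzero
    std::vector<double> val;     // value of each nonzero
};

// Computes z = A^T x and y = A w in one sweep over the nonzeros of A.
void spmmtv(const CsrMatrix& A,
            const std::vector<double>& x,
            const std::vector<double>& w,
            std::vector<double>& z,   // sized to ncols, zero-initialized
            std::vector<double>& y)   // sized to nrows
{
    for (int i = 0; i < A.nrows; ++i) {
        const double xi = x[i];
        double yi = 0.0;
        for (int k = A.rowptr[i]; k < A.rowptr[i + 1]; ++k) {
            const int    j = A.colind[k];
            const double a = A.val[k];   // a_ij loaded once, used twice
            z[j] += a * xi;              // scatter for z = A^T x
            yi   += a * w[j];            // gather for y = A w
        }
        y[i] = yi;
    }
}
```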
The focus of this article is the efficient parallelization of the canonical polyadic decomposition algorithm using the alternating least squares method for sparse tensors on distributed-memory architectures. We propose a hypergraph model for general medium-grain partitioning which does not enforce any topological constraint on the partitioning. The proposed model is based on splitting the given tensor into nonzero-disjoint component tensors. Then a mode-dependent coarse-grain hypergraph is constructed for each component tensor. A net amalgamation operation is proposed to form a composite medium-grain hypergraph from these mode-dependent coarse-grain hypergraphs to correctly encapsulate the minimization of the communication volume. We propose a heuristic which splits the nonzeros of dense slices to obtain sparse slices in component tensors. In this way, slice coherency is partially attained at the (sub)slice level, since partitioning is performed on (sub)slices instead of individual nonzeros. We also utilize the well-known recursive-bipartitioning framework to improve the quality of the splitting heuristic. Finally, we propose a medium-grain tripartite graph model with the aim of faster partitioning at the expense of increased total communication volume. Parallel experiments conducted on 10 real-world tensors on up to 1024 processors confirm the validity of the proposed hypergraph and graph models.
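To make the notion of nonzero-disjoint component tensors concrete, the following C++ sketch splits a 3-mode COO tensor into three component tensors, one per mode, assigning each nonzero to exactly one of them. The assignment rule used here (route each nonzero to the mode whose slice holds the fewest nonzeros) is an illustrative assumption and not the splitting heuristic proposed in the article; it only demonstrates the splitting step on which the mode-dependent coarse-grain hypergraphs are then built.

```cpp
// Illustrative split of a 3-mode COO tensor into nonzero-disjoint component
// tensors; the per-nonzero assignment rule is an assumption, not the article's.
#include <array>
#include <unordered_map>
#include <vector>

struct Nonzero { int i, j, k; double val; };

std::array<std::vector<Nonzero>, 3>
split_into_components(const std::vector<Nonzero>& tensor)
{
    // Per-mode slice sizes (slice index -> nonzero count).
    std::array<std::unordered_map<int, long>, 3> sliceSize;
    for (const Nonzero& nz : tensor) {
        ++sliceSize[0][nz.i];
        ++sliceSize[1][nz.j];
        ++sliceSize[2][nz.k];
    }

    std::array<std::vector<Nonzero>, 3> component;
    for (const Nonzero& nz : tensor) {
        const std::array<int, 3> idx = {nz.i, nz.j, nz.k};
        int best = 0;
        for (int m = 1; m < 3; ++m)
            if (sliceSize[m][idx[m]] < sliceSize[best][idx[best]])
                best = m;               // pick the mode with the sparsest slice
        component[best].push_back(nz);  // each nonzero goes to exactly one component
    }
    return component;
}
```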
Distributed asynchronous stochastic gradient descent (ASGD) algorithms that approximate low-rank matrix factorizations for collaborative filtering perform one or more synchronizations per epoch, where staleness is reduced with more synchronizations. However, a high number of synchronizations can prohibit the scalability of the algorithm. We propose a parallel ASGD algorithm, η-PASGD, for efficiently handling η synchronizations per epoch in a scalable fashion. The proposed algorithm puts an upper limit of K on η for a K-processor system, such that performing η = K synchronizations per epoch eliminates the staleness completely. The rating data used in collaborative filtering are usually represented as sparse matrices. This sparsity allows the staleness and communication overhead to be reduced combinatorially by intelligently distributing the data to processors. We analyze the staleness and the total volume incurred during an epoch of η-PASGD. Following this analysis, we propose a hypergraph partitioning model that encapsulates reducing staleness and volume while minimizing the maximum number of synchronizations required for stale-free SGD. This encapsulation is achieved with a novel cutsize metric that is realized via a new recursive-bipartitioning-based algorithm. Experiments on up to 512 processors show the importance of the proposed partitioning method in improving staleness, volume, RMSE and parallel runtime.
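As a point of reference for the discussion above, the following C++ sketch shows one epoch of the basic SGD factor update for matrix completion, divided into η sub-epochs with a synchronization point after each. The synchronize callback is a placeholder for the inter-processor factor exchange, and all names are illustrative assumptions rather than the η-PASGD implementation.

```cpp
// One epoch of plain SGD for matrix completion, split into eta sub-epochs with
// a synchronization point after each (eta = K would remove staleness entirely
// on a K-processor system). synchronize() is a placeholder for communication.
#include <algorithm>
#include <cstddef>
#include <vector>

struct Rating { int user, item; double value; };

void sgd_epoch(std::vector<std::vector<double>>& W,   // user factor rows
               std::vector<std::vector<double>>& H,   // item factor rows
               const std::vector<Rating>& ratings,
               int eta, double lr, double reg,
               void (*synchronize)())
{
    const std::size_t chunk = (ratings.size() + eta - 1) / eta;
    for (int s = 0; s < eta; ++s) {
        const std::size_t begin = s * chunk;
        const std::size_t end   = std::min(ratings.size(), begin + chunk);
        for (std::size_t t = begin; t < end; ++t) {
            const Rating& r = ratings[t];
            std::vector<double>& w = W[r.user];
            std::vector<double>& h = H[r.item];
            double pred = 0.0;
            for (std::size_t f = 0; f < w.size(); ++f) pred += w[f] * h[f];
            const double err = r.value - pred;
            for (std::size_t f = 0; f < w.size(); ++f) {
                const double wf = w[f];
                w[f] += lr * (err * h[f] - reg * wf);
                h[f] += lr * (err * wf   - reg * h[f]);
            }
        }
        synchronize();  // one of the eta synchronizations in this epoch
    }
}
```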
Stratified SGD (SSGD) is the primary approach for achieving serializable parallel SGD for matrix completion. State-of-the-art parallelizations of SSGD fail to scale due to their large communication overhead: during an SGD epoch, these methods send data proportional to one of the dimensions of the rating matrix. We propose a framework for scalable SSGD that significantly reduces the communication overhead by exchanging point-to-point messages, exploiting the sparsity of the rating matrix. We provide formulas that represent the essential communication for correctly performing parallel SSGD, together with a dynamic programming algorithm for efficiently computing them to establish the point-to-point message schedules. This scheme, however, significantly increases the number of messages sent by a processor per epoch from O(K) to O(K²) for a K-processor system, which might limit scalability. To remedy this, we propose a Hold-and-Combine strategy that limits the upper bound on the number of messages sent per processor to O(K lg K). We also propose a hypergraph partitioning model that correctly encapsulates reducing the communication volume. Experimental results show that the framework successfully achieves scalable distributed SSGD by significantly reducing the communication overhead. Our code is publicly available at github.com/nfabubaker/CESSGD.
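For readers unfamiliar with stratification, the small C++ sketch below prints the standard stratum schedule that makes parallel SSGD serializable: in sub-epoch s of a K-processor epoch, processor p updates the block in row stripe p and column stripe (p + s) mod K, so the K active blocks never share a row or a column. This is the textbook schedule shown only to fix notation; the point-to-point communication scheme and the Hold-and-Combine strategy described above are the article's contribution.

```cpp
// Prints the row-disjoint, column-disjoint block schedule of one SSGD epoch.
#include <cstdio>

int main() {
    const int K = 4;                         // number of processors (illustrative)
    for (int s = 0; s < K; ++s) {            // K sub-epochs form one epoch
        std::printf("sub-epoch %d:", s);
        for (int p = 0; p < K; ++p)
            std::printf("  P%d -> block(%d,%d)", p, p, (p + s) % K);
        std::printf("\n");
    }
    return 0;
}
```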
Several successful partitioning models and methods have been proposed and used for the computational load balancing of irregularly sparse applications in a distributed-memory setting. However, the literature lacks partitioning models and methods that encode both computational and data load balancing. In this article, we try to close this gap in the literature by proposing two hypergraph partitioning (HP) models which simultaneously encode computational and data load balancing. Both models utilize a two-constraint formulation, where the first constraint encodes the computational load and the second constraint encodes the data load. In the first model, we introduce explicit data vertices for encoding data load and we replicate those data vertices at each recursive bipartitioning (RB) step for encoding data replication. In the second model, we introduce a data weight distribution scheme for encoding data load and we update those weights at each RB step. A nice property of both proposed models is that they do not necessitate developing a new partitioner from scratch: both models can easily be implemented by invoking any HP tool that supports multi-constraint partitioning as a two-way partitioner at each RB step. The validity of the proposed models is tested on two widely used irregularly sparse applications, parallel mesh simulations and parallel sparse matrix-sparse matrix multiplication (SpGEMM). Both proposed models achieve significant improvement over a baseline model.
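The RB-based formulation shared by both models can be outlined with a short sketch. In the C++ fragment below, every vertex carries a two-element weight (computational load, data load), and each RB step invokes a two-constraint two-way partitioner; bipartition_two_constraint is a stand-in stub for a real multi-constraint HP tool, its signature is an assumption, and the model-specific steps (replicating data vertices or redistributing data weights between the two halves) are indicated only as comments.

```cpp
#include <cstddef>
#include <vector>

struct Vertex {
    long computeWeight;  // first constraint: computational load
    long dataWeight;     // second constraint: data load
};

// Stand-in stub for a real multi-constraint HP tool used as a two-way
// partitioner: it greedily balances the combined weight of the two parts and
// ignores the cut entirely (a real tool would also minimize the cutsize).
std::vector<int> bipartition_two_constraint(const std::vector<Vertex>& verts) {
    std::vector<int> side(verts.size());
    long load[2] = {0, 0};
    for (std::size_t v = 0; v < verts.size(); ++v) {
        const int p = (load[0] <= load[1]) ? 0 : 1;
        side[v] = p;
        load[p] += verts[v].computeWeight + verts[v].dataWeight;
    }
    return side;
}

// RB driver: recursively bipartitions the vertex set into numParts parts,
// writing the final part id of each global vertex id into parts.
void recursive_bipartition(const std::vector<Vertex>& verts,
                           const std::vector<int>& ids,   // global vertex ids
                           int firstPart, int numParts,
                           std::vector<int>& parts)
{
    if (numParts == 1) {
        for (int id : ids) parts[id] = firstPart;
        return;
    }
    const std::vector<int> side = bipartition_two_constraint(verts);

    std::vector<Vertex> leftV, rightV;
    std::vector<int>    leftI, rightI;
    for (std::size_t v = 0; v < verts.size(); ++v) {
        if (side[v] == 0) { leftV.push_back(verts[v]);  leftI.push_back(ids[v]); }
        else              { rightV.push_back(verts[v]); rightI.push_back(ids[v]); }
    }
    // Model 1 would replicate the data vertices into both halves here;
    // Model 2 would instead update the data-weight distribution of each half.
    recursive_bipartition(leftV,  leftI,  firstPart,                numParts / 2,            parts);
    recursive_bipartition(rightV, rightI, firstPart + numParts / 2, numParts - numParts / 2, parts);
}
```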