Sparse Matrix-Vector Multiplication (SpMV) kernel dominates the computing cost in numerous scientific applications. Many implementations based on different sparse formats were proposed recently for this kernel on the GPU side. Since the performance of these sparse formats varies significantly according to the sparsity characteristics of the input matrix and the hardware specifications, no one of them can be considered as the best one to use for every sparse matrix. In this paper, we address the problem of selecting the best representation for a given sparse matrix on GPU by using a machine learning approach. First, we present some interesting and easy to compute features for characterizing the sparse matrices on GPU. Second, we use a multiclass Support Vector Machine (SVM) classifier to select the best format for each input matrix. We consider in this paper four popular formats (COO, CSR, ELL, and HYB), but our work can be extended to support more sparse representations. Experimental results on two different GPUs (Fermi GTX 580 and Maxwell GTX 980 Ti) show that we achieved more than 98% of the performance possible with a perfect selection.
Abstract. Loop scheduling scheme plays a critical role in the efficient execution of programs, especially loop dominated applications. This paper presents KASS, a knowledge-based adaptive loop scheduling scheme. KASS consists of two phases: static partitioning and dynamic scheduling. To balance the workload, the knowledge of loop features and the capabilities of processors are both taken into account using a heuristic approach in static partitioning phase. In dynamic scheduling phase, an adaptive self-scheduling algorithm is applied, in which two tuning parameters are set to control chunk sizes, aiming at load balancing and minimizing synchronization overhead. In addition, we extend KASS to apply on loop nests and adjust the chunk sizes at runtime. The experimental results show that KASS performs 4.8% to 16.9% better than the existing self-scheduling schemes, and up to 21% better than the affinity scheduling scheme.
Determinism is an appealing property for parallel programs, as it simplifies understanding, reasoning and debugging. It is particularly appealing in dynamic (scripting) languages, where ease of programming is a dominant design goal. Some existing parallel languages use the type system to enforce determinism statically, but this is not generally practical for dynamic languages. In this paper, we describe how determinism can be obtained-and dynamically enforced/verified-for appropriate extensions to a parallel scripting language. Specifically, we introduce the constructs of Determinis-tic Parallel Ruby (DPR), together with a run-time system (TARDIS) that verifies properties required for determinism, including correct usage of reductions and commutative operators, and the mutual independence (data-race freedom) of concurrent tasks. Experimental results confirm that DPR can provide scalable performance on mul-ticore machines and that the overhead of TARDIS is low enough for practical testing. In particular, TARDIS significantly outperforms alternative data-race detectors with comparable functionality. We conclude with a discussion of future directions in the dynamic enforcement of determinism.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.