Nikos Anastopoulos scite author profile

In this paper we revisit the performance issues of the widely used sparse matrix-vector multiplication (SpMxV) kernel on modern microarchitectures. Previous scientific work reports a number of different factors that may significantly reduce performance. However, the interaction of these factors with the underlying architectural characteristics is not clearly understood, which may lead to misguided, and thus unsuccessful, attempts at optimization. To gain insight into the details of SpMxV performance, we conduct a suite of experiments on a rich set of matrices for three different commodity hardware platforms. Based on our experiments we extract useful conclusions that can serve as guidelines for the subsequent optimization process of the kernel.


Performance evaluation of the sparse matrix-vector multiplication on modern architectures

Goumas

Kourtis

Anastopoulos

et al. 2008

J Supercomput


In this paper, we revisit the performance issues of the widely used sparse matrix-vector multiplication (SpMxV) kernel on modern microarchitectures. Previous scientific work reports a number of different factors that may significantly reduce performance. However, the interaction of these factors with the underlying architectural characteristics is not clearly understood, which may lead to misguided, and thus unsuccessful, attempts at optimization. To gain insight into the details of SpMxV performance, we conduct a suite of experiments on a rich set of matrices for three different commodity hardware platforms. In addition, we investigate the parallel version of the kernel and report on the corresponding performance results and their relation to each architecture's specific multithreaded configuration. Based on our experiments, we extract useful conclusions that can serve as guidelines for the optimization process of both single-threaded and multithreaded versions of the kernel.
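For readers unfamiliar with the kernel, the following is a minimal serial sketch of SpMxV, y = A·x. The abstract does not say which storage format the authors use; the Compressed Sparse Row (CSR) layout here is an assumption, and the names `spmv_csr`, `row_ptr`, `col_idx`, and `values` are illustrative rather than taken from the paper.

```python
from typing import List


def spmv_csr(row_ptr: List[int], col_idx: List[int],
             values: List[float], x: List[float]) -> List[float]:
    """Compute y = A * x for a sparse matrix A stored in CSR form (assumed format).

    row_ptr[i] .. row_ptr[i + 1] delimits the nonzeros of row i inside
    col_idx (column indices) and values (nonzero entries).
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect access to x through col_idx: the irregular memory
            # reference pattern that characterizes this kernel.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Illustrative 3x3 matrix:
# [[1, 0, 2],
#  [0, 3, 0],
#  [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]
y = spmv_csr(row_ptr, col_idx, values, x)
```

Because each row of the result is computed independently, the outer loop over rows is a natural unit for distributing work across threads in a multithreaded version; how the authors partition the work is not stated in the abstract.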


Facilitating efficient synchronization of asymmetric threads on hyper-threaded processors

Anastopoulos

Koziris

2008


Employing Transactional Memory and Helper Threads to Speedup Dijkstra's Algorithm

Nikas

Anastopoulos

Goumas

et al. 2009


In this paper we work on the parallelization of the inherently serial Dijkstra's algorithm on modern multicore platforms. Dijkstra's algorithm is a greedy algorithm that computes Single Source Shortest Paths for graphs with non-negative edge weights and is based on the iterative extraction of nodes from a priority queue. This property limits the explicit parallelism of the algorithm, and any attempt to exploit the remaining parallelism results in significant slowdowns due to synchronization overheads. To deal with these problems, we employ the concept of Helper Threads (HT) to extract parallelism in a non-traditional fashion and Transactional Memory (TM) to efficiently orchestrate the concurrent threads' accesses to shared data structures. Results demonstrate that the proposed implementation is able to achieve performance speedups (reaching up to 1.84 for 14 threads), indicating that the two paradigms could be efficiently combined.
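As a point of reference, the serial algorithm the abstract describes can be sketched as follows. This is only the single-threaded baseline, not the Helper Thread and Transactional Memory scheme of the paper. The function name `dijkstra` and the adjacency-list representation are assumptions, and the sketch skips stale heap entries instead of using a decrease-key operation, which may differ from the authors' priority queue.

```python
import heapq
from typing import Dict, Hashable, List, Tuple

Graph = Dict[Hashable, List[Tuple[Hashable, float]]]


def dijkstra(adj: Graph, source: Hashable) -> Dict[Hashable, float]:
    """Single-source shortest paths for non-negative edge weights."""
    dist: Dict[Hashable, float] = {source: 0}
    heap: List[Tuple[float, Hashable]] = [(0, source)]
    settled = set()
    while heap:
        # Each iteration extracts the closest unsettled node. The next
        # extraction depends on the relaxations done here, which is the
        # serial dependency the abstract refers to.
        d, u = heapq.heappop(heap)
        if u in settled:
            continue  # stale entry left behind by an earlier relaxation
        settled.add(u)
        for v, w in adj.get(u, []):
            nd = d + w
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Illustrative graph (hypothetical data).
adj = {
    "a": [("b", 4), ("c", 1)],
    "b": [("d", 1)],
    "c": [("b", 2), ("d", 7)],
}
dist = dijkstra(adj, "a")
```

The priority queue and the distance map are the shared data structures that concurrent threads would contend for in a parallel version.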


Exploring the performance limits of simultaneous multithreading for memory intensive applications

et al. 2007


Simultaneous multithreading (SMT) has been proposed to improve system throughput by overlapping instructions from multiple threads on a single wide-issue processor. Recent studies have demonstrated that diversity among simultaneously executed applications can bring significant performance gains due to SMT. However, the speedup of a single application that is parallelized into multiple threads is often sensitive to its inherent instruction-level parallelism (ILP), as well as to the efficiency of synchronization and communication mechanisms between its separate, but possibly dependent, threads. Moreover, as these separate threads tend to put pressure on the same architectural resources, no significant speedup can be observed. In this paper, we evaluate and contrast thread-level parallelism (TLP) and speculative precomputation (SPR) techniques for a series of memory-intensive codes executed on a specific SMT processor implementation. We explore the performance limits by evaluating the tradeoffs between ILP and TLP for various kinds of instruction streams. By obtaining knowledge of how such streams interact when executed simultaneously on the processor, and quantifying their presence within each application's threads, we try to interpret the observed performance for each application when parallelized according to the aforementioned techniques. To support this evaluation, we also present results gathered from the performance monitoring hardware of the processor.


