2007
DOI: 10.1007/s11227-007-0149-x

Exploring the performance limits of simultaneous multithreading for memory intensive applications

Abstract: Simultaneous multithreading (SMT) has been proposed to improve system throughput by overlapping instructions from multiple threads on a single wide-issue processor. Recent studies have demonstrated that diversity of simultaneously executed applications can bring significant performance gains due to SMT. However, the speedup of a single application that is parallelized into multiple threads is often sensitive to its inherent instruction level parallelism (ILP), as well as the efficiency of synchronization a…

Cited by 12 publications; references 24 publications. Of the citing works, 9 contribute citation statements, all classified as mentioning, and the citing publications span 2008 to 2021.

Citation statements (ordered by relevance):
“…This is quite predictable, since both threads have the same requirements for computational resources because they execute the same code. This is an inherent limitation of SMT machines and is also discussed in [3,14,16].…”
Section: Shared Memory Architectures (mentioning)
confidence: 99%
“…Prefetching helper threads [7], [8] run alongside the main application thread on an idle hardware context and speculatively prefetch data into a shared cache, following a technique known as Speculative Precomputation.…”
Section: Related Work (mentioning)
confidence: 99%
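The mechanism described in the statement above lends itself to a short sketch. The following is a minimal, hypothetical illustration rather than the cited papers' implementation: it assumes C++11 threads and the GCC/Clang __builtin_prefetch builtin, and the array workload, the AHEAD distance, and all identifiers are invented for the example. A regular streaming loop like this would largely be handled by hardware prefetchers anyway; the sketch only shows the structure of a helper thread occupying a spare hardware context and prefetching into the cache shared by both SMT contexts.

```cpp
// Minimal sketch of a prefetching helper thread (hypothetical example,
// not the cited papers' code). Assumes GCC/Clang for __builtin_prefetch.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

constexpr size_t N = 1 << 22;     // working set; sized arbitrarily here
constexpr size_t AHEAD = 256;     // how far the helper runs in front

std::vector<double> data(N, 1.0);
std::atomic<size_t> progress{0};  // index reached by the main thread
std::atomic<bool>   done{false};

// Helper thread: keeps at most AHEAD elements of lead over the main
// thread and issues non-binding prefetches into the shared cache.
void prefetcher() {
    size_t i = 0;
    while (i < N && !done.load(std::memory_order_relaxed)) {
        if (i < progress.load(std::memory_order_relaxed) + AHEAD) {
            __builtin_prefetch(&data[i], /*rw=*/0, /*locality=*/1);
            ++i;
        }
    }
}

int main() {
    std::thread helper(prefetcher);   // ideally pinned to the sibling context
    double sum = 0.0;
    for (size_t i = 0; i < N; ++i) {  // main computation thread
        sum += data[i] * data[i];
        progress.store(i, std::memory_order_relaxed);
    }
    done.store(true, std::memory_order_relaxed);
    helper.join();
    std::printf("sum = %f\n", sum);
}
```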
“…In the practical case of an SMT with two hardware threads, Helper Threading dictates that the second thread should perform some useful work that is different from the main computation thread's. The most interesting example of Helper Threading is Speculative Precomputation [5], [6], in which the helper thread precomputes memory accesses on behalf of the main computation thread, thereby attacking possible bottlenecks due to memory latency [7], [8].…”
Section: Introduction (mentioning)
confidence: 99%
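Speculative Precomputation, as described in the statement above, pays off most for irregular access patterns that hardware prefetchers miss, such as pointer chasing. Below is a minimal sketch under stated assumptions (C++11 threads, GCC/Clang __builtin_prefetch; the Node layout, the throttling distance, and all names are hypothetical): the helper thread re-executes only the address-generating slice of the main loop, the next-pointer chase, and touches each node ahead of the main computation thread.

```cpp
// Minimal sketch of Speculative Precomputation over a linked list
// (hypothetical example, not the paper's implementation). Assumes
// GCC/Clang for __builtin_prefetch.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

struct Node { long payload; Node* next; };

std::atomic<long> consumed{0};   // index reached by the main thread
std::atomic<bool> done{false};

// Helper: stay at most `distance` nodes ahead of the main thread and
// issue a non-binding prefetch for each node it visits.
void precompute(Node* head, long distance) {
    long i = 0;
    for (Node* n = head; n; n = n->next, ++i) {
        while (i > consumed.load(std::memory_order_relaxed) + distance)
            if (done.load(std::memory_order_relaxed)) return;
        __builtin_prefetch(n, /*rw=*/0, /*locality=*/1);
    }
}

// Main computation: the full loop, publishing its progress.
long consume(Node* head) {
    long sum = 0, i = 0;
    for (Node* n = head; n; n = n->next, ++i) {
        sum += n->payload;
        consumed.store(i, std::memory_order_relaxed);
    }
    done.store(true, std::memory_order_relaxed);
    return sum;
}

int main() {
    std::vector<Node> pool(1 << 20);
    for (size_t i = 0; i < pool.size(); ++i) {
        pool[i].payload = static_cast<long>(i);
        pool[i].next = (i + 1 < pool.size()) ? &pool[i + 1] : nullptr;
    }
    std::thread helper(precompute, &pool[0], /*distance=*/128);
    long sum = consume(&pool[0]);
    helper.join();
    std::printf("sum = %ld\n", sum);
}
```

Relaxed atomics suffice here because the prefetches are purely performance hints; the throttle only keeps the helper from running so far ahead that it evicts data before the main thread touches it.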
“…Most recent research on hybrid SRAM and DRAM caches focuses mainly on enhancing the overall performance of SRAM (resp., DRAM) by utilizing the merits of DRAM (resp., SRAM). There are also many papers devoted to investigating workload performance: (1) for multi-programmed workloads, prior work discussed the issues of relieving memory contention [10,11], workload balance [12,13] and power-related optimization [14]; (2) to improve the performance of memory-intensive workloads, many solutions (e.g., architecture design [15][16][17], OS-level methods [18][19][20] and feedback control [21,22]) have also been proposed; (3) in the cache system, improved cache architectures [4,9,23,24] and 3D-stacked DRAM technologies [25][26][27] are used to achieve better workload performance; and so on (a broader overview of related work is covered in Section 2). In contrast, little attention has been paid to designing a last level cache (LLC) scheduling scheme for multi-programmed workloads with different memory footprints.…”
Section: Introduction (mentioning)
confidence: 99%