Characterization of simultaneous multithreading (SMT) efficiency in POWER5

Mathis, H. M.; Mericas, Alex E.; McCalpin, John D.; Eickemeyer, Richard J.; Kunkel, S.

doi:10.1147/rd.494.0555

Cited by 19 publications

(12 citation statements)

References 16 publications


“…Mathis et al [4] evaluate and analyze the effect of SMT2 on the POWER5 CPU with single-threaded applications. To measure the SMT2 gain of an application, they simply run one copy of the application per available hardware thread/context with and without SMT.…”

Section: Related Work (mentioning)

confidence: 99%
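The measurement procedure described in the statement above (one copy of the application per hardware thread, run with and without SMT) can be sketched as a small calculation. This is a minimal illustration, not the method as published: the function names are hypothetical, and defining the gain as the relative change in aggregate throughput is an assumption, since the statement does not give the exact formula.

```python
def aggregate_throughput(copies: int, elapsed_seconds: float) -> float:
    """Copies of the application completed per second (assumed throughput measure)."""
    if copies <= 0 or elapsed_seconds <= 0:
        raise ValueError("copies and elapsed_seconds must be positive")
    return copies / elapsed_seconds


def smt_gain(throughput_smt_off: float, throughput_smt_on: float) -> float:
    """Relative change in aggregate throughput when SMT is enabled.

    A positive value means SMT helped; a negative value means it hurt.
    This definition is an assumption made for illustration.
    """
    if throughput_smt_off <= 0:
        raise ValueError("baseline throughput must be positive")
    return throughput_smt_on / throughput_smt_off - 1.0


# Hypothetical numbers: a 2-core machine runs 2 copies in 100 s with SMT off,
# and 4 copies (one per hardware thread) in 160 s with SMT2 on.
off = aggregate_throughput(copies=2, elapsed_seconds=100.0)
on = aggregate_throughput(copies=4, elapsed_seconds=160.0)
gain = smt_gain(off, on)  # roughly 0.25, i.e. about a 25% gain
```

Each copy runs more slowly with SMT on in this example (160 s instead of 100 s), yet the machine completes more work per second overall, which is the effect such a measurement is meant to capture.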

“…Several studies have shown that SMT does not always improve the performance of applications [3], [4], [5]. The performance gains from SMT vary depending on a number of factors: The scalability of the workload, the CPU resources used by the workload, the instruction mix of the workload, the cache footprint of the workload, the degree of sharing among the software threads, etc.…”

Section: Introduction (mentioning)

confidence: 99%


An SMT-Selection Metric to Improve Multithreaded Applications' Performance

Funston, Maghraoui, Jann, et al. 2012

2012 IEEE 26th International Parallel and Distributed Processing Symposium


Abstract: Simultaneous multithreading (SMT) increases CPU utilization and application performance in many circumstances, but it can be detrimental when performance is limited by application scalability or when there is significant contention for CPU resources. This paper describes an SMT-selection metric that predicts the change in application performance when the SMT level and number of application threads are varied. This metric is obtained online through hardware performance counters with little overhead, and allows the application or operating system to dynamically choose the best SMT level. We have validated the SMT-selection metric using a variety of benchmarks that capture various application characteristics on two different processor architectures. Our results show that the SMT-selection metric is capable of predicting the best SMT level for a given workload in 90% of the cases. The paper also shows that such a metric can be used with a scheduler or application optimizer to help guide its optimization decisions.


“…SMT also suffers from the problem of interference between threads. This interference necessitates increasing the size of structures like the physical register file, the data cache and reorder buffer as well as increasing the width of the superscalar processor to provide performance and power characteristics that are commensurate with the hardware overheads of SMT [16,17].…”

Section: Comparison With SMT (mentioning)

confidence: 99%

Multiplexed redundant execution: A technique for efficient fault tolerance in chip multiprocessors

Subramanyan, Singh, Saluja, et al. 2010

2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010)


Abstract: Continued CMOS scaling is expected to make future microprocessors susceptible to transient faults, hard faults, manufacturing defects and process variations causing fault tolerance to become important even for general purpose processors targeted at the commodity market. To mitigate the effect of decreased reliability, a number of fault-tolerant architectures have been proposed that exploit the natural coarse-grained redundancy available in chip multiprocessors (CMPs). These architectures execute a single application using two threads, typically as one leading thread and one trailing thread. Errors are detected by comparing the outputs produced by these two threads. These architectures schedule a single application on two cores or two thread contexts of a CMP. As a result, besides the additional energy consumption and performance overhead that is required to provide fault tolerance, such schemes also impose a throughput loss. Consequently a CMP which is capable of executing 2n threads in non-redundant mode can only execute half as many (n) threads in fault-tolerant mode. In this paper we propose multiplexed redundant execution (MRE), a low-overhead architectural technique that executes multiple trailing threads on a single processor core. MRE exploits the observation that it is possible to accelerate the execution of the trailing thread by providing execution assistance from the leading thread. Execution assistance combined with coarse-grained multithreading allows MRE to schedule multiple trailing threads concurrently on a single core with only a small performance penalty. Our results show that MRE increases the throughput of fault-tolerant CMP by 16% over an ideal dual modular redundant (DMR) architecture.


“…Entire program tuning will be more complicated, but we believe algorithm or component-level tuning in the style we describe will be a useful starting point. Secondly, we choose to characterize the overall process by "level of practitioner," where the analysis and optimization techniques that require the least expertise are likely to be the simplest to generalize and to apply to other programs; and, more importantly, the easiest to automate and to incorporate into existing performance analysis tools [13], [15], [24]-[31]. Table I provides a summary of our evaluation architectures.…”

Section: Introduction (mentioning)

confidence: 99%

Diagnosis, Tuning, and Redesign for Multicore Performance: A Case Study of the Fast Multipole Method

Chandramowlishwaran, Madduri, Vuduc 2010

2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis


Abstract: Given a program and a multisocket, multicore system, what is the process by which one understands and improves its performance and scalability? We describe an approach in the context of improving within-node scalability of the fast multipole method (FMM). Our process consists of a systematic sequence of modeling, analysis, and tuning steps, beginning with simple models, and gradually increasing their complexity in the quest for deeper performance understanding and better scalability. For the FMM, we significantly improve within-node scalability; for example, on a quad-socket Intel Nehalem-EX system, we show speedups of 1.7× over the previous best multithreaded implementation, 19.3× over a sequential but highly tuned (e.g., SIMD-vectorized) code, and match or outperform a state-of-the-art GPGPU implementation. Our study sheds new light on the form of a more general performance analysis and tuning process that other multicore/manycore tuning practitioners (end-user programmers) and automated performance analysis and tuning tools could themselves apply.
