Performance of multi-process and multi-thread processing on multi-core SMT processors

Inoue, Hiroshi; Nakatani, Toshio

doi:10.1109/iiswc.2010.5650174

Cited by 9 publications

(4 citation statements)

References 11 publications

“…Although the widely available multi-core processors are mainly based on a cache-coherent NUMA design, they differ in the way they have implemented multi-threading to exploit instruction-level and thread-level parallelism. These differences are not only in the size and speed of the cache, but also in the number of threads that can share resources simultaneously [7], the memory-controller mechanism and the inter-processor connector designs that are employed on and off the chip [8]. After selecting the language, the subsequent selection of a virtual machine and/or operating system, from a wide range of options, increases the complexity of the problem even further.…”

Section: Introduction (mentioning)

confidence: 99%

A Study of the Behavior of Synchronization Methods in Commonly Used Languages and Systems

Cederman, Chatterjee, Nguyen, et al. 2013

2013 IEEE 27th International Symposium on Parallel and Distributed Processing

Synchronization is a central issue in concurrency and plays an important role in the behavior and performance of modern programmes. Programming languages and hardware designers are trying to provide synchronization constructs and primitives that can handle concurrency and synchronization issues efficiently. Programmers have to find a way to select the most appropriate constructs and primitives in order to gain the desired behavior and performance under concurrency. Several parameters and factors affect the choice, through complex interactions among (i) the language and the language constructs that it supports, (ii) the system architecture, (iii) possible run-time environments, virtual machine options and memory management support and (iv) applications. We present a systematic study of synchronization strategies, focusing on concurrent data structures. We have chosen concurrent data structures with different number of contention spots. We consider both coarse-grain and fine-grain locking strategies, as well as lock-free methods. We have investigated synchronization-aware implementations in C++, C# (.NET and Mono) and Java. Considering the machine architectures, we have studied the behavior of the implementations on both Intel's Nehalem and AMD's Bulldozer. The properties that we study are throughput and fairness under different workloads and multiprogramming execution environments. For NUMA architectures fairness is becoming as important as the typically considered throughput property. To the best of our knowledge this is the first systematic and comprehensive study of synchronization-aware implementations. This paper takes steps towards capturing a number of guiding principles and concerns for the selection of the programming environment and synchronization methods in connection to the application and the system characteristics.

“…On the other hand, the type of call functions in graphic card, GPU synchronization with CPU when running is completed, functions which are used to determine time, version and technical specification of graphic card, and number of experiments are important factors, which influence on comparison between running time of CPU and GPU. Unfortunately, many papers haven't mentioned impacting factors on running time of CPU and GPU [17][18][19].…”

Section: Related Work (mentioning)

confidence: 99%

Optimizing Raytracing Algorithm Using CUDA

Razian, MahvashMohammadi 2017

IJSE

Now, there are many codes to generate images using raytracing algorithm, which can run on CPU or GPU in single or multi-thread methods. In this paper, an optimized algorithm has been designed to generate image using raytracing algorithm to run on CPU or GPU in multi-thread algorithm. This algorithm employs light with depth of 8 to generate images. It is optimized by changing pixel travel priority and ray of light to thread, dedicating depth function to empty threads, and using optimized functions from MSDN library. Its code has been written in C++ and CUDA. In addition, we do the following to show its performance: comparing implementation in different compiler mode, changing thread number, examining different resolution, and investigating data bandwidth. The results show that one can generate at least 11 frames per second in HD (720p) resolution by GPU processor and GT 840M graphic card, using trace method. If better graphic card employ, this algorithm and program can be used to generate real-time animation.

“…Chandramowlishwaran, Madduri and Vuduc [10] describe their effort to characterize and tune a fast multipole method (FMM) application on several CMP/CMT systems (Intel Harpertown, AMD Barcelona, and Intel Nehalem). Inoue and Nakatani [11] study the effect of multi-process and multithread programming model on CMP/CMT processor for Java benchmarks and PHP applications. The authors conduct their experiments on Sun Niagara T1 (8 cores, 4 way SMT) and on Intel Nehalem (4 cores, 2 way SMT) and conclude that, while core scalability for the two programming models is comparable, SMT scalability is higher for the multi-thread programming model mainly because of the difference in the number of data TLB misses.…”

Section: Related Work (mentioning)

confidence: 99%

Evaluating Performance and Power Efficiency of Scientific Applications on Multi-threaded Systems

Gioiosa, Kerbyson, Hoisie 2014

2014 Energy Efficient Supercomputing Workshop

The power and energy walls are changing the way users utilize supercomputers: Time-to-completion is not the only important goal but other metrics, such as the energy required to solve a problem or the power efficiency, are becoming as important as performance. This shift towards power- and energy-aware computing is expected to continue in the exascale era, thus, understanding the performance, power and energy implications of different hardware configurations is of paramount importance.
