Clustered speculative multithreaded processors

Marcuello, Pedro; González, Antonio

doi:10.1145/305138.305214

Cited by 149 publications

(134 citation statements)

References 38 publications

Supporting

Mentioning

132

Contrasting

Unclassified


“…A speculative thread in this model is identified by a SP-CQIP pair [14], where SP stands for the Spawning Point, i.e. the instruction in the execution stream where the speculative thread's execution is triggered.…”

Section: A Speculative Threads (mentioning)

confidence: 99%

“…the instruction from which the speculative thread begins execution. The choice of these pairs strongly affects the performance achieved by the system [14].…”

Section: A Speculative Threads (mentioning)

confidence: 99%

“…Previous proposals on speculative multithreading [2] [14][18] [20][21] [29] mainly differ on how speculative threads are selected and the way inter-thread data dependences are managed. The speculative threads can be generated by the compiler [12] or detected at run-time [16].…”

Section: Introduction (mentioning)

confidence: 99%
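
The SP-CQIP pair quoted in the statements above can be pictured with a small sketch. This is only an illustration under stated assumptions, not the mechanism of the cited paper: the names `SpawnPair` and `spawn_events` are hypothetical, program counters are plain integers, and the sketch relies only on what the quotes say (the SP is the instruction that triggers the spawn, and the CQIP is the instruction from which the speculative thread begins execution).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpawnPair:
    """Hypothetical record for one SP-CQIP pair (names are illustrative)."""
    sp: int    # spawning point: reaching this PC triggers a speculative thread
    cqip: int  # PC at which the spawned speculative thread starts executing


def spawn_events(trace, pairs):
    """Scan a dynamic PC trace and report every spawn it would trigger.

    Returns a list of (trace_index, start_pc) tuples: each time the trace
    reaches an SP, a speculative thread would be started at that pair's CQIP.
    """
    by_sp = {pair.sp: pair for pair in pairs}
    return [(i, by_sp[pc].cqip) for i, pc in enumerate(trace) if pc in by_sp]


# Usage: one pair, and a short trace that reaches the SP once.
pairs = [SpawnPair(sp=0x14, cqip=0x40)]
trace = [0x10, 0x14, 0x18, 0x40, 0x44]
events = spawn_events(trace, pairs)
```

Which pairs are placed in `pairs` is exactly the selection problem the quoted statement says strongly affects performance; the sketch takes them as given.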


P-slice based efficient speculative multithreading

Ranjan

Marcuello

Latorre

et al. 2009

2009 International Conference on High Performance Computing (HiPC)

Self Cite


Abstract: The microprocessor industry has recently shifted towards multi-core designs to take advantage of the ever-increasing number of transistors provided by new technologies. Unfortunately, the multi-core approach does not allow single-threaded applications to benefit from the additional cores to improve their execution time. Speculative multithreading (SpMT) has been proposed in the past to boost the performance of irregular applications in multi-core environments. In this work, we study the main bottlenecks of these architectures, such as the memory behavior and the pre-computation slices, and propose two novel schemes that allow SpMT to achieve a 25% average speedup over single-threaded execution. We propose Selective Replication as a technique to improve the performance of the SpMT memory system. This technique does not introduce additional traffic on the bus and improves the performance of a conventional SpMT memory model by 6% on average and by up to 21% for some applications. We also propose a scheme called Slice Specialization that reduces the number of instructions in the pre-computation slices by adapting the slice to every single speculative thread spawned. The latter proposal outperforms previous slice-based schemes by 15%, and overall, both techniques combined achieve an improvement of 20% over a conventional SpMT processor.


“…is difficult because of pointer aliasing, irregular array accesses, and complex control flow. Thread-level speculation (TLS) [3,6,9,11,16,22,24,26] facilitates the parallelization of such applications by allowing potentially dependent threads to execute in parallel while maintaining the original sequential semantics of the programs through runtime checking. Although researchers have proposed numerous techniques for providing the proper hardware [17,18,23,25] and compiler [27][28][29] support for improving the efficiency of TLS, how to provide adequate compiler support for decomposing sequential programs into parallel threads that can deliver the desired performance has not yet been explored with the proper depth.…”

Section: Introduction (mentioning)

confidence: 99%

Loop Selection for Thread-Level Speculation

Wang

Dai

Yellajyosula

et al. 2006

Languages and Compilers for Parallel Computing


Abstract. Thread-level speculation (TLS) allows potentially dependent threads to speculatively execute in parallel, thus making it easier for the compiler to extract parallel threads. However, the high cost associated with unbalanced load, failed speculation, and inter-thread value communication makes it difficult to obtain the desired performance unless the speculative threads are carefully chosen. In this paper, we focus on extracting parallel threads from loops in general-purpose applications because loops, with their regular structures and significant coverage of execution time, are ideal candidates for extracting parallel threads. General-purpose applications, however, usually contain a large number of nested loops with unpredictable parallel performance and dynamic behavior, thus making it difficult to decide which set of loops should be parallelized to improve overall program performance. Our proposed loop selection algorithm addresses all these difficulties. We have found that (i) with the aid of profiling information, compiler analyses can achieve a reasonably accurate estimation of the performance of parallel execution, and that (ii) different invocations of a loop may behave differently, and exploiting this dynamic behavior can further improve performance. With a judicious choice of loops, we can improve the overall program performance of SPEC2000 integer benchmarks by as much as 20%.
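
The TLS idea described above (run potentially dependent threads in parallel, then use runtime checks to preserve sequential semantics) can be sketched in software. The following is a minimal simulation under stated assumptions, not the hardware or compiler scheme of any cited paper: the names `execute` and `run_tls` are hypothetical, memory is a plain dict, every iteration first runs against the same snapshot to stand in for parallel execution, and a conservative check squashes and re-executes any iteration that read an address written by an earlier one.

```python
def execute(body, source):
    """Run one loop body, logging its reads and buffering its writes."""
    reads, writes = set(), {}

    def read(addr):
        if addr in writes:          # a thread sees its own buffered writes
            return writes[addr]
        reads.add(addr)             # value came from outside: record it
        return source.get(addr, 0)

    def write(addr, value):
        writes[addr] = value        # buffered until commit

    body(read, write)
    return reads, writes


def run_tls(bodies, memory):
    """Speculatively run all bodies, then commit them in program order.

    Returns the number of squashed (re-executed) iterations; `memory` ends
    up with the same contents as a sequential run of `bodies`.
    """
    snapshot = dict(memory)
    logs = [execute(body, snapshot) for body in bodies]  # "parallel" phase

    committed = set()   # addresses written by already-committed iterations
    squashed = 0
    for body, (reads, writes) in zip(bodies, logs):
        if reads & committed:       # may have read a stale value: violation
            squashed += 1
            _, writes = execute(body, memory)  # re-run on up-to-date state
        memory.update(writes)
        committed.update(writes)
    return squashed


# Usage: the second iteration depends on the first, the third is independent.
bodies = [
    lambda read, write: write("x", read("x") + 1),
    lambda read, write: write("y", read("x") * 10),
    lambda read, write: write("z", 7),
]
memory = {"x": 1}
squashed = run_tls(bodies, memory)
```

In this toy run only the dependent iteration is squashed, which mirrors the abstract's point that failed speculation has a cost and that the choice of speculative threads matters.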


“…rePlay [21] does perform DBO on short atomic traces (16 to 256 instructions long), but they are not suitable for parallelization purposes. Before the many-core era, some systems were proposed [22,23,24,25] to use hardware-only technologies to speculate multiple consecutive atomic traces and execute them simultaneously on different functional units. In order to achieve reasonable speculation accuracy, however, these systems construct very short traces, which necessitates ultra-low communication latency to support program state transfer.…”

Section: Trace Construction and Prediction (mentioning)

confidence: 99%

Trace-Based Dynamic Binary Parallelization

Yang


With the number of cores increasing rapidly but the performance per core increasing slowly at best, software must be parallelized in order to improve performance. Manual parallelization is often prohibitively time-consuming and error-prone (especially due to data races and memory-consistency complexities), and some portions of code may simply be too difficult to understand or refactor for parallelization. Most existing automatic parallelization techniques are performed statically at compile time and require source code to be analyzed, leaving a large fraction of software behind. In many cases, some or all of the source code and development tool chain is lost or, in the case of third-party software, was never available. Furthermore, modern applications are assembled and defined at run time, making use of shared libraries, virtual functions, plugins, dynamically-generated code, and other dynamic mechanisms, as well as multiple languages. All these aspects of separate compilation prevent the compiler from obtaining a holistic view of the program, leading to the risk of incompatible parallelization techniques, subtle data races, and resource over-subscription. All the above considerations motivate dynamic binary parallelization (DBP). This dissertation explores the novel idea of trace-based DBP, which provides a large instruction window without introducing spurious dependencies. We hypothesize that traces provide a generally good trade-off between code visibility and analysis accuracy for a wide variety of applications so as to achieve better parallel performance. Compared to the raw dynamic instruction stream (DIS), traces expose more distant parallelism opportunities because their average length is typically much larger than the size of the hardware instruction window. Compared to the complete control flow graph (CFG), traces only contain control and data dependencies on the execution path which is actually taken. More importantly, while DIS-based DBP typically only exploits fine-grained parallelism and CFG-based DBP typically only exploits coarse-grained parallelism, traces can be used as a unified representation of program execution to seamlessly incorporate the exploitation of both coarse- and fine-grained parallelism. We develop Tracy, an innovative DBP framework which monitors a program at run time and dynamically identifies hot traces, parallelizes them, and caches them for later use so that the program can run in parallel every time a hot trace repeats. Our experimental results have demonstrated that for floating point benchmarks, Tracy can achieve an average speedup of 2.16x, 1.51x better than the speedup achieved by Core Fusion, one representative of DIS-based DBP techniques. Although the average speedup achieved by Tracy is only 1.04x better than the speedup achieved by CFG-based DBP, Tracy can speed up all floating point benchmarks while CFG-based DBP fails to parallelize three out of eight applications at all. The performance of Tracy is not always better than the performance of exist...
