Vector Lane Threading

Rivoire, Suzanne; Schultz, Roger A.; Okuda, Tetsuji; Kozyrakis, Christos

doi:10.1109/icpp.2006.74

Cited by 33 publications

(14 citation statements)

References 24 publications


“…Espasa and Valero 15 showed that ILP and DLP can be merged in a single simultaneous vector multithreaded architecture to execute regular vectorizable code at a performance level that cannot be achieved using either paradigm on its own. Rivoire et al 16 proposed vector lane threading (VLT) that allows idle vector lanes to run short-vector or scalar threads by partitioning the vector lanes across several threads. Krashinsky 17 proposed a vector-thread (VT) architecture, which unifies the vector and multithreaded compute models.…”

Section: Related Work (mentioning)

confidence: 99%

Simultaneous Multithreaded Matrix Processor

Soliman

Elsayed

2015

J CIRCUIT SYST COMP


This paper proposes a simultaneous multithreaded matrix processor (SMMP) to improve the performance of data-parallel applications by exploiting instruction-level parallelism (ILP), data-level parallelism (DLP) and thread-level parallelism (TLP). In SMMP, the well-known five-stage pipeline (baseline scalar processor) is extended to execute multi-scalar/vector/matrix instructions on unified parallel execution datapaths. SMMP can issue four scalar instructions from two threads each cycle or four vector/matrix operations from one thread, where the execution of vector/matrix instructions in threads is done in round-robin fashion. Moreover, this paper presents the implementation of our proposed SMMP using VHDL targeting FPGA Virtex-6. In addition, the performance of SMMP is evaluated on some kernels from the basic linear algebra subprograms (BLAS). Our results show that, the hardware complexity of SMMP is 5.68 times higher than the baseline scalar processor. However, speedups of 4.9, 6.09, 6.98, 8.2, 8.25, 8.72, 9.36, 11.84 and 21.57 are achieved on BLAS kernels of applying Givens rotation, scalar times vector plus another, vector addition, vector scaling, setting up Givens rotation, dotproduct, matrix-vector multiplication, Euclidean length, and matrix-matrix multiplications, respectively. The average speedup over the baseline is 9.55 and the average speedup over complexity is 1.68. Comparing with Xilinx MicroBlaze, the complexity of SMMP is 6.36 times higher, however, its speedup ranges from 6.87 to 12.07 on vector/matrix kernels, which is 9.46 in average.



“…Our technique is complementary to Vector Lane Threading (VLT) [24] and the Vector-Thread (VT) Architecture [14]. VLT assigns groups of lanes to different user-level threads; lanes belonging to the same user-level thread execute in SIMD, but they do not need to execute in lockstep with lanes in other groups.…”

Section: Related Work (mentioning)

confidence: 99%

Dynamic warp subdivision for integrated branch and memory divergence tolerance

Meng

Tarjan

Skadron

2010

Proceedings of the 37th Annual International Symposium on Computer Architecture



SIMD organizations amortize the area and power of fetch, decode, and issue logic across multiple processing units in order to maximize throughput for a given area and power budget. However, throughput is reduced when a set of threads operating in lockstep (a warp) are stalled due to long latency memory accesses. The resulting idle cycles are extremely costly. Multi-threading can hide latencies by interleaving the execution of multiple warps, but deep multi-threading using many warps dramatically increases the cost of the register files (multi-threading depth × SIMD width), and cache contention can make performance worse. Instead, intra-warp latency hiding should first be exploited. This allows threads that are ready but stalled by SIMD restrictions to use these idle cycles and reduces the need for multi-threading among warps. This paper introduces dynamic warp subdivision (DWS), which allows a single warp to occupy more than one slot in the scheduler without requiring extra register file space. Independent scheduling entities allow divergent branch paths to interleave their execution, and allow threads that hit to run ahead. The result is improved latency hiding and memory level parallelism (MLP). We evaluate the technique on a coherent cache hierarchy with private L1 caches and a shared L2 cache. With an area overhead of less than 1%, experiments with eight data-parallel benchmarks show our technique improves performance on average by 1.7X.


“…Vector Architectures: Vector architectures [32,10,21,31,2,4] have a distinct programming model, execution model, and workload characteristics compared to GPGPU architectures. However, the intra-warp compaction techniques proposed in this paper are similar to density time optimizations for addressing vector control flow divergence.…”

Section: Related Work (mentioning)

confidence: 99%

SIMD divergence optimization through intra-warp compaction

Vaidya

Shayesteh

Woo

et al. 2013

Proceedings of the 40th Annual International Symposium on Computer Architecture


SIMD execution units in GPUs are increasingly used for high performance and energy efficient acceleration of general purpose applications. However, SIMD control flow divergence effects can result in reduced execution efficiency in a class of GPGPU applications, classified as divergent applications. Improving SIMD efficiency, therefore, has the potential to bring significant performance and energy benefits to a wide range of such data parallel applications. Recently, the SIMD divergence problem has received increased attention, and several micro-architectural techniques have been proposed to address various aspects of this problem. However, these techniques are often quite complex and, therefore, unlikely candidates for practical implementation. In this paper, we propose two micro-architectural optimizations for GPGPU architectures, which utilize relatively simple execution cycle compression techniques when certain groups of turned-off lanes exist in the instruction stream. We refer to these optimizations as basic cycle compression (BCC) and swizzled-cycle compression (SCC), respectively. In this paper, we will outline the additional requirements for implementing these optimizations in the context of the studied GPGPU architecture. Our evaluations with divergent SIMD workloads from OpenCL (GPGPU) and OpenGL (graphics) applications show that BCC and SCC reduce execution cycles in divergent applications by as much as 42% (20% on average). For a subset of divergent workloads, the execution time is reduced by an average of 7% for today's GPUs or by 18% for future GPUs with a better provisioned memory subsystem. The key contribution of our work is in simplifying the micro-architecture for delivering divergence optimizations while providing the bulk of the benefits of more complex approaches.
