Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors
DOI: 10.1109/iccd.2002.1106794

Data Cache Design Considerations for the Itanium® 2 Processor

Cited by 16 publications (8 citation statements)
References 7 publications
“…On the Itanium 2 processor, the memory subsystem is decoupled from the execution pipeline and can reorder requests based on a relaxed memory ordering model [12]. At least 48 outstanding requests can be active throughout the memory hierarchy without stalling the execution pipeline [16]. It is therefore highly beneficial for the code generator to increase memory-level parallelism by clustering loads in the schedule, which means issuing several load requests in parallel before the first use of a load.…”
Section: Theory of Latency-Tolerant Software Pipelining
confidence: 99%
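As an illustrative sketch of the load clustering this statement describes (plain C, not taken from the cited paper; the function and its shape are assumptions): the loads for several iterations are issued together before any of their results are used, so their latencies can overlap in the memory subsystem.

/* Load clustering sketch: four loads are issued back to back
 * before the first use, giving the memory subsystem several
 * independent requests to overlap (memory-level parallelism). */
#include <stddef.h>

float sum_clustered(const float *a, size_t n)
{
    float sum = 0.0f;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        float x0 = a[i];        /* loads clustered here ...  */
        float x1 = a[i + 1];
        float x2 = a[i + 2];
        float x3 = a[i + 3];    /* ... before the first use  */
        sum += x0 + x1 + x2 + x3;
    }
    for (; i < n; i++)          /* remainder elements */
        sum += a[i];
    return sum;
}

On an in-order machine such as the Itanium 2, a schedule that instead interleaved each load with its immediate use would expose the full load latency on every iteration.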
“…Scalar accesses are made to the conventional L1 data cache, while vector accesses bypass the L1 and go directly to the L2 vector cache. This bypass is somewhat similar to the bypass implemented in the Itanium 2 processor for the floating-point register file [23]. If the L2 port is B×64-bit wide, these accesses are performed at a maximum rate of B elements per cycle when the stride is one, and at 1 element per cycle for any other stride.…”
Section: Memory Hierarchy Model
confidence: 99%
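The quoted access-rate rule can be written down directly as a small model (the port width B and the stride-one behavior come from the statement above; the function name and the ceiling formula are merely illustrative):

#include <stdint.h>

/* Cycles to fetch n 64-bit elements through a B*64-bit L2 port:
 * unit-stride accesses move B elements per cycle, any other
 * stride moves a single element per cycle. */
uint64_t vector_access_cycles(uint64_t n, uint64_t stride, uint64_t B)
{
    if (stride == 1)
        return (n + B - 1) / B;   /* ceil(n / B) */
    return n;
}

For example, with B = 4, fetching 256 unit-stride elements takes 64 cycles, while the same 256 elements at any other stride take 256 cycles.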
“…We assume the bus between L1 cache and L2 cache is 128 bits wide [13] and use this as the input data width of both the compressor and decompressor. Figure 1 illustrates the hardware compression process.…”
Section: C-Pack Hardware Implementation
confidence: 99%
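As a sketch of the 128-bit input granularity this statement fixes (the 16-byte chunking comes from the quoted bus width; the all-zero check is only a stand-in for C-Pack's pattern and dictionary matching, which this sketch does not implement):

/* Illustrative only, not the actual C-Pack hardware: walk a
 * 64-byte cache line in 128-bit (16-byte) words, the input
 * width the quoted statement takes from the L1-to-L2 bus. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LINE_BYTES 64   /* one cache line */
#define WORD_BYTES 16   /* 128-bit compressor input width */

static bool is_zero_word(const uint8_t *w)
{
    static const uint8_t zero[WORD_BYTES];  /* all zeros */
    return memcmp(w, zero, WORD_BYTES) == 0;
}

int main(void)
{
    uint8_t line[LINE_BYTES] = {0};
    line[20] = 0xAB;    /* make the second word non-zero */

    /* One 128-bit word is presented to the compressor per step. */
    for (size_t off = 0; off < LINE_BYTES; off += WORD_BYTES)
        printf("word at byte %2zu: %s\n", off,
               is_zero_word(line + off) ? "all-zero pattern"
                                        : "needs full encoding");
    return 0;
}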