Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors
DOI: 10.1109/iccd.2002.1106794

Data Cache Design Considerations for the Itanium® 2 Processor

Cited by 16 publications (8 citation statements)
References 7 publications
“…On the Itanium 2 processor, the memory subsystem is decoupled from the execution pipeline and can reorder requests based on a relaxed memory ordering model [12]. At least 48 outstanding requests can be active throughout the memory hierarchy without stalling the execution pipeline [16]. It is therefore highly beneficial for the code generator to increase memory-level parallelism by clustering loads in the schedule, which means issuing several load requests in parallel before the first use of a load.…”
Section: Theory of Latency-Tolerant Software Pipelining
confidence: 99%
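As an illustrative sketch of the load clustering this statement describes (plain C, not taken from the cited paper; the function and its shape are assumptions): the loads for several iterations are issued together before any of their results are used, so their latencies can overlap in the memory subsystem.

/* Load clustering sketch: four loads are issued back to back
 * before the first use, giving the memory subsystem several
 * independent requests to overlap (memory-level parallelism). */
#include <stddef.h>

float sum_clustered(const float *a, size_t n)
{
    float sum = 0.0f;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        float x0 = a[i];        /* loads clustered here ...  */
        float x1 = a[i + 1];
        float x2 = a[i + 2];
        float x3 = a[i + 3];    /* ... before the first use  */
        sum += x0 + x1 + x2 + x3;
    }
    for (; i < n; i++)          /* remainder elements */
        sum += a[i];
    return sum;
}

On an in-order machine such as the Itanium 2, a schedule that instead interleaved each load with its immediate use would expose the full load latency on every iteration.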
“…Scalar accesses are made to the conventional L1 data cache, while vector accesses bypass the L1 and go directly to the L2 vector cache. This bypass is somewhat similar to the bypass implemented in the Itanium 2 processor for the floating-point register file [23]. If the L2 port is B×64-bit wide, these accesses are performed at a maximum rate of B elements per cycle when the stride is one, and at 1 element per cycle for any other stride.…”
Section: Memory Hierarchy Model
confidence: 99%
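The quoted access-rate rule can be written down directly as a small model (the port width B and the stride-one behavior come from the statement above; the function name and the ceiling formula are merely illustrative):

#include <stdint.h>

/* Cycles to fetch n 64-bit elements through a B*64-bit L2 port:
 * unit-stride accesses move B elements per cycle, any other
 * stride moves a single element per cycle. */
uint64_t vector_access_cycles(uint64_t n, uint64_t stride, uint64_t B)
{
    if (stride == 1)
        return (n + B - 1) / B;   /* ceil(n / B) */
    return n;
}

For example, with B = 4, fetching 256 unit-stride elements takes 64 cycles, while the same 256 elements at any other stride take 256 cycles.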
“…We assume the bus between L1 cache and L2 cache is 128 bits wide [13] and use this as the input data width of both the compressor and decompressor. Figure 1 illustrates the hardware compression process.…”
Section: C-Pack Hardware Implementation
confidence: 99%
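As a sketch of the 128-bit input granularity this statement fixes (the 16-byte chunking comes from the quoted bus width; the all-zero check is only a stand-in for C-Pack's pattern and dictionary matching, which this sketch does not implement):

/* Illustrative only, not the actual C-Pack hardware: walk a
 * 64-byte cache line in 128-bit (16-byte) words, the input
 * width the quoted statement takes from the L1-to-L2 bus. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LINE_BYTES 64   /* one cache line */
#define WORD_BYTES 16   /* 128-bit compressor input width */

static bool is_zero_word(const uint8_t *w)
{
    static const uint8_t zero[WORD_BYTES];  /* all zeros */
    return memcmp(w, zero, WORD_BYTES) == 0;
}

int main(void)
{
    uint8_t line[LINE_BYTES] = {0};
    line[20] = 0xAB;    /* make the second word non-zero */

    /* One 128-bit word is presented to the compressor per step. */
    for (size_t off = 0; off < LINE_BYTES; off += WORD_BYTES)
        printf("word at byte %2zu: %s\n", off,
               is_zero_word(line + off) ? "all-zero pattern"
                                        : "needs full encoding");
    return 0;
}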