POWER7™, a Highly Parallel, Scalable Multi-Core High End Server Processor

Wendel, D.; Kalla, R.; Warnock, J.; Cargnoni, R.; Chu, Sam; Clabes, Joachim; Dreps, Daniel; Hrusecky, D.; Friedrich, Joshua; Islam, Saiful; Kahle, James A.; Leenstra, J.; Mittal, Gaurav; Paredes, Jose; Pille, J.; Restle, P.J.; Sinharoy, Balaram; Smith, Garriet W.; Starke, William J.; Taylor, Scott A.; Van Norstrand, J.; Weitzel, S.; Williams, Peter; Zyuban, Victor

doi:10.1109/jssc.2010.2080611

Cited by 53 publications

(40 citation statements)

References 10 publications

“…Step 8: Including states 11, 14 to TS which belong to x 2 variable, form conflict pairs (3)(4)(5)(6)(7)(8)(9)(10)(11)(6)(7)(8)(9)(10)(11)(12)(13)(14) for the already existing negative influence pairs of x 3 . This violates the unateness of the current US.…”

Section: Repeat Steps 4 5 6 (mentioning)

confidence: 99%

Optimal Unate Decomposition Method for Synthesis of Mixed CMOS VLSI Circuits

Kadiyala

Samanta

2016

VLSICS

“…This is posing increasing demands for devices operating at low power and high speed [5], [6]. With custom made chips coming into focus, the designers are pushing more and more functionalities on a single chip [7], [8], [9]. In fact, designers are now pushing billions of transistors in a single chip [10].…”

Section: Introduction (mentioning)

confidence: 99%

Optimal Unate Decomposition Method for Synthesis of Mixed CMOS VLSI Circuits

Kadiyala

Samanta

2016

VLSICS

“…On the other hand, components with significant state (such as cores or caches) take time to warm up, causing significant transients. While core state is relatively small, the memory wall keeps driving the amount of cache per core up [7,58], making cache-induced inertia a growing concern. Therefore, we focus on shared cache management.…”

Section: Quality Of Service In CMPs (mentioning)

confidence: 99%

“…Compared to the 2 MB LLC, all workloads exhibit (a) significantly lower miss rates, and (b) higher cross-request reuse, often going back many requests. Thus, larger eDRAM and 3D-stacked caches are making performance inertia a growing issue for latency-critical workloads (e.g., POWER7+ has 10 MB of LLC per core [58], and Haswell has up to 32 MB of L4 per core [7]). …”

Section: Performance Inertia (mentioning)

confidence: 99%

Ubik

Kasture

Sánchez

2014

Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems

115

Chip-multiprocessors (CMPs) must often execute workload mixes with different performance requirements. On one hand, user-facing, latency-critical applications (e.g., web search) need low tail (i.e., worst-case) latencies, often in the millisecond range, and have inherently low utilization. On the other hand, compute-intensive batch applications (e.g., MapReduce) only need high long-term average performance. In current CMPs, latency-critical and batch applications cannot run concurrently due to interference on shared resources. Unfortunately, prior work on quality of service (QoS) in CMPs has focused on guaranteeing average performance, not tail latency. In this work, we analyze several latency-critical workloads, and show that guaranteeing average performance is insufficient to maintain low tail latency, because microarchitectural resources with state, such as caches or cores, exert inertia on instantaneous workload performance. Last-level caches impart the highest inertia, as workloads take tens of milliseconds to warm them up. When left unmanaged, or when managed with conventional QoS frameworks, shared last-level caches degrade tail latency significantly. Instead, we propose Ubik, a dynamic partitioning technique that predicts and exploits the transient behavior of latency-critical workloads to maintain their tail latency while maximizing the cache space available to batch applications. Using extensive simulations, we show that, while conventional QoS frameworks degrade tail latency by up to 2.3×, Ubik simultaneously maintains the tail latency of latency-critical workloads and significantly improves the performance of batch applications.

“…However, the chip area occupied by caches is already more than half of the overall chip area [Wendel et al 2011;Fig. 3.…”

Section: Motivation (mentioning)

confidence: 99%

Temporal-based multilevel correlating inclusive cache replacement

Tian

Khan

Jiménez

2013

TACO

Inclusive caches have been widely used in Chip Multiprocessors (CMPs) to simplify cache coherence. However, they have poor performance compared with noninclusive caches not only because of the limited capacity of the entire cache hierarchy but also due to ignorance of temporal locality of the Last-Level Cache (LLC). Blocks that are highly referenced (referred to as hot blocks) are always hit in higher-level caches (e.g., L1 cache) and are rarely referenced in the LLC. Therefore, they become replacement victims in the LLC. Due to the inclusion property, blocks evicted from the LLC have to also be invalidated from higher-level caches. Invalidation of hot blocks from the entire cache hierarchy introduces costly off-chip misses that makes the inclusive cache perform poorly. Neither blocks that are highly referenced in the LLC nor blocks that are highly referenced in higher-level caches should be the LLC replacement victims. We propose temporal-based multilevel correlating cache replacement for inclusive caches to evict blocks in the LLC that are also not hot in higher-level caches using correlated temporal information acquired from all levels of a cache hierarchy with minimal overhead. Invalidation of these blocks does not hurt the performance. By contrast, replacing them as early as possible with useful blocks helps improve cache performance. Based on our experiments, in a dual-core CMP, an inclusive cache with temporal-based multilevel correlating cache replacement significantly outperforms an inclusive cache with traditional LRU replacement by yielding an average speedup of 12.7%, which is comparable to an enhanced noninclusive cache, while requiring less than 1% of storage overhead.
