Current developments in microprocessor design favor increased core counts over frequency scaling to improve processor performance and energy efficiency. Coupling this architectural trend with a message-passing protocol helps realize a data-center-on-a-die. The prototype chip (Figs. 5.7.1 and 5.7.7) described in this paper integrates 48 Pentium™-class IA-32 cores [1] on a 6×4 2D-mesh network of tiled core clusters with high-speed I/Os on the periphery. The chip contains 1.3B transistors. Each core has a private 256KB L2 cache (12MB total on-die) and is optimized to support a message-passing programming model whereby cores communicate through shared memory. A 16KB message-passing buffer (MPB) is present in every tile, giving a total of 384KB of on-die shared memory for increased performance. Power is kept to a minimum by transmitting dynamic, fine-grained voltage-change commands over the network to an on-die voltage-regulator controller (VRC). Further power savings are achieved through active frequency scaling at the tile granularity. Memory accesses are distributed over four on-die DDR3 controllers for an aggregate peak memory bandwidth of 21GB/s at 4× burst. Additionally, an 8-byte bidirectional system interface (SIF) provides 6.4GB/s of I/O bandwidth. The die area is 567mm² and is implemented in 45nm high-κ metal-gate CMOS [2].

The design is organized in a 6×4 2D array of tiles [3] to increase scalability. Each tile is a cluster of two enhanced IA-32 cores sharing a router for inter-tile communication. Cores operate in-order and are two-way superscalar. A 256-entry lookup table (LUT) extension of the 64-entry TLB translates 32-bit virtual addresses to 36-bit physical addresses. The separate L1 instruction and data caches are upsized to 16KB and support both write-through and write-back modes. Each L1 cache is backed by a unified 256KB 4-way write-back L2 cache. The L2 uses a 32-byte line size, matching the cache line size internal to the core, and has a 10-cycle hit latency. The L2 also uses in-line double-error-detection and single-error-correction for improved reliability and several programmable sleep modes for power reduction. The L2 cache controller features a time-out-and-retry mechanism for increased system reliability.

Shared-memory coherency is maintained through software protocols, such as MPI and OpenMP [4], in an effort to eliminate the communication and hardware overhead required for a memory-coherent 2D-mesh. A new message-passing memory type (MPMT) is introduced as an architectural enhancement to optimize data sharing under these software protocols. A single bit in a core's TLB designates MPMT cache lines. The MPMT retains all the performance benefits of a conventional cache line but distinguishes itself by addressing non-coherent shared memory. All MPMT cache lines are invalidated before reads/writes to the shared memory to prevent a core from working on stale data. A new instruction, MBINV, is added to the core to invalidate all MPMT cache entries in a single cycle. Subsequent reads/writes to inval...
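To make the MPMT/MBINV flow concrete, the following minimal C sketch models the pattern the abstract describes: invalidate MPMT-tagged lines, then access the tile's message-passing buffer. The names mbinv(), MPB_BASE, mpb_send(), and mpb_recv() are hypothetical stand-ins invented for illustration, not an API from the paper; the buffer is simulated with an ordinary array.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins: on real hardware MPB_BASE would be a virtual
 * address whose TLB entry has the MPMT bit set, and mbinv() would
 * compile to the single-cycle MBINV instruction. */
#define MPB_BYTES (16 * 1024)
static uint8_t fake_mpb[MPB_BYTES];          /* simulated 16KB tile MPB */
static volatile uint8_t *MPB_BASE = fake_mpb;

static inline void mbinv(void) {
    /* Stand-in: invalidate every MPMT-tagged cache line so the next
     * access refetches from the shared MPB rather than stale data. */
}

/* Sender side: copy a payload into the message-passing buffer. */
static void mpb_send(const void *msg, size_t len) {
    mbinv();                                  /* drop stale MPMT lines */
    memcpy((void *)MPB_BASE, msg, len);       /* write to shared MPB   */
}

/* Receiver side: read the payload back out of the shared buffer. */
static void mpb_recv(void *out, size_t len) {
    mbinv();                                  /* force refetch from MPB */
    memcpy(out, (const void *)MPB_BASE, len);
}
```

The key point the sketch captures is that correctness is software's responsibility: because MPMT memory is non-coherent, each side must invalidate before touching the shared buffer.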
The NAS Parallel Benchmarks (NPB) are a set of applications commonly used to evaluate parallel systems. We use the NPB OpenMP version to examine the performance of Intel's new Xeon Phi coprocessor, focusing especially on the many-core aspect of the Xeon Phi architecture. A first analysis studies scalability up to 244 threads on 61 cores and the impact of affinity settings on scaling, and compares the performance characteristics of the Xeon Phi and traditional Xeon CPUs. Applying several well-established optimization techniques allows us to identify common bottlenecks that specifically impede performance on the Xeon Phi but are not as severe on multi-core CPUs. We also find that many of the OpenMP-parallel loops are too short (in terms of the number of loop iterations) for balanced execution by 244 threads; new or redesigned benchmarks will be needed to accommodate the greatly increased number of cores and threads. Finally, we summarize our findings in a set of recommendations for performance optimization on the Xeon Phi.
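The short-loop imbalance the abstract reports is easy to reproduce with a toy OpenMP loop (this example is illustrative and not taken from the NPB codes): when the iteration count is below the thread count, static scheduling leaves some threads with no work at all.

```c
#include <omp.h>
#include <stdio.h>

#define N 240   /* fewer iterations than the 244 hardware threads */

int main(void) {
    double a[N];

    /* With 244 threads and static scheduling, each thread gets at most
     * one iteration and at least four threads sit idle, so this loop
     * cannot scale no matter how well it is otherwise optimized. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        a[i] = i * 0.5;

    printf("threads: %d, iterations: %d, a[0]=%f\n",
           omp_get_max_threads(), N, a[0]);
    return 0;
}
```

On the coprocessor one would typically run such a binary with the Intel OpenMP runtime's affinity controls, e.g. OMP_NUM_THREADS=244 and KMP_AFFINITY set to compact, scatter, or balanced, which is how the affinity sensitivity mentioned above can be observed.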
The growing gap between sustained and peak performance for scientific applications is a well-known problem in high-end computing. The recent development of parallel vector systems offers the potential to bridge this gap for many computational science codes and deliver a substantial increase in computing capabilities. This paper examines the intranode performance of the NEC SX-6 vector processor and the cache-based IBM Power3/4 superscalar architectures across a number of scientific computing areas. First, we present the performance of a microbenchmark suite that examines low-level machine characteristics. Next, we study the behavior of the NAS Parallel Benchmarks. Finally, we evaluate the performance of several scientific computing codes. Results demonstrate that the SX-6 achieves high performance on a large fraction of our applications and often significantly outperforms the cache-based architectures. However, certain applications are not easily amenable to vectorization and would require extensive algorithm and implementation reengineering to utilize the SX-6 effectively.
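As a generic illustration of why some codes resist vectorization (these loops are not drawn from the paper's applications), compare a dependence-free loop, which a vector unit like the SX-6's can process many elements at a time, with a loop carrying a recurrence, which forces serial execution:

```c
#define N 1000000
static double a[N], b[N];

/* Vectorizable: iterations are independent of one another. */
void scale(double s) {
    for (int i = 0; i < N; i++)
        a[i] = s * b[i];
}

/* Hard to vectorize: each iteration consumes the previous iteration's
 * result (a loop-carried dependence), serializing the computation. */
void prefix(void) {
    for (int i = 1; i < N; i++)
        a[i] = a[i] + a[i - 1];
}
```

Applications dominated by recurrences, indirect addressing, or short inner loops fall into the second category, which is why the abstract notes that extensive algorithmic reengineering would be needed to use the SX-6 effectively.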