Niket K. Choudhary scite author profile

ACM Trans. Archit. Code Optim.

Tuck

2012

Extracting high memory-level parallelism (MLP) is essential for speeding up single-threaded applications that are memory bound. At the same time, the projected amount of dark silicon (the fraction of the chip powered off) on a chip is growing. Hence, Asymmetric Multicore Processors (AMP) offer a unique opportunity to integrate many types of cores, each powered at different times, in order to optimize for different regions of execution. In this work, we quantify the potential for exploiting core customization to speed up programs during regions of high MLP. Based on a careful design space exploration, we discover that an AMP that includes a narrow and fast specialized core has the potential to efficiently exploit MLP. Using the results of our analysis, we design an AMP with both an MLP-specialized and an ILP-specialized core, and we propose a hardware-level application steering mechanism called Symbiotic Core Execution (SCE). SCE detects MLP phases by monitoring the L2 miss rate of the application, and it uses that information to steer the application to the best core. Interestingly, we show that L2 miss rates are important for deciding when an MLP region begins and when it ends. As a program runs, its execution migrates to a core customized for MLP during regions of high MLP; when the region ends, it is rescheduled on the core that fits the application characteristics. Compared to a monolithic core optimized for both modes of operation, our AMP design provides a harmonic mean performance improvement of 5.3% and 6.6% for SPEC2000 and SPEC2006, respectively, with a maximum speedup of 14.5%. For the same study, it achieves an 18.3% and 21.1% energy-delay² reduction for SPEC2000 and SPEC2006, respectively. Our findings yield an important message for designing AMPs with specialized cores: core customization enables efficient exploitation of MLP, and application steering mechanisms for MLP are simple to implement and effective.
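The steering idea in the abstract above (monitor the L2 miss rate, move to the MLP core when a high-MLP region begins, move back when it ends) can be illustrated with a minimal Python sketch. This is not the SCE hardware mechanism from the paper: the `Steerer` class, the misses-per-thousand-instructions metric, the two threshold values, and the use of separate enter/exit thresholds are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Steerer:
    """Toy model of miss-rate-driven core steering (hypothetical, not the paper's design)."""
    enter_mpki: float   # assumed threshold: L2 misses per 1000 instructions to enter the MLP core
    exit_mpki: float    # assumed threshold: at or below this, return to the ILP core
    core: str = "ILP"   # core the program currently runs on

    def observe(self, l2_misses: int, instructions: int) -> str:
        """Update the core choice from one monitoring interval and return it."""
        mpki = 1000.0 * l2_misses / instructions
        if self.core == "ILP" and mpki >= self.enter_mpki:
            self.core = "MLP"   # a high-MLP region appears to begin
        elif self.core == "MLP" and mpki <= self.exit_mpki:
            self.core = "ILP"   # the region appears to have ended
        return self.core


def steer_trace(intervals, enter_mpki=10.0, exit_mpki=5.0):
    """Return the chosen core for each (l2_misses, instructions) interval."""
    steerer = Steerer(enter_mpki, exit_mpki)
    return [steerer.observe(misses, instrs) for misses, instrs in intervals]


if __name__ == "__main__":
    # Made-up per-interval counters, for illustration only.
    trace = [(2, 1000), (20, 1000), (7, 1000), (3, 1000)]
    print(steer_trace(trace))
```

Using two thresholds keeps the sketch from bouncing between cores when the miss rate hovers near a single cutoff; whether SCE does anything similar is not stated in the abstract.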

Core-Selectability in Chip Multiprocessors

Najaf-abadi

Rotenberg

2009

The centralized structures necessary for the extraction of instruction-level parallelism (ILP) are consuming progressively smaller portions of the total die area of chip multiprocessors (CMP). The reason for this is that scaling these structures does not enhance general performance as much as scaling the cache and interconnect. However, the fact that these structures now consume less proportional die area opens an avenue to enhancing their performance through truly overcoming the one-size-fits-all approach to their design. This paper proposes core-selectability: incorporating differently designed cores that can be toggled into active employment. This enables differently customized ILP-extracting structures to be at hand in the system while not dramatically adding to the interconnect complexity. The design verification effort is minimized by separating the complexity of different core designs. Moreover, contrary to alternative approaches, the performance and power efficiency of the core designs are not compromised. Evaluation results are presented that show that, even when limiting the diversity between core designs to only the sizing of microarchitectural structures, core-selectability has the potential to provide notable performance enhancement (with an average of 10%) to scalable multithreaded applications, without increased concurrency. In addition, it can provide significantly greater throughput to multiprogrammed workloads by providing the potential for the system to transform into a heterogeneous design.

FPGA modeling of diverse superscalar processors

Dwiel

Rotenberg

2012

There is increasing interest in using Field Programmable Gate Arrays (FPGAs) as platforms for computer architecture simulation. This paper is concerned with modeling superscalar processors with FPGAs. To be transformative, the FPGA modeling framework should meet three criteria. (1) Configurable: The framework should be able to model diverse superscalar processors, like a software model. In particular, it should be possible to vary superscalar parameters such as fetch, issue, and retire widths, depths of pipeline stages, queue sizes, etc. (2) Automatic: The framework should be able to automatically and efficiently map any one of its superscalar processor configurations to the FPGA. (3) Realistic: The framework should model a modern superscalar microarchitecture in detail, ideally with prototype quality, to enable a new era and depth of microarchitecture research. A framework that meets these three criteria will enjoy the convenience of a software model, the speed of an FPGA model, and the experience of a prototype. This paper describes FPGA-Sim, a configurable, automatically FPGA-synthesizable, and register-transfer-level (RTL) model of an out-of-order superscalar processor. FPGA-Sim enables FPGA modeling of diverse superscalar processors out-of-the-box. Moreover, its direct RTL implementation yields the fidelity of a hardware prototype.

FabScalar: Automating Superscalar Core Design

et al. 2012

An Exploration of OpenCL on Multiple Hardware Platforms for a Numerical Relativity Application

Navada

Ginjupalli

et al. 2011

Currently there is considerable interest in making use of many-core processor architectures, such as Nvidia and AMD graphics processing units (GPUs) for scientific computing. In this work we explore the use of the Open Computing Language (OpenCL) for a typical Numerical Relativity application: a time-domain Teukolsky equation solver (a linear, hyperbolic, partial differential equation solver using finite-differencing). OpenCL is the only vendor-agnostic and multi-platform parallel computing framework that has been adopted by all major processor vendors. Therefore, it allows us to write portable source-code and run it on a wide variety of compute hardware and perform meaningful comparisons. The outcome of our experimentation suggests that it is relatively straightforward to obtain order-of-magnitude gains in overall application performance by making use of many-core GPUs over multi-core CPUs and this fact is largely independent of the specific hardware architecture and vendor. We also observe that a single high-end GPU can match the performance of a small-sized, message-passing based CPU cluster.