High Speed CPU Simulation Using LTU Dynamic Binary Translation

Jones, D. H.; Topham, Nigel

doi:10.1007/978-3-540-92990-1_6

Cited by 29 publications

(15 citation statements)

References 14 publications

“…[26,27], but has only been considered more recently for DBT systems [4,19,21]. The reason for this late adoption of region based policies has been presumably the increased latency for compilation and optimisation of larger regions, which has only been addressed recently with the introduction of decoupled, latency-hiding JIT task farms [4].…”

Section: Region Based DBT Systems (mentioning)

confidence: 99%

Efficient code generation in a region-based dynamic binary translator

Spink

Wagstaff

Franke

et al. 2014

Proceedings of the 2014 SIGPLAN/SIGBED Conference on Languages, Compilers and Tools for Embedded Systems

Self Cite

Region-based JIT compilation operates on translation units comprising multiple basic blocks and, possibly cyclic or conditional, control flow between these. It promises to reconcile aggressive code optimisation and low compilation latency in performance-critical dynamic binary translators. Whilst various region selection schemes and isolated code optimisation techniques have been investigated, it remains unclear how to best exploit such regions for efficient code generation. Complex interactions with indirect branch tables and translation caches can have adverse effects on performance if not considered carefully. In this paper we present a complete code generation strategy for a region-based dynamic binary translator, which exploits branch type and control flow profiling information to improve code quality for the common case. We demonstrate that using our code generation strategy a competitive region-based dynamic compiler can be built on top of the LLVM JIT compilation framework. For the ARM V5T target ISA and SPEC CPU 2006 benchmarks we achieve execution rates of, on average, 867 MIPS and up to 1323 MIPS on a standard X86 host machine, outperforming state-of-the-art QEMU-ARM by delivering a speedup of 264%.

“…frequently executed traces) are passed to the JIT DBT engine for native code generation. More recently [15] we have extended hotspot detection and JIT DBT with the capability to find and translate large translation units (LTU) consisting of multiple traced control-flow-graphs. By increasing the size of translation units it is possible to achieve significant speedups in simulation performance.…”

Section: Hotspot Detection and JIT Dynamic Binary Translation (mentioning)

confidence: 99%

“…During simulation the code generator accesses the object file and concatenates micro-operations to form a host function that emulates the target instructions within a block. More recent approaches to JIT DBT ISS are presented in [24,27,6,15,7]. Apart from different target platforms these approaches differ in the granularity of translation units (basic blocks vs pages or CFG regions) and their JIT code generation target language (ANSI-C vs LLVM IR).…”

Section: Fast Instruction Set Simulation (mentioning)

confidence: 99%

“…This paper is concerned with ultra-fast ISS using recently developed just-in-time (JIT) dynamic binary translation (DBT) techniques [27,6,15]. DBT combines interpretive and compiled simulation techniques in order to maintain high speed, observability and flexibility.…”

Section: Introduction (mentioning)

confidence: 99%

“…However, achieving accurate state and even more so microarchitectural observability remains in tension with high speed simulation. In fact, none of the existing JIT DBT ISS [27,6,15] maintains a detailed performance model.…”

Section: Introduction (mentioning)

confidence: 99%

Cycle-accurate performance modelling in an ultra-fast just-in-time dynamic binary translation instruction set simulator

Böhm

Franke

Topham

2010

2010 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation

Self Cite

Abstract. Instruction set simulators (ISS) are vital tools for compiler and processor architecture design space exploration and verification. State-of-the-art simulators using just-in-time (JIT) dynamic binary translation (DBT) techniques are able to simulate complex embedded processors at speeds above 500 MIPS. However, these functional ISS do not provide microarchitectural observability. In contrast, low-level cycle-accurate ISS are too slow to simulate full-scale applications, forcing developers to revert to FPGA-based simulations. In this paper we demonstrate that it is possible to run ultra-high speed cycle-accurate instruction set simulations surpassing FPGA-based simulation speeds. We extend the JIT DBT engine of our ISS and augment JIT generated code with a verified cycle-accurate processor model. Our approach can model any microarchitectural configuration, does not rely on prior profiling, instrumentation, or compilation, and works for all binaries targeting a state-of-the-art embedded processor implementing the ARCompact™ instruction set architecture (ISA). We achieve simulation speeds up to 88 MIPS on a standard x86 desktop computer for the industry standard EEMBC, COREMARK and BIOPERF benchmark suites.

SIMinG‐1k: A thousand‐core simulator running on general‐purpose graphical processing units

Raghav

Marongiu

Pinto

et al. 2012

Concurrency and Computation

This paper introduces SIMinG-1k, a manycore simulator infrastructure. SIMinG-1k is a graphics processing unit accelerated, parallel simulator for design-space exploration of large-scale manycore systems. It features an optimal trade-off between modeling accuracy and simulation speed. Its main objectives are high performance, flexibility, and ability to simulate thousands of cores. SIMinG-1k can model different architectures (currently, we support ARM (Available from: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0100i/index.html) and Intel x86) using a two-step approach where an architecture-specific front end is decoupled from a fast and parallel manycore virtual machine running on a graphical processing unit platform. We evaluate the simulator for target architectures with up to 4096 cores. Our results demonstrate very high scalability and almost linear speedup with simulation of an increasing number of cores.

…computing domain, from High Performance Computing (HPC) to embedded systems. Examples of similar architectures may include on-chip manycore accelerators such as the Hypercore Architecture Line from Plurality [1], Platform 2012 [2], or future evolutions of Intel's prototypes Larrabee [3] and Single-Chip Cloud Computer [4]. Dark silicon pushes innovations towards specialization where a single chip will include a spectrum of hardware accelerators to access and manipulate data in the cloud workloads with minimal energy. Simulation and virtual prototyping technology must obviously evolve to tackle the numerous challenges inherent in simulating such highly parallel architectures. Current state-of-the-art sequential simulators use SystemC [5], binary translation, smart sampling techniques, or tuneable abstraction levels for hardware description. These kinds of simulation technologies typically have to make a trade-off between simulation accuracy and simulation speed. Because very low-level hardware operations are accurately modeled, simulation is slow. This can lead to unacceptable performance when simulating a huge number of cores. Simulating a parallel system is an inherently parallel task. Individual processor simulation may independently proceed until the point where communication or synchronization with other processors is required. This is the key idea behind parallel simulation technology that distributes the simulation workload over parallel hardware resources. Parallel simulators utilize the availability of multiple physical processing nodes to increase the simulation rate. However, this requirement may turn out to be much too costly in case of adopting server clusters or computing farms as a host for running simulations. The high cost, in terms of increasing latency and decreasing bandwidth, typically leads to poor scalability because of the synchronization overhead when increasing the number of processing nodes. The development of computer technology has recently led to an unprecedented performance increase of general-purpose graphical processing units (GPGPU). Modern GPGPUs integrat...
