FPGA-Accelerated Simulation Technologies (FAST): Fast, Full-System, Cycle-Accurate Simulators

Chiou, Derek; Sunwoo, Dam; Kim, Joon-Soo; Patil, Nanasaheb M; Reinhart, William; Johnson, D. Eric; Keefe, Jebediah; Angepat, Hari

doi:10.1109/micro.2007.36

Cited by 125 publications

(66 citation statements)

References 16 publications

“…Kapre et al [9] and Chiou et al [10] expound on this work to achieve speedup of SPICE simulation using FPGA. On the other hand, efforts have been made to use analog simulation to build a database of gate power usage of the library cells for all subsequent simulations [11]-[13].…”

Section: B Gate Level Power Simulation (mentioning)

confidence: 99%

Accelerating physical level sub-component power simulation by online power partitioning

Bhargav; Kolb; Cho

2016

2016 17th International Symposium on Quality Electronic Design (ISQED)

Despite scalable computing resources and industry's ability to model individual components accurately in simulation, fine-grained physical-level power simulation of today's typical digital designs remains too expensive, especially when parametric variations in modern processes must be accounted for. This paper presents a new on-line modeling method, based on a linear solver, that exploits existing slow but accurate simulators to obtain highly accurate sub-circuit power estimates while using significantly less processing and memory resources. Comparative simulation runs on a number of benchmark circuits, using industry-standard SPICE simulators and our simulator, suggest that our technique can accelerate existing simulation by up to 100x with less memory and without sacrificing the accuracy of the power estimates.

“…So we can't simulate a bigger design with more than 100 cores. Many researchers propose to enhance NoC simulation using multi-threaded parallel core or accelerated using FPGAs and these are HAsim, Protoflex and FAST [20][21][22]. In SimFlex [23], another group of researcher uses statistical sampling of system to speed up the simulation of targeted multicore system.…”

Section: Previous Work (mentioning)

confidence: 99%

DDGSim: GPU based simulator for large multicore with bufferless NoC

Kumar; Sahu

2014

2014 Annual IEEE India Conference (INDICON)

In large-scale chip multicores, last-level cache management and the core interconnection network play important roles in performance and power consumption. Mesh interconnects are widely used in such systems because of their scalability and simplicity of design. Because the interconnection network occupies significant area and consumes a significant share of system power, a bufferless network is an appealing alternative design that reduces power consumption and hardware cost. We have designed and implemented a simulator for distributed cache management in large chip multicores whose cores are connected by a bufferless interconnection network. We have also redesigned and implemented DDGSim, a GPU-compatible parallel version of the same simulator built on the CUDA programming model. We simulated target chip multicores with up to 43,000 cores and achieved up to a 25x speedup on an NVIDIA GeForce GTX 690 GPU over serial simulation.

I. INTRODUCTION

In a large-scale chip multicore (LCMP), on-chip cache management and the interconnection network have a significant impact on the performance and power consumption of the system. As the core count increases, the pressure on the on-chip cache, in particular the last-level cache (LLC, here the L2 cache), increases significantly. A single physically shared cache performs poorly in terms of access latency and interference among cores. Such systems have several levels of cache: the first-level caches (L0 and L1) must be private, but the last-level cache (the L2 cache) must be larger and needs to be managed efficiently. A completely (physically) distributed cache may also be a poor fit in the many cases where a core requires a larger portion of the cache. A distributed cache suffers from increased local cache pressure and eviction. A logically shared, physically distributed (LSPD) model therefore captures both good access time and effective sharing.

Among the various last-level cache management models, the LSPD model is promising in terms of cache utilization and overall system performance. Its performance depends on effective policies for cache block placement, eviction, migration and directory management.

A mesh interconnection network is widely used to connect the cores in an LCMP because it provides a good trade-off between simplicity, scalability and maintainability. As stated in [1-3], the interconnection network in an LCMP occupies a significant amount of area and consumes around 40% of total power, so a bufferless network is a promising alternative design for reducing hardware cost and power consumption where overall network traffic is in the low to medium range.

It is a good idea to explore the full design space of cache management policies that suit large multicore systems connected by a bufferless interconnection network. In this work, we have therefore designed an efficient simulator for on-chip cache management of an LCMP whose cores are connected by a bufferless network. As most of available simul...

“…Another growing area of simulation development is the use of Field-Programmable Gate Arrays (FPGAs) as co-processors [7,8,9,10]. FPGAs have become popular because they can take advantage of the fine-grained parallelism between hardware structures.…”

Section: Introduction (mentioning)

confidence: 99%

“…We believe these properties make General Purpose GPU processing a strong candidate for simulating the timing partition of a manycore simulator (the CPU would simulate the functional partition). A similar partitioning is used in [7], except using an FPGA to accelerate timing simulation for a single core.…”

Section: Introduction (mentioning)

confidence: 99%

“…Similarly, much information obtained during timing-such as specific cache tags-are not needed by the functional simulator. This clean division of data results in many simulators being developed with a functional/timing partition, such as [3,6,7]. The functional partition does functional simulation and generates events relevant to the timing partition.…”

Section: Introduction (mentioning)

confidence: 99%
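The functional/timing partition described in the statement above can be illustrated with a minimal Python sketch. It is not taken from FAST or from the citing paper: the toy instruction format, the `Event` record, the `TimingPartition` class, and the latency values are all hypothetical and chosen only for illustration. The functional side keeps data values to itself and emits events; the timing side keeps cache tags to itself and counts cycles.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional, Tuple


@dataclass(frozen=True)
class Event:
    """What the functional side tells the timing side (hypothetical format)."""
    kind: str      # "alu", "load", or "store"
    address: int   # byte address for load/store, 0 for alu


def functional_partition(program: List[Tuple[str, int, int]]) -> Iterator[Event]:
    """Execute a toy program of (op, address, value) tuples.

    Data values (memory contents, accumulator) never leave this function;
    only the events needed for timing are emitted.
    """
    memory: Dict[int, int] = {}
    acc = 0
    for op, address, value in program:
        if op == "store":
            memory[address] = value
            yield Event("store", address)
        elif op == "load":
            acc = memory.get(address, 0)
            yield Event("load", address)
        else:  # treat anything else as an ALU operation
            acc += value
            yield Event("alu", 0)


class TimingPartition:
    """Assumed timing model: a direct-mapped tag array with fixed latencies."""

    def __init__(self, num_lines: int = 4, line_bytes: int = 16,
                 hit_cycles: int = 1, miss_cycles: int = 10) -> None:
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self.hit_cycles = hit_cycles
        self.miss_cycles = miss_cycles
        self.tags: List[Optional[int]] = [None] * num_lines
        self.cycles = 0

    def consume(self, event: Event) -> None:
        if event.kind == "alu":
            self.cycles += 1
            return
        line = event.address // self.line_bytes
        index = line % self.num_lines
        tag = line // self.num_lines
        if self.tags[index] == tag:
            self.cycles += self.hit_cycles
        else:
            self.tags[index] = tag      # fill the line on a miss
            self.cycles += self.miss_cycles


def simulate(program: List[Tuple[str, int, int]]) -> int:
    """Stream events from the functional side into the timing side."""
    timing = TimingPartition()
    for event in functional_partition(program):
        timing.consume(event)
    return timing.cycles
```

Because the two sides communicate only through the event stream, either one could in principle be moved to different hardware without changing the other, which is the property the cited simulators rely on.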

Scalable Multi-cache Simulation Using GPUs

Moeng; Cho; Melhem

2011

2011 IEEE 19th Annual International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems

Software simulation is the primary tool used for evaluation of processor design. Simulation offers better accuracy than analytical models and is an important evaluation step before actually fabricating a chip. Unfortunately, simulator speeds are slow: a conventional cycle-accurate simulator will be unable to keep up with increasing core counts in modern processor design.

Parallel simulation is one method for improving simulation speeds. Two major areas of parallel simulation research are multithreaded simulators and FPGAs as simulation accelerators. Multithreaded simulators can only extract coarse-grained parallelism and must sacrifice accuracy in order to scale well. FPGA-based simulators can extract fine-grained parallelism, but are expensive and difficult to program.

We propose using GPUs for architectural simulation, which can take advantage of a high degree of fine-grained parallelism. In addition, they are inexpensive and easier to program compared to FPGAs. To demonstrate our ideas, we implement a trace-driven many-cache simulator using NVIDIA's CUDA toolkit. GPU-accelerated cache simulation displays remarkable scaling with number of simulated caches when compared to serial CPU-only simulation.
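To make the idea of a trace-driven many-cache simulator concrete, here is a small serial Python sketch. It is not the authors' CUDA implementation: the fully associative LRU policy, the capacity, the line size, and the function names `simulate_cache` and `simulate_many` are assumptions made for illustration. The point it shows is that each simulated cache consumes only its own address trace, so the outer loop over caches has no cross-iteration dependencies and is the part a GPU version would be expected to spread across threads.

```python
from collections import OrderedDict
from typing import List, Tuple


def simulate_cache(trace: List[int], capacity_lines: int = 2,
                   line_bytes: int = 64) -> Tuple[int, int]:
    """Replay one address trace through a fully associative LRU cache.

    Returns (hits, misses). The cache geometry is an assumption of this sketch.
    """
    lines: "OrderedDict[int, None]" = OrderedDict()  # oldest entry first
    hits = misses = 0
    for address in trace:
        line = address // line_bytes
        if line in lines:
            hits += 1
            lines.move_to_end(line)          # mark as most recently used
        else:
            misses += 1
            if len(lines) >= capacity_lines:
                lines.popitem(last=False)    # evict the least recently used line
            lines[line] = None
    return hits, misses


def simulate_many(traces: List[List[int]]) -> List[Tuple[int, int]]:
    """Simulate one independent cache per trace.

    This loop is serial here; each iteration touches only its own state.
    """
    return [simulate_cache(trace) for trace in traces]
```

A call such as `simulate_many([[0, 64, 0, 128, 64], [0, 0, 0]])` replays two traces through two separate caches and returns one `(hits, misses)` pair per cache.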
