Henry Wong scite author profile

Abstract-Graphics processors (GPU) offer the promise of more than an order of magnitude speedup over conventional processors for certain non-graphics computations. Because the GPU is often presented as a C-like abstraction (e.g., Nvidia's CUDA), little is known about the characteristics of the GPU's architecture beyond what the manufacturer has documented. This work develops a microbechmark suite and measures the CUDA-visible architectural characteristics of the Nvidia GT200 (GTX280) GPU. Various undisclosed characteristics of the processing elements and the memory hierarchies are measured. This analysis exposes undocumented features that impact program performance and correctness. These measurements can be useful for improving performance optimization, analysis, and modeling on this architecture and offer additional insight on the decisions made in developing this GPU.

show abstract

Comparing FPGA vs. custom cmos and the impact on processor microarchitecture

Wong

Betz

Rose

2011

101

View full text Add to dashboard Cite

As soft processors are increasingly used in diverse applications, there is a need to evolve their microarchitectures in a way that suits the FPGA implementation substrate. This paper compares the delay and area of a comprehensive set of processor building block circuits when implemented on custom CMOS and FPGA substrates. We then use the results of these comparisons to infer how the microarchitecture of soft processors on FPGAs should be different from hard processors on custom CMOS.We find that the ratios of the area required by an FPGA to that of custom CMOS for different building blocks varies significantly more than the speed ratios. As area is often a key design constraint in FPGA circuits, area ratios have the most impact on microarchitecture choices. Complete processor cores have area ratios of 17-27× and delay ratios of 18-26×. Building blocks that have dedicated hardware support on FPGAs such as SRAMs, adders, and multipliers are particularly area-efficient (2-7× area ratio), while multiplexers and CAMs are particularly area-inefficient (> 100× area ratio), leading to cheaper ALUs, larger caches of low associativity, and more expensive bypass networks than on similar hard processors. We also find that a low delay ratio for pipeline latches (12-19×) suggests soft processors should have pipeline depths 20% greater than hard processors of similar complexity.

show abstract

Quantifying the Gap Between FPGA and Custom CMOS to Aid Microarchitectural Design

Wong

Betz

Rose

2014

IEEE Trans. VLSI Syst.

View full text Add to dashboard Cite

Abstract-This paper compares the delay and area of a comprehensive set of processor building block circuits when implemented on custom CMOS and FPGA substrates, then uses these results to show how soft processor microarchitectures should be different from those of hard processors. We find that the ratios of the custom CMOS versus FPGA area for different building blocks varies considerably more than the speed ratios, thus, area ratios have more impact on microarchitecture choices. Complete processor cores on an FPGA use 17-27× more area ("area ratio") than the same design implemented in custom CMOS. Building blocks with dedicated hardware support on FPGAs such as SRAMs, adders, and multipliers are particularly area-efficient (2-7×), while multiplexers and content-addressable memories (CAM) are particularly area-inefficient (>100×). Applying these results, we find out-of-order soft processors should use physical register file organizations to minimize CAM size.

show abstract

Efficient methods for out-of-order load/store execution for high-performance soft processors

Wong

Betz

Rose

2013

View full text Add to dashboard Cite

Abstract-As FPGAs continue to increase in size, it becomes increasingly feasible and desirable to build higher performance soft processors. Preserving the familiar single-threaded programming model can be done with an out of order processor. The ability to execute memory loads and stores out of order has a large impact on performance, but this is difficult to do because the dependencies between stores and loads are not known until addresses are computed. Out of order memory disambiguation is traditionally done with CAMs in the load queue and store queue, but large CAMs are inefficient on FPGAs. Store Queue Index Prediction (SQIP) and NoSQ propose to replace CAMs with store-load forwarding prediction and load re-execution.We implement four memory disambiguation schemes (in-order, CAM, SQIP, NoSQ) on a Stratix IV FPGA and evaluate the area and delay trade-offs. We find that CAM area and delay degrade quickly with load/store queue size, while SQIP and NoSQ have little degradation with queue size but have area overhead for prediction and predictor training hardware. SQIP and NoSQ use less area than CAMs beyond 32 and 16 load/store queue entries, respectively, and have higher maximum frequency beyond 4 entries.

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

customersupport@researchsolutions.com

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Henry Wong

Analyzing CUDA workloads using a detailed GPU simulator

Demystifying GPU microarchitecture through microbenchmarking

Comparing FPGA vs. custom cmos and the impact on processor microarchitecture

Quantifying the Gap Between FPGA and Custom CMOS to Aid Microarchitectural Design

Efficient methods for out-of-order load/store execution for high-performance soft processors

Contact Info

Product

Resources

About