Abstract: Device optimization of the supply voltage Vdd and threshold voltage Vt adds little chip area but has a large impact on power and performance in nanometer technologies. This paper studies the simultaneous evaluation of device and architecture optimization for FPGAs. We first develop an efficient yet accurate timing and power evaluation method, called the trace-based model. By collecting trace information from cycle-accurate simulation of placed and routed FPGA benchmark circuits and re-using the trace for different Vdd and Vt, we enable device and architecture co-optimization over hundreds of device and architecture combinations. The baseline FPGA architecture uses the VPR architecture model, the same LUT and cluster sizes as the Xilinx Virtex-II, the Vdd suggested by ITRS, and a Vt optimized for that architecture and Vdd. Compared to this baseline, architecture and device co-optimization reduces energy-delay product by 20.5% and chip area by 23.3%. Furthermore, when power-gating of unused logic blocks and interconnect switches is considered (in which case the sleep transistor size is a device tuning parameter), our co-optimization reduces energy-delay product by 55.0% and chip area by 8.2% compared to the baseline FPGA architecture. To the best of our knowledge, this is the first in-depth study in the literature on architecture and device co-optimization for FPGAs.
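The trace re-use idea can be illustrated with a minimal Python sketch. It is not the paper's model: the `Trace` fields, the alpha-power-law delay expression, the exponential leakage expression, and every constant (`alpha`, `k_delay`, `i0`, `n_vt`) are assumptions chosen only for illustration. The sketch shows device-independent quantities being collected once and then re-evaluated cheaply for many (Vdd, Vt) pairs to find the pair with the lowest energy-delay product.

```python
import math
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Trace:
    """Hypothetical device-independent summary of one simulation run."""
    switched_cap_per_cycle: float  # activity-weighted capacitance per cycle (F)
    critical_path_stages: int      # gate stages on the critical path
    leaking_devices: int           # devices that contribute leakage current


def evaluate(trace, vdd, vt, alpha=1.3, k_delay=1e-10, i0=1e-7, n_vt=0.039):
    """Return (energy, delay, energy-delay product) for one (Vdd, Vt) pair.

    All device models and constants here are illustrative assumptions.
    """
    # Assumed alpha-power-law stage delay: k * Vdd / (Vdd - Vt)^alpha
    stage_delay = k_delay * vdd / (vdd - vt) ** alpha
    delay = trace.critical_path_stages * stage_delay
    # Dynamic energy per cycle: C * Vdd^2
    dynamic_energy = trace.switched_cap_per_cycle * vdd ** 2
    # Assumed subthreshold leakage: I0 * exp(-Vt / (n * vT)) per device
    leakage_power = trace.leaking_devices * i0 * math.exp(-vt / n_vt) * vdd
    energy = dynamic_energy + leakage_power * delay
    return energy, delay, energy * delay


def best_point(trace, vdds, vts):
    """Sweep the grid, re-using the same trace, and minimize the EDP."""
    candidates = [(vdd, vt) for vdd, vt in product(vdds, vts) if vdd > vt]
    return min(candidates, key=lambda p: evaluate(trace, *p)[2])


# Example sweep over a small, made-up grid of device settings.
trace = Trace(switched_cap_per_cycle=2e-10,
              critical_path_stages=40,
              leaking_devices=1_000_000)
vdds = [0.8, 0.9, 1.0, 1.1]
vts = [0.20, 0.25, 0.30, 0.35]
best_vdd, best_vt = best_point(trace, vdds, vts)
```

Because the trace is computed once, each additional (Vdd, Vt) pair costs only a few arithmetic operations, which is what makes a sweep over hundreds of combinations practical.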