This paper presents a novel hierarchical design of an application-specific instruction set processor (ASIP) tailored for the Fast Fourier Transform (FFT), a kernel data transformation task in digital communication systems, to meet stringent requirements on throughput and flexibility. We restructure the FFT computation flow into a scalable array structure based on an 8-point butterfly unit (BU). The array expands easily along both the horizontal and vertical dimensions for any-point FFT computation, and every horizontal stage has the same structure. We incorporate custom register files to reduce memory accesses and derive a regular data addressing rule accordingly. Building on these microarchitecture modifications, we extend the instruction set with three custom instructions. Our FFT ASIP implementation improves data throughput by 866.5X, 5.9X, and 2.3X over a standard FFT software implementation, a TI DSP processor, and a commercial ASIP (Xtensa) implementation, respectively. Meanwhile, the area and power consumption overhead of the custom hardware is acceptable.
I. INTRODUCTION

The Fast Fourier Transform (FFT), the most time-consuming block in digital communication systems, faces demanding flexibility and throughput requirements in current 4G wireless systems. It should be easily reprogrammed or reconfigured to support various standards and operating modes. For example, the FFT size should be changeable across different operating environments. Existing application-specific integrated circuits (ASICs), although they provide high throughput, cannot offer the required flexibility and programmability. High throughput is equally important for FFT computation. Among various communication standards, the 802.15.3 Multi-band Ultra Wide Band (MB-UWB) standard has the highest data rates, ranging from 200 to 480 Mbps [1]. This requires a throughput of more than 409.6 M sample points per second for FFT computation. Current commercial DSPs, such as Sandbridge's Sandblaster and SODA, have to apply multi-core techniques to meet the real-time processing requirement [2], [3]. Other DSPs, such as TI's TMS320c6X processor, achieve good performance in embedded applications [4], but they use 256-bit long instructions, which are not energy-efficient for domain-specific applications. As an intermediate design option between the ASIC and the general-purpose DSP, the application-specific instruction set processor (ASIP) has emerged in recent decades. It offers both good flexibility through software control on the base core and high throughput through hardware acceleration, which makes it suitable for FFT algorithms [5].

Recently, a variety of ASIP implementations have been presented for FFT algorithms; they fall into two categories.