Radix-2^k delay feedback and Radix-K delay commutator are the best-known pipeline architectures for FFT design. This paper proposes a novel Radix-2^2 multiple delay commutator architecture that exploits the advantages of the Radix-2^2 algorithm, such as simple butterflies and a lower memory requirement. The architecture is therefore more hardware efficient when parallelism is implemented for higher throughput using multiple delay commutators or feed-forward data paths. We also propose an improved memory-based input scheduling algorithm that eliminates the energy required to shift data along the delay lines. A 1024-point FFT processor with two parallel data paths is implemented in a 65 nm CMOS process. The FFT processor occupies an area of 3.6 mm^2 and operates successfully over a supply voltage range of 0.4 V to 1 V, with a maximum clock frequency of 600 MHz. For low-voltage, high-performance applications, the processor operates at 400 MHz and consumes 60.3 mW, or 77.2 nJ/FFT, while generating 800 Msamples/s at a 0.6 V supply.
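
The throughput and energy-per-transform figures quoted for the 0.6 V operating point can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check, not part of the proposed design; it assumes that each of the two parallel data paths delivers one sample per clock cycle and that one 1024-point transform is completed every 1024 output samples. The variable names are illustrative.

```python
# Figures quoted in the abstract for the 0.6 V operating point.
N_POINTS = 1024          # FFT length
PARALLEL_PATHS = 2       # parallel data paths
CLOCK_HZ = 400e6         # clock frequency at 0.6 V
POWER_W = 60.3e-3        # power consumption at 0.6 V

# Assumption: each data path produces one sample per clock cycle.
throughput = PARALLEL_PATHS * CLOCK_HZ        # samples per second

# Assumption: one transform completes every N_POINTS output samples.
time_per_fft = N_POINTS / throughput          # seconds per FFT

# Energy per transform is power multiplied by time per transform.
energy_per_fft = POWER_W * time_per_fft       # joules per FFT

print(f"throughput:     {throughput / 1e6:.0f} Msamples/s")
print(f"time per FFT:   {time_per_fft * 1e6:.2f} us")
print(f"energy per FFT: {energy_per_fft * 1e9:.1f} nJ")
```

Under these assumptions the computed throughput matches the stated 800 Msamples/s, and the energy per transform rounds to the stated 77.2 nJ/FFT.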