A new class of fast Fourier transform (FFT) architecture, based on the use of distributed memories, is proposed for field-programmable gate arrays (FPGAs). Prominent features are high clock speeds, programmability, reduced look-up-table (LUT) and register usage, simplicity of design, and a capability to do both power-of-two and non-power-of-two FFTs. Higher clock speeds are a consequence of new algorithms and a more fine-grained structure compared to traditional pipelined FFTs, so clock speeds are typically >500 MHz in 65 nm FPGA technology. The programmability derives from the memory-based architecture, which is also scalable. Reduced LUT and register usage arises from a unique methodology to control word growth during computation that achieves high dynamic range, along with inherent systolic circuit characteristics: simple, regular, uniform arrays of processing elements, connected in nearest-neighbor fashion to minimize wiring lengths. The circuit goal was to maximize throughput and minimize the use of the FPGA LUT and register logic fabric. Comparison results from seven different designs, covering a spectrum of functionality (fixed-size, variable, floating-point and variable non-power-of-two FFTs), different FPGA vendors (Intel and Xilinx) and different FPGA types, showed increases in throughput per logic cell up to 181% with an average improvement of 94%.