Multi-Channel FFT Architectures Designed via Folding and Interleaving

Unnikrishnan, Nanda K.; Parhi, Keshab K.

doi:10.1109/iscas48785.2022.9937347

Cited by 4 publications

(1 citation statement)

References 19 publications


“…The NTT and iNTT designs are inspired by the design of parallel FFT architectures based on folding sets [26], [27]. Parallel NTT architectures based on folding sets was presented in our earlier work [28].…”

Section: Ntt Pementioning

confidence: 99%

PaReNTT: Low-Latency Parallel Residue Number System and NTT-Based Long Polynomial Modular Multiplication for Homomorphic Encryption

Tan, Chiu, Wang et al. 2023

Preprint


High-speed long polynomial multiplication is important for applications in homomorphic encryption (HE) and lattice-based cryptosystems. This paper addresses low-latency hardware architectures for long polynomial modular multiplication using the number-theoretic transform (NTT) and inverse NTT (iNTT). The Chinese remainder theorem (CRT) is used to decompose the modulus into multiple smaller moduli. Our proposed architecture, namely PaReNTT, makes four novel contributions. First, parallel NTT and iNTT architectures are proposed to reduce the number of clock cycles needed to process the polynomials. This can enable real-time processing for HE applications, as the number of clock cycles to process the polynomial is inversely proportional to the level of parallelism. Second, the proposed architecture eliminates the need for permuting the NTT outputs before their product is input to the iNTT. This reduces latency by n/4 clock cycles, where n is the length of the polynomial, and reduces the buffer requirement by one delay-switch-delay circuit of size n. Third, an approach to select special moduli is presented where the moduli can be expressed in terms of a few signed power-of-two terms. Fourth, novel architectures are proposed for pre-processing, which computes the residual polynomials using the CRT, and for post-processing, which combines the residual polynomials. These architectures significantly reduce the area consumption of the pre-processing and post-processing steps. The proposed long modular polynomial multiplications are ideal for applications that require low latency and high sample rate, as these feed-forward architectures can be pipelined at arbitrary levels. The experimental results show that, for a two-parallel architecture with a polynomial degree of 4096 and a word-length of 192 bits, the proposed architecture reduces the area-block processing product (ABP) by a factor of 43.2 with respect to LUTs and 11.5 with respect to DSPs, compared with a design that does not use the CRT.
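The flow the abstract describes (split the modulus with the CRT, multiply per modulus with an NTT and iNTT, then recombine) can be sketched in software. The following is a minimal illustration only, not the PaReNTT hardware architecture. Everything specific in it is an assumption made for the example: the ring is taken to be Z_Q[x]/(x^n + 1), the toy moduli 17 and 97 and the length n = 4 are chosen for readability, the transform is a direct O(n^2) evaluation rather than a parallel butterfly network, and all function names are hypothetical.

```python
# Toy sketch: polynomial multiplication mod (x^n + 1, Q) with Q = q1 * q2,
# done as two small NTT-based multiplications that are recombined by the CRT.
# Requires Python 3.8+ for pow(x, -1, q) (modular inverse).

def find_psi(n, q):
    """Brute-force a primitive 2n-th root of unity mod prime q (n a power of two)."""
    for x in range(2, q):
        if pow(x, n, q) == q - 1:  # x^n = -1 implies order exactly 2n
            return x
    raise ValueError("q - 1 must be divisible by 2n")

def ntt(a, psi, q):
    """Evaluate a(x) at the odd powers of psi, i.e. at the roots of x^n + 1."""
    n = len(a)
    return [sum(a[j] * pow(psi, (2 * k + 1) * j, q) for j in range(n)) % q
            for k in range(n)]

def intt(a_hat, psi, q):
    """Inverse of ntt(): interpolate the coefficients back."""
    n = len(a_hat)
    psi_inv, n_inv = pow(psi, -1, q), pow(n, -1, q)
    return [n_inv * sum(a_hat[k] * pow(psi_inv, (2 * k + 1) * j, q)
                        for k in range(n)) % q
            for j in range(n)]

def polymul_ntt(a, b, q):
    """Multiply a and b mod (x^n + 1, q) via pointwise products in the NTT domain."""
    psi = find_psi(len(a), q)
    a_hat = ntt([x % q for x in a], psi, q)  # pre-processing: reduce mod q
    b_hat = ntt([x % q for x in b], psi, q)
    return intt([x * y % q for x, y in zip(a_hat, b_hat)], psi, q)

def crt_pair(r1, q1, r2, q2):
    """Combine residues r1 mod q1 and r2 mod q2 into one value mod q1 * q2."""
    return r1 + q1 * (((r2 - r1) * pow(q1, -1, q2)) % q2)

def polymul_crt(a, b, q1, q2):
    """Per-modulus NTT multiplication, then CRT post-processing per coefficient."""
    c1, c2 = polymul_ntt(a, b, q1), polymul_ntt(a, b, q2)
    return [crt_pair(x, q1, y, q2) for x, y in zip(c1, c2)]

def polymul_schoolbook(a, b, Q):
    """Reference O(n^2) multiplication mod (x^n + 1, Q) for checking."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                c[i + j] += a[i] * b[j]
            else:
                c[i + j - n] -= a[i] * b[j]  # x^n wraps around as -1
    return [x % Q for x in c]

# Toy moduli, each a few signed power-of-two terms: 17 = 2^4 + 1, 97 = 2^7 - 2^5 + 1.
Q1, Q2 = 17, 97
a, b = [1, 2, 3, 4], [5, 6, 7, 8]
result = polymul_crt(a, b, Q1, Q2)
reference = polymul_schoolbook(a, b, Q1 * Q2)
```

Here `result` and `reference` agree, which is the property the CRT decomposition relies on: each small-modulus multiplication is independent, so wide arithmetic on Q is replaced by narrower arithmetic on q1 and q2 plus a recombination step.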
