A High-Level Synthesis Approach to the Software/Hardware Codesign of NTT-Based Post-Quantum Cryptography Algorithms

Nguyen, Duc Tri; Dang, Viet Ba; Gaj, Kris

doi:10.1109/icfpt47387.2019.00070

Cited by 25 publications

(15 citation statements)

References 7 publications


“…The presented components are able to run at the clock frequency 637 MHz while the value reached by Nguyen et al [3] is only 445 MHz. It is remarkable that Nguyen et al [2] reached exactly the same latency for the both implementation strategies.…”

Section: Implementation Results and Comparison (mentioning)

confidence: 70%

“…The proposed HDL-based design needs on average 20 times less hardware resources and is 6 times faster than the implementation of Nejatollahi et al [3]. In comparison to the implementations (HLS-based and HDL-based) of Nguyen et al [2], their designs have comparable results of the hardware utilization but they are not so much optimized in terms of the clock frequency and the speed as the presented ones. The presented components are able to run at the clock frequency 637 MHz while the value reached by Nguyen et al [3] is only 445 MHz.…”

Section: Implementation Results and Comparison (mentioning)

confidence: 82%

“…However, neither of them mention the implementation of the inverse NTT. Nguyen et al [2] compare the efficiency of the HDL (Hardware Description Language) and HLS (High Level Synthesis) implementations on the UltraScale+ architecture. Their implementation does not fully meet the reference implementation because they replaced the Montgomery reductions by so called K-reductions.…”

Section: Related Work (mentioning)

confidence: 99%
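The statement above contrasts Montgomery reduction with "K-reduction". The sketch below illustrates the difference, assuming that "K-reduction" means the K-RED operation of Longa and Naehrig for primes of the form q = k·2^m + 1; the cited paper's exact variant is not given here. The Dilithium modulus q = 8380417 = 1023·2^13 + 1 has this form. The function names and the choice R = 2^32 are illustrative assumptions.

```python
# Toy comparison of two modular reduction styles (illustrative sketch only).
Q = 8380417            # Dilithium modulus, q = 2^23 - 2^13 + 1
K, M = 1023, 13        # q = K * 2^M + 1
R_BITS = 32            # assumed Montgomery radix R = 2^32
R = 1 << R_BITS
NEG_QINV = (-pow(Q, -1, R)) % R   # -q^{-1} mod R (needs Python 3.8+)

def montgomery_reduce(a):
    """Return a * R^{-1} mod Q in [0, Q), for 0 <= a < Q * R."""
    m = (a * NEG_QINV) % R        # chosen so that a + m*Q is divisible by R
    t = (a + m * Q) >> R_BITS     # exact division by R
    return t - Q if t >= Q else t

def k_red(c):
    """Return a value congruent to K * c mod Q (not fully reduced)."""
    c0 = c & ((1 << M) - 1)       # low M bits
    c1 = c >> M                   # remaining high bits
    # K * 2^M = Q - 1, which is -1 mod Q, so K*c = K*c0 - c1 (mod Q)
    return K * c0 - c1
```

Both routines return a scaled residue (by R^{-1} and by K respectively), so a design that swaps one for the other no longer matches the reference arithmetic step for step, which is consistent with the remark in the citation statement.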


VHDL-based Implementation Of NTT On FPGA

Jedlička

2021

Proceedings II of the 27th Conference STUDENT EEICT 2021


This paper focuses on an efficient hardware-accelerated implementation of the NTT (Number Theoretic Transform) and the inverse NTT (NTT⁻¹) on an FPGA (Field Programmable Gate Array). The implementation is intended for use in lattice-based cryptography schemes, e.g. the CRYSTALS-Dilithium digital signature scheme, which is one of the finalists of the third round of the post-quantum standardization process run by NIST (the National Institute of Standards and Technology). The implementation of NTT (NTT⁻¹) requires 1798 (2547) Look-Up Tables (LUTs), 2532 (3889) Flip-Flops (FFs) and 48 (84) Digital Signal Processing blocks (DSPs). The latency of the design is 502 (517) clock cycles at a frequency of 637 MHz on the Xilinx Virtex UltraScale+ architecture, which makes the presented implementation currently the fastest one. For the inverse NTT, this is the first implementation at all.


“…A methodology was proposed in [20] for optimizing NTT loops structure, via loop flattening and trip count reduction to optimize the synthesized code via HLS adding directives with various loop expansion approaches. In [21] an NTT HLS implementation is performed using Vivado 2018.3 on a Zynq UltraScale+ MPSoC and show a penalty of 2% to 5% for latency versus an RTL design and in [22] there is comparison between HLS-ready code using design space exploration based on directives vs. HLS block diagram design. Ozcan and Aysu [2] modularized the NTT algorithm and measured that the most computationally intensive part of it is the Butterfly section, which accounts for 78% of all cycles.…”

Section: Number Theoretic Transform (NTT) A. Definitions (mentioning)

confidence: 99%

High-Level Synthesis design approach for Number-Theoretic Transform Implementations

El-Kady, Fournaris, Tsakoulis et al.

2021

2021 IFIP/IEEE 29th International Conference on Very Large Scale Integration (VLSI-SoC)


Lattice-based cryptography performs polynomial multiplication using the Number Theoretic Transform (NTT) in order to reduce the polynomial multiplication complexity from O(n²) to O(n log n). The NTT has been at the center of investigation in the cryptography space, as it is applied in many cryptographic schemes such as hash functions, homomorphic encryption, key-encapsulation mechanisms, and digital signatures. A common approach for rapid production of hardware designs starts from semi-automatic software production, as supported by the Xilinx High-Level Synthesis (HLS) toolchain or similar tools. Most of the time this approach requires careful modifications (e.g. code modification, loop reordering, loop flattening, removing dependencies, loop pipelining, loop unrolling) in order to achieve a design with performance comparable to a hand-crafted Register-Transfer Level (RTL) design. In this paper a design solution is proposed that resolves the data and loop-carried dependencies of the Cooley-Tukey NTT algorithm by assisting the HLS synthesizer to produce efficient designs in terms of latency and resources. The proposed work has been evaluated using the Dilithium digital-signature scheme NTT version (n = 256, Q of 23 bits), and is shown to achieve a 20-50% improvement in terms of latency (without really affecting the resources) compared to other existing HLS-based NTT solutions in the literature.
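The loop structure discussed in the abstract above can be made concrete with a minimal software sketch of an iterative Cooley-Tukey NTT. It is a generic textbook formulation, not the design proposed in the paper, and it uses toy parameters (q = 17, n = 8, ω = 2) rather than the Dilithium ones. The three nested loops (stage, block, butterfly) and the running twiddle factor `w` are the kind of loop nest and loop-carried dependency that HLS directives have to deal with.

```python
def bit_reverse_copy(a):
    """Return a copy of a (length a power of two) in bit-reversed index order."""
    n = len(a)
    bits = n.bit_length() - 1
    out = [0] * n
    for i in range(n):
        r = int(format(i, "0{}b".format(bits))[::-1], 2) if bits else 0
        out[r] = a[i]
    return out

def ntt_cooley_tukey(a, omega, q):
    """Compute A[k] = sum_j a[j] * omega^(j*k) mod q in O(n log n) steps."""
    n = len(a)
    a = bit_reverse_copy(a)
    length = 2
    while length <= n:                        # loop 1: log2(n) stages
        w_len = pow(omega, n // length, q)    # primitive length-th root of unity
        for start in range(0, n, length):     # loop 2: blocks within a stage
            w = 1
            for j in range(length // 2):      # loop 3: butterflies in a block
                u = a[start + j]
                v = a[start + j + length // 2] * w % q
                a[start + j] = (u + v) % q                 # butterfly, upper output
                a[start + j + length // 2] = (u - v) % q   # butterfly, lower output
                w = w * w_len % q             # loop-carried twiddle update
        length *= 2
    return a

def ntt_naive(a, omega, q):
    """O(n^2) reference transform used only for checking."""
    n = len(a)
    return [sum(a[j] * pow(omega, j * k, q) for j in range(n)) % q
            for k in range(n)]
```

With q = 17 and ω = 2 (a primitive 8th root of unity modulo 17), `ntt_cooley_tukey(x, 2, 17)` should agree with `ntt_naive(x, 2, 17)` for any length-8 input `x`.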


“…However, using the power-of-two moduli cannot leverage the acceleration from the NTT-based polynomial multiplication without further expensive transformation. NTT-based polynomial multiplication has been widely applied in many lattice-based cryptography schemes [7], [25], [26], [27], [28], [29], [30]. The concept of NTT is to convert all the coefficients of the polynomials into the NTT-domain, which will then go through a direct coefficient-wise multiplication, and followed by an inverse NTT transform to recover the produced coefficients in the original algebraic domain polynomial.…”

Section: Modular Polynomial Multiplication (mentioning)

confidence: 99%
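The three steps named in the statement above (forward transform, coefficient-wise multiplication, inverse transform) can be sketched as follows. This is an illustrative toy, not code from any of the cited works: it assumes multiplication modulo x^n + 1 (the negacyclic case common in lattice schemes), uses tiny parameters (q = 17, n = 4, ψ = 2), and uses a naive O(n²) transform for readability instead of a fast butterfly network.

```python
Q = 17                   # toy prime modulus (assumption, not a real scheme parameter)
N = 4                    # toy polynomial length
PSI = 2                  # primitive 2N-th root of unity mod Q (2^4 = 16 = -1 mod 17)
OMEGA = PSI * PSI % Q    # primitive N-th root of unity

def inv(x):
    """Modular inverse via Fermat's little theorem (Q is prime)."""
    return pow(x, Q - 2, Q)

def transform(a, root):
    """Naive O(n^2) NTT: A[k] = sum_j a[j] * root^(j*k) mod Q."""
    return [sum(a[j] * pow(root, j * k, Q) for j in range(N)) % Q
            for k in range(N)]

def poly_mul_ntt(a, b):
    """Multiply a and b modulo (x^N + 1, Q) in the NTT domain."""
    # Pre-scale by powers of psi so the cyclic transform yields a negacyclic product.
    a_s = [a[i] * pow(PSI, i, Q) % Q for i in range(N)]
    b_s = [b[i] * pow(PSI, i, Q) % Q for i in range(N)]
    A, B = transform(a_s, OMEGA), transform(b_s, OMEGA)   # step 1: forward NTT
    C = [x * y % Q for x, y in zip(A, B)]                 # step 2: pointwise product
    c_s = transform(C, inv(OMEGA))                        # step 3: inverse NTT
    n_inv, psi_inv = inv(N), inv(PSI)
    return [c_s[i] * n_inv * pow(psi_inv, i, Q) % Q for i in range(N)]

def poly_mul_schoolbook(a, b):
    """O(n^2) reference product modulo (x^N + 1, Q)."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            if i + j < N:
                c[i + j] = (c[i + j] + a[i] * b[j]) % Q
            else:                                         # x^N wraps around to -1
                c[i + j - N] = (c[i + j - N] - a[i] * b[j]) % Q
    return c
```

For instance, (1 + x)·x³ = x³ + x⁴, and since x⁴ = −1 modulo x⁴ + 1 the result is x³ − 1, so `poly_mul_ntt([1, 1, 0, 0], [0, 0, 0, 1])` gives `[16, 0, 0, 1]` with 16 standing for −1 modulo 17.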

High-Speed VLSI Architectures for Modular Polynomial Multiplication via Fast Filtering and Applications to Lattice-Based Cryptography

Tan, Wang, Lao et al.

2021

Preprint


This paper presents a low-latency hardware accelerator for modular polynomial multiplication for lattice-based post-quantum cryptography and homomorphic encryption applications. The proposed novel modular polynomial multiplier exploits the fast finite impulse response (FIR) filter architecture to reduce the computational complexity of the schoolbook modular polynomial multiplication. We also extend this structure to fast M-parallel architectures while achieving low latency, high speed, and full hardware utilization. We comprehensively evaluate the performance of the proposed architectures under various polynomial settings as well as in the Saber scheme for post-quantum cryptography as a case study. The experimental results show that our design reduces the computational time and area-time product by 61% and 32%, respectively, compared to the state-of-the-art designs.

