This paper presents faster implementations of the latticebased schemes Dilithium and Kyber on the Cortex-M4. Dilithium is one of three signature finalists in the NIST post-quantum project (NIST PQC), while Kyber is one of four key-encapsulation mechanism (KEM) finalists.Our optimizations affect the core polynomial arithmetic involving number-theoretic transforms in both schemes. Our main contributions are threefold: We present a faster signed Barrett reduction for Kyber, propose to switch to a smaller prime modulus for the polynomial multiplications cs1 and cs2 in the signing procedure of Dilithium, and apply various known optimizations to the polynomial arithmetic in both schemes. Using a smaller prime modulus is particularly interesting as it allows using the Fermat number transform resulting in especially fast code.We outperform the state-of-the-art for both Dilithium and Kyber. For Dilithium, our NTT and iNTT are faster by 5.2% and 5.7%. Switching to a smaller modulus results in speed-up of 33.1%-37.6% for the relevant operations (sum of the base multiplication and iNTT) in the signing procedure. For Kyber, the optimizations results in 15.9%-17.8% faster matrix-vector product which is a core arithmetic operation in Kyber.
The U.S. National Institute of Standards and Technology (NIST) has designated ARM microcontrollers as an important benchmarking platform for its Post-Quantum Cryptography standardization process (NISTPQC). In view of this, we explore the design space of the NISTPQC finalist Saber on the Cortex-M4 and its close relation, the Cortex-M3. In the process, we investigate various optimization strategies and memory-time tradeoffs for number-theoretic transforms (NTTs).Recent work by [Chung et al., TCHES 2021 (2)] has shown that NTT multiplication is superior compared to Toom–Cook multiplication for unprotected Saber implementations on the Cortex-M4 in terms of speed. However, it remains unclear if NTT multiplication can outperform Toom–Cook in masked implementations of Saber. Additionally, it is an open question if Saber with NTTs can outperform Toom–Cook in terms of stack usage. We answer both questions in the affirmative. Additionally, we present a Cortex-M3 implementation of Saber using NTTs outperforming an existing Toom–Cook implementation. Our stack-optimized unprotected M4 implementation uses around the same amount of stack as the most stack-optimized Toom–Cook implementation while being 33%-41% faster. Our speed-optimized masked M4 implementation is 16% faster than the fastest masked implementation using Toom–Cook. For the Cortex-M3, we outperform existing implementations by 29%-35% in speed. We conclude that for both stack- and speed-optimization purposes, one should base polynomial multiplications in Saber on the NTT rather than Toom–Cook for the Cortex-M4 and Cortex-M3. In particular, in many cases, multi-moduli NTTs perform best.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.