2020
DOI: 10.46586/tches.v2020.i3.336-357

Cortex-M4 optimizations for {R,M} LWE schemes

Abstract: This paper proposes various optimizations for lattice-based key encapsulation mechanisms (KEM) using the Number Theoretic Transform (NTT) on the popular ARM Cortex-M4 microcontroller. Improvements come in the form of faster code using more efficient modular reductions, optimized small-degree polynomial multiplications, and more aggressive layer merging in the NTT, but also in the form of reduced stack usage. We test our optimizations in software implementations of Kyber and NewHope, both round 2 candidates i…

citations
Cited by 37 publications
(25 citation statements)
references
References 7 publications
“…We implement all the known speed optimizations in the literature for Cortex-M4 and Cortex-M3. On Cortex-M4, our 32-bit butterfly is from [ACC+21] and our 16-bit butterfly is from [ABCG20]. We additionally find a slightly faster computation for the cyclic version used in the iNTT.…”
Section: NTTs for MatrixVectorMul (mentioning)
confidence: 92%
“…We implement CT butterflies with s{mul, mla}{b,t}{b,t}. Furthermore, we can use sadd16 and ssub16 to do add-sub pairs in parallel [ABCG20].…”
Section: -Bit CT Butterflies (mentioning)
confidence: 99%
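The parallel add-sub pair mentioned in the statement above can be illustrated with a portable C model. This is only a sketch: `sadd16_model`, `ssub16_model`, `pack16`, `lo16`, and `hi16` are hypothetical helper names, not code from any of the cited papers, and the model mirrors the wrapping halfword arithmetic of the `sadd16` and `ssub16` instructions without the GE flags they also set.

```c
#include <stdint.h>

/* Portable model of SADD16: two independent 16-bit additions on the
   halves of a 32-bit word, each wrapping modulo 2^16 (GE flags ignored). */
static uint32_t sadd16_model(uint32_t x, uint32_t y) {
    uint32_t lo = (x + y) & 0xFFFFu;
    uint32_t hi = ((x >> 16) + (y >> 16)) & 0xFFFFu;
    return (hi << 16) | lo;
}

/* Portable model of SSUB16: two independent 16-bit subtractions. */
static uint32_t ssub16_model(uint32_t x, uint32_t y) {
    uint32_t lo = (x - y) & 0xFFFFu;
    uint32_t hi = ((x >> 16) - (y >> 16)) & 0xFFFFu;
    return (hi << 16) | lo;
}

/* Pack two signed 16-bit coefficients into one 32-bit word. */
static uint32_t pack16(int16_t lo, int16_t hi) {
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* Unpack the halves; the casts assume two's complement conversion,
   which holds on all mainstream compilers. */
static int16_t lo16(uint32_t w) { return (int16_t)(uint16_t)(w & 0xFFFFu); }
static int16_t hi16(uint32_t w) { return (int16_t)(uint16_t)(w >> 16); }
```

With `a` holding two packed coefficients and `t` holding the two already-reduced twiddle products, `sadd16_model(a, t)` and `ssub16_model(a, t)` give the two butterfly outputs for both coefficient pairs at once, which is what makes the packed representation attractive on this core.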
“…Algorithm 14 illustrates the detailed instruction sequence of the double CT butterfly, which computes a = (a_top + b […] The follow-up instruction sequence is the same as the previous work [ABCG20]. In summary, we obtain a 7-instruction double CT butterfly for packed arguments, which reduces 2 instructions compared with the one that uses Montgomery multiplication.…”
Section: Butterfly Unit (mentioning)
confidence: 93%
“…However, recent reports [CHK+21, AHKS22] state that using CT butterfly for both NTT and INTT in LBC schemes would also result in faster code. For both strategies in 16-bit NTT/INTT, one needs to compute CT/GS butterfly over two 32-bit packed integers and return two 32-bit packed results [BKS19, ABCG20]
3: smlabb t, t, q, q2^α
4: smlabb b, b, q, q2^α
5: pkhtb t, b, t, asr #16
6: usub16 b, a, t
7: uadd16 a, a, t
8: return a, b…”
Section: Butterfly Unit (mentioning)
confidence: 99%
“…In analogy with FFT, we can also obtain the radix-2^k NTT algorithm. Actually, in [ABCG20] [BKS19] [GOPS13] [CHK+21], the radix-2^k NTT algorithm is realized by merging multiple layers of radix-2 NTT on the resource-constrained micro-controller platforms. This method is leveraged to reduce the cache loading and storing overheads.…”
Section: NTT-based Multiplication Algorithm (mentioning)
confidence: 99%
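The layer merging described in the statement above can be sketched in plain C on a toy example. Everything here is an assumption made for illustration: the length `N = 8`, the modulus `Q = 17`, the twiddle table (powers of 3 mod 17 in bit-reversed order), and the function names are hypothetical and are not taken from the cited implementations, which work in assembly with much larger parameters.

```c
#include <stdint.h>

#define N 8   /* toy transform length (assumption) */
#define Q 17  /* toy modulus (assumption) */

/* Toy twiddles: 3^0, 3^4, 3^2, 3^6 mod 17, i.e. bit-reversed order. */
static const int32_t zetas[4] = {1, 13, 9, 15};

/* Cooley-Tukey butterfly on values in [0, Q). */
static void ct(int32_t *a, int32_t *b, int32_t w) {
    int32_t t = (w * *b) % Q;
    *b = (*a - t + Q) % Q;
    *a = (*a + t) % Q;
}

/* Two layers, one pass each: every coefficient is loaded and
   stored once per layer, so twice in total. */
static void two_layers_separate(int32_t r[N]) {
    for (int j = 0; j < 4; j++) ct(&r[j], &r[j + 4], zetas[1]); /* distance 4 */
    for (int j = 0; j < 2; j++) ct(&r[j], &r[j + 2], zetas[2]); /* distance 2 */
    for (int j = 4; j < 6; j++) ct(&r[j], &r[j + 2], zetas[3]);
}

/* The same two layers merged: four coefficients are loaded once,
   all four butterflies run on local variables, then stored once. */
static void two_layers_merged(int32_t r[N]) {
    for (int j = 0; j < 2; j++) {
        int32_t x0 = r[j], x1 = r[j + 2], x2 = r[j + 4], x3 = r[j + 6];
        ct(&x0, &x2, zetas[1]); /* first layer, distance 4 */
        ct(&x1, &x3, zetas[1]);
        ct(&x0, &x1, zetas[2]); /* second layer, distance 2 */
        ct(&x2, &x3, zetas[3]);
        r[j] = x0; r[j + 2] = x1; r[j + 4] = x2; r[j + 6] = x3;
    }
}
```

Both functions produce the same output for the same input; the merged version halves the number of array loads and stores. On a core like the Cortex-M4 the local variables would typically live in registers, so the saving applies to memory accesses in general, and the number of layers that can be merged is bounded by how many coefficients and twiddles fit in the register file.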