Parallel modular multiplication using 512-bit advanced vector instructions

Buhrow, Benjamin; Gilbert, Barry K.; Haider, Clifton R.

doi:10.1007/s13389-021-00256-9

Cited by 6 publications

(2 citation statements)

References 23 publications

“…The information and latency of field multiplication in both versions are shown in Table 2, which indicates that our Karatsuba-based AVX-512F implementation outperforms the BPS variant in [BGH21]. We herein emphasize on the importance of using an optimal field multiplication in such parallel AVX-512 software of an isogeny-based cryptosystem.…”

Section: Field Multiplication (mentioning)

confidence: 89%

“…Takahashi proposed both AVX-512F and IFMA implementation of 8-way Montgomery multiplication in [Tak20], but this software works on 62-bit and 52-bit operands, respectively, and not in the case of large integers. Buhrow, Gilbert, and Haider in [BGH21] presented a Block Product Scanning (BPS) variant of Montgomery multiplication, which is based on radix-2^32 representation. An 8-way 512-bit BPS variant implemented with AVX-512F takes 189 clock cycles for each instance, which translates to 1512 clock cycles for a whole 8-way implementation.…”

Section: Field Multiplication (mentioning)

confidence: 99%
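
For readers unfamiliar with the operation discussed in the statement above, the following is a minimal, scalar sketch of word-serial Montgomery multiplication in radix 2^32. It is not the Block Product Scanning variant of [BGH21] and is not vectorized; it only illustrates the arithmetic that such implementations compute. The function names (mont_setup, mont_mul) and the example modulus are assumptions made for illustration and do not come from either paper.

```python
W = 32                  # word size in bits (radix 2^32, as in the quoted statement)
MASK = (1 << W) - 1


def mont_setup(n: int, num_words: int):
    """Precompute constants for an odd modulus n that fits in num_words words.

    Returns (-n^{-1} mod 2^32, R mod n, R^2 mod n) with R = 2^(32*num_words).
    """
    assert n % 2 == 1 and n < (1 << (W * num_words))
    n0_inv = (-pow(n, -1, 1 << W)) & MASK   # requires Python 3.8+
    r = 1 << (W * num_words)
    return n0_inv, r % n, (r * r) % n


def mont_mul(a: int, b: int, n: int, n0_inv: int, num_words: int) -> int:
    """Return a * b * R^{-1} mod n for 0 <= a, b < n."""
    t = 0
    for i in range(num_words):
        a_i = (a >> (W * i)) & MASK     # i-th 32-bit word of a
        t += a_i * b                    # accumulate the partial product
        m = (t * n0_inv) & MASK         # multiple of n that clears the low word
        t = (t + m * n) >> W            # exact division by 2^32
    return t - n if t >= n else t       # final conditional subtraction


if __name__ == "__main__":
    # Hypothetical 512-bit odd modulus (16 words); it is not the CSIDH-512 prime.
    n, k = (1 << 511) + 111, 16
    n0_inv, r1, r2 = mont_setup(n, k)
    a, b = pow(3, 300, n), pow(5, 200, n)
    am = mont_mul(a, r2, n, n0_inv, k)          # to Montgomery form
    bm = mont_mul(b, r2, n, n0_inv, k)
    cm = mont_mul(am, bm, n, n0_inv, k)         # product in Montgomery form
    print(mont_mul(cm, 1, n, n0_inv, k) == (a * b) % n)
```

Operands are converted to Montgomery form by multiplying with R^2 mod n and converted back by multiplying with 1. An 8-way implementation of the kind the statement describes would run eight such independent multiplications in the lanes of one 512-bit vector, which is why the quoted per-instance figure of 189 cycles corresponds to 8 × 189 = 1512 cycles for the whole batch.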

Batching CSIDH Group Actions using AVX-512

Cheng, Fotiadis, Großschädl, et al. 2021, TCHES

Commutative Supersingular Isogeny Diffie-Hellman (or CSIDH for short) is a recently-proposed post-quantum key establishment scheme that belongs to the family of isogeny-based cryptosystems. The CSIDH protocol is based on the action of an ideal class group on a set of supersingular elliptic curves and comes with some very attractive features, e.g. the ability to serve as a “drop-in” replacement for the standard elliptic curve Diffie-Hellman protocol. Unfortunately, the execution time of CSIDH is prohibitively high for many real-world applications, mainly due to the enormous computational cost of the underlying group action. Consequently, there is a strong demand for optimizations that increase the efficiency of the class group action evaluation, which is not only important for CSIDH, but also for related cryptosystems like the signature schemes CSI-FiSh and SeaSign. In this paper, we explore how the AVX-512 vector extensions (incl. AVX-512F and AVX-512IFMA) can be utilized to optimize constant-time evaluation of the CSIDH-512 class group action with the goal of, respectively, maximizing throughput and minimizing latency. We introduce different approaches for batching group actions and computing them in SIMD fashion on modern Intel processors. In particular, we present a hybrid batching technique that, when combined with optimized (8 × 1)-way prime-field arithmetic, increases the throughput by a factor of 3.64 compared to a state-of-the-art (non-vectorized) x64 implementation. On the other hand, vectorization in a 2-way fashion aimed to reduce latency makes our AVX-512 implementation of the group action evaluation about 1.54 times faster than the state-of-the-art. To the best of our knowledge, this paper is the first to demonstrate the high potential of using vector instructions to increase the throughput (resp. decrease the latency) of constant-time CSIDH.
