High-performance Implementation of Elliptic Curve Cryptography Using Vector Instructions

Faz-Hernández, Armando; López, Julio; Dahab, Ricardo

doi:10.1145/3309759

Cited by 28 publications

(35 citation statements)

References 28 publications

“…squaring) multiplies two pairs of 25 or 26-bit limbs in parallel, whereby two limbs belonging to one operand are stored in a 128-bit lane of an AVX2 register. In a recent follow-up work, Faz-Hernández et al [8] presented fast 2-way and 4-way implementations of the field-arithmetic and point operations using both the Montgomery model and the Edwards model of Curve25519. There are various other studies exploring the optimization of ECC for different vector instruction sets, such as Intel SSE2, Intel AVX-512, and ARM NEON, see e.g.…”

Section: Overview Of Related Work and Motivation For Our Work (mentioning)

confidence: 99%

“…While such a canonical radix-2^n representation of integers has the advantage that the total number of words k = ⌈m/n⌉ is minimal for the target platform, it entails a lot of carry propagation and, as a consequence, sub-optimal performance on modern 64-bit processors [1,7]. Fortunately, it is possible to avoid most of the carry propagations by using a reduced-radix representation (also referred to as redundant representation [8]), which means the number of bits per limb n' is slightly less than the bitlength n of the processor's registers, e.g. n' = 51 when implementing Curve25519 for a 64-bit processor.…”

Section: Preliminaries (mentioning)

confidence: 99%

“…Although a reduced-radix representation may increase the number of limbs k' = ⌈m/n'⌉ versus the full-radix setting (i.e. k' > k), there is typically still a net-gain in performance when taking advantage of "lazy carrying" and "lazy reduction" [8]. We will use uppercase letters to denote field elements and indexed lowercase letters for the individual limbs they consist of.…”

Section: Preliminaries (mentioning)

confidence: 99%
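
The two Preliminaries excerpts above describe a reduced-radix representation with lazy carrying. The following is a minimal Python sketch of that idea, assuming 51 bits per limb and five limbs for the Curve25519 prime. The helper names (`to_limbs`, `from_limbs`, `add_lazy`, `carry`) are hypothetical and are not taken from any of the cited papers. Python integers are unbounded, so the 64-bit word size is only modelled by checking limb sizes.

```python
P = 2**255 - 19            # Curve25519 field prime
LIMB_BITS = 51             # bits per limb, below the 64-bit register width
NUM_LIMBS = 5              # ceil(255 / 51)
MASK = (1 << LIMB_BITS) - 1


def to_limbs(x):
    """Split an integer 0 <= x < 2^255 into five 51-bit limbs (little-endian)."""
    return [(x >> (LIMB_BITS * i)) & MASK for i in range(NUM_LIMBS)]


def from_limbs(limbs):
    """Recombine limbs (which may exceed 51 bits) into an integer mod P."""
    return sum(l << (LIMB_BITS * i) for i, l in enumerate(limbs)) % P


def add_lazy(a, b):
    """Limb-wise addition with no carry propagation (lazy carrying)."""
    return [x + y for x, y in zip(a, b)]


def carry(limbs):
    """One carry pass: move each limb's overflow into the next limb.

    The carry out of the top limb wraps to limb 0 multiplied by 19,
    because 2^255 is congruent to 19 modulo P.
    """
    out = list(limbs)
    for i in range(NUM_LIMBS - 1):
        out[i + 1] += out[i] >> LIMB_BITS
        out[i] &= MASK
    top = out[-1] >> LIMB_BITS
    out[-1] &= MASK
    out[0] += 19 * top
    return out
```

With 51-bit limbs in 64-bit words there are 13 spare bits per limb, so several additions can be chained through `add_lazy` before a single `carry` pass is needed. That deferral is the saving the excerpt refers to.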

Parallel Implementation of SM2 Elliptic Curve Cryptography on Intel Processors with AVX2

Huang, Liu, et al. 2020

Information Security and Privacy

This paper presents an efficient and secure implementation of SM2, the Chinese elliptic curve cryptography standard that has been adopted by the International Organization for Standardization (ISO) as ISO/IEC 14888-3:2018. Our SM2 implementation uses Intel's Advanced Vector Extensions version 2.0 (AVX2), a family of three-operand SIMD instructions operating on vectors of 8, 16, 32, or 64-bit data elements in 256-bit registers, and is resistant to timing attacks. To exploit the parallel processing capabilities of AVX2, we studied the execution flows of Co-Z Jacobian point arithmetic operations and introduce a parallel 2-way Co-Z addition, Co-Z conjugate addition, and Co-Z ladder algorithm, which allow for fast Co-Z scalar multiplication. Furthermore, we developed an efficient 2-way prime-field arithmetic library using AVX2 to support our Co-Z Jacobian point operations. Both the field and the point operations utilize branch-free (i.e. constant-time) implementation techniques, which increase their ability to resist Simple Power Analysis (SPA) and timing attacks. Our software for scalar multiplication on the SM2 curve is, to our knowledge, the first constant-time implementation of the Co-Z based ladder that leverages the parallelism of AVX2.
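
The abstract does not say which branch-free techniques are used. As an illustration only, a common idiom in ladder-based scalar multiplication is a mask-based conditional swap, sketched below in Python. The function name `cswap` and the 64-bit mask width are assumptions made for this example. Python itself gives no timing guarantees, so the sketch shows the data flow and is not a hardened implementation.

```python
MASK64 = (1 << 64) - 1


def cswap(bit, a, b):
    """Swap limb lists a and b if bit == 1, without branching on bit.

    The secret bit is turned into an all-zeros or all-ones mask, and the
    same XOR sequence is executed in both cases.
    """
    mask = (-bit) & MASK64          # 0 if bit == 0, 2^64 - 1 if bit == 1
    out_a, out_b = [], []
    for x, y in zip(a, b):
        t = mask & (x ^ y)          # x ^ y if swapping, else 0
        out_a.append(x ^ t)
        out_b.append(y ^ t)
    return out_a, out_b
```

For example, `cswap(1, [1, 2], [3, 4])` returns the two lists exchanged, while `cswap(0, [1, 2], [3, 4])` returns them unchanged. Both calls perform the same sequence of operations.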

“…For example, 2-way and 4-way parallel implementations of the point addition and point doubling were presented in e.g. [3,5,7] and [7], respectively; these execute either two or four field-arithmetic operations in parallel. Finally, there exist also implementations that combine parallelism at the field-arithmetic and point-arithmetic layer, which we characterize as (n×m)-way parallel implementations: they perform n field operations in parallel, whereby each field operation is executed in an m-way parallel fashion and uses m elements of a vector.…”

Section: Introduction (mentioning)

confidence: 99%

“…Finally, there exist also implementations that combine parallelism at the field-arithmetic and point-arithmetic layer, which we characterize as (n×m)-way parallel implementations: they perform n field operations in parallel, whereby each field operation is executed in an m-way parallel fashion and uses m elements of a vector. For example, Faz-Hernández et al describe in [7] a (2 × 2)-way parallel AVX2 implementation of variable-base scalar multiplication on Curve25519 that executes in 121,000 Haswell cycles or 99,400 Skylake cycles. More recently, Hisil et al [12] presented an AVX512 implementation of Curve25519 that is (4 × 2)-way parallelized (i.e.…”

Section: Introduction (mentioning)

confidence: 99%

High-Throughput Elliptic Curve Cryptography Using AVX2 Vector Instructions

Cheng, Großschädl, Tian, et al. 2021

Selected Areas in Cryptography

Single-Instruction-Multiple-Data (SIMD) extensions like Intel's AVX2 offer a great potential to accelerate elliptic curve cryptography compared to a straightforward implementation using only base x64 instructions. All existing AVX2 implementations of scalar multiplication on Curve25519 and alternative elliptic curves are optimized for low latency. We argue in this paper that many applications, most notably server-side TLS handshake processing, would benefit more from throughput-optimized implementations than latency-optimized ones. To support this argument we introduce throughput-optimized AVX2 implementations of variable-base scalar multiplication on Curve25519 and fixed-base scalar multiplication on Ed25519. Both implementations perform four scalar multiplications in parallel, whereby each scalar multiplication uses a 64-bit element of a 256-bit AVX2 vector. The field arithmetic is based on a radix-2^29 representation of the field elements, which makes it possible to execute four parallel multiplications modulo a multiple of p = 2^255 − 19 in just 88 Skylake cycles. Four variable-base scalar multiplications on Curve25519 require less than 250,000 Skylake cycles, which translates into a throughput of 32,318 scalar multiplications per second at a clock frequency of 2 GHz. For comparison, the currently best latency-optimized AVX2 implementation reaches a throughput of only about 21,000 scalar multiplications per second on the same Skylake processor.
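
The throughput figures in this abstract follow from simple arithmetic on cycle counts, batch size, and clock frequency. The Python sketch below reproduces that relationship. The function names are hypothetical, and the only inputs are numbers stated in the abstract (four operations per batch, fewer than 250,000 cycles, 2 GHz, 32,318 operations per second).

```python
def throughput_per_second(cycles_per_batch, batch_size, clock_hz):
    """Operations per second when one batch of batch_size operations
    takes cycles_per_batch clock cycles at clock_hz."""
    return batch_size * clock_hz / cycles_per_batch


def implied_cycles_per_batch(ops_per_second, batch_size, clock_hz):
    """Invert the relation: cycles one batch must take to reach the
    stated throughput."""
    return batch_size * clock_hz / ops_per_second


# The 250,000-cycle bound gives a floor on throughput at 2 GHz.
floor = throughput_per_second(250_000, 4, 2e9)   # 32000.0

# The reported 32,318 operations per second implies a batch cost
# slightly under the 250,000-cycle bound.
implied = implied_cycles_per_batch(32_318, 4, 2e9)
```

The implied batch cost comes out a little below 250,000 cycles, which agrees with the "less than 250,000" wording in the abstract.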

SECCEG: A Secure and Efficient Cryptographic Co-processor Based on Embedded GPU System

Guang, Zheng, Dong, et al. 2021

Wireless Algorithms, Systems, and Applications
