Montgomery Multiplication on the Cell

Bos, Joppe W.; Kaihara, Marcelo E.

doi:10.1007/978-3-642-14390-8_50

Cited by 10 publications

(15 citation statements)

References 10 publications

“…In Figure 1, we designed multi-precision multiplication for SIMD architecture. Taking the 32-bit word with 256-bit multiplication as an example, our method works as follows. Firstly, we re-organized operands by conducting transpose operation, which can efficiently shuffle inner vector by 32-bit wise.…”

Section: Cascade Operand Scanning Multiplication for SIMD (mentioning)

confidence: 99%

“…Firstly, we re-organized operands by conducting transpose operation, which can efficiently shuffle inner vector by 32-bit wise. Instead of a normal order ((B[0], B[1]), (B[2], B[3]), (B[4], B[5]), (B[6], B[7])), we actually classify the operand as groups ((B[0], B[4]), (B[2], B[6]), (B[1], B[5]), (B[3], B[7])) for computing multiplication where each operand ranges from 0 to 2^32 − 1 (0xffff ffff in hexadecimal form). Secondly, multiplication [7])) where the results are located from 0 to 2^64 − 2^33 + 1 (0xffff fffe 0000 0001).…”

Section: Cascade Operand Scanning Multiplication for SIMD (mentioning)

confidence: 99%
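
The regrouping and the value ranges quoted above can be checked with a few lines of Python. This is only an illustrative sketch: `regroup` and `MASK32` are hypothetical names, and plain Python integers stand in for the 32-bit vector lanes that the cited paper manipulates with SIMD transpose instructions.

```python
# Illustrative sketch only; names are hypothetical, not taken from the cited paper.
MASK32 = 0xFFFFFFFF  # largest 32-bit word, 2^32 - 1

def regroup(words):
    """Pair eight 32-bit words as (0,4), (2,6), (1,5), (3,7)."""
    assert len(words) == 8 and all(0 <= w <= MASK32 for w in words)
    order = [(0, 4), (2, 6), (1, 5), (3, 7)]
    return [(words[i], words[j]) for i, j in order]

# Using the indices themselves as stand-ins for B[0]..B[7].
print(regroup(list(range(8))))   # [(0, 4), (2, 6), (1, 5), (3, 7)]

# The largest 32x32-bit product is (2^32 - 1)^2 = 2^64 - 2^33 + 1.
print(hex(MASK32 * MASK32))      # 0xfffffffe00000001
```

Because the largest product leaves 2^33 − 2 of headroom below 2^64, two further 32-bit values can be added to a 64-bit product without overflow.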

“…Various implementations, including [10], adopt a reduced-radix representation with 29 bits per word for a better handling of the carry propagation. In [5], vector instructions on the CELL microprocessor are used to perform multiplication on operands represented with a radix of 2^16. More recently, Gueron et al. [9] described an implementation for the new AVX2 SIMD platform (Intel Haswell architecture) that uses 256-bit wide vector instructions and a reduced-radix representation for faster accumulation of partial products.…”

Section: Introduction (mentioning)

confidence: 99%
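
To see why a reduced radix eases carry handling, the following sketch stores operands in 29-bit limbs and defers every carry until all partial products have been accumulated. It is a generic illustration under stated assumptions, not the code of any cited implementation: the helper names are hypothetical, and Python integers model 64-bit accumulators (the assertion checks that they would indeed fit).

```python
# Illustrative sketch of a reduced-radix (29-bit limb) product with deferred carries.
LIMB_BITS = 29
LIMB_MASK = (1 << LIMB_BITS) - 1

def to_limbs(x, n):
    """Split a non-negative integer into n little-endian 29-bit limbs."""
    assert x >> (LIMB_BITS * n) == 0, "value does not fit in n limbs"
    return [(x >> (LIMB_BITS * i)) & LIMB_MASK for i in range(n)]

def from_limbs(limbs):
    """Recombine little-endian 29-bit limbs into an integer."""
    return sum(l << (LIMB_BITS * i) for i, l in enumerate(limbs))

def mul_lazy_carry(a, b):
    """Schoolbook product; carries are propagated once, at the very end."""
    n = len(a)
    acc = [0] * (2 * n)
    for i in range(n):
        for j in range(n):
            acc[i + j] += a[i] * b[j]      # each product is below 2^58
    # Up to 64 such products fit in one 64-bit accumulator (64 * 2^58 = 2^64).
    assert all(c < (1 << 64) for c in acc)
    out, carry = [], 0
    for c in acc:                          # single carry-propagation pass
        c += carry
        out.append(c & LIMB_MASK)
        carry = c >> LIMB_BITS
    assert carry == 0
    return out

x, y = 2**255 - 19, 2**256 - 189           # arbitrary test values
assert from_limbs(mul_lazy_carry(to_limbs(x, 9), to_limbs(y, 9))) == x * y
```

The trade-off is that a 256-bit operand needs nine 29-bit limbs instead of eight 32-bit words, so more partial products are computed in exchange for carry-free accumulation.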

Montgomery Modular Multiplication on ARM-NEON Revisited

Seo, Liu, Großschädl, et al. 2015

Information Security and Cryptology - ICISC 2014

Abstract. Montgomery modular multiplication constitutes the "arithmetic foundation" of modern public-key cryptography with applications ranging from RSA, DSA and Diffie-Hellman over elliptic curve schemes to pairing-based cryptosystems. The increased prevalence of SIMD-type instructions in commodity processors (e.g. Intel SSE, ARM NEON) has initiated a massive body of research on vector-parallel implementations of Montgomery modular multiplication. In this paper, we introduce the Cascade Operand Scanning (COS) method to speed up multi-precision multiplication on SIMD architectures. We developed the COS technique with the goal of reducing Read-After-Write (RAW) dependencies in the propagation of carries, which also reduces the number of pipeline stalls (i.e. bubbles). The COS method operates on 32-bit words in a row-wise fashion (similar to the operand-scanning method) and does not require a "non-canonical" representation of operands with a reduced radix. We show that two COS computations can be "coarsely" integrated into an efficient vectorized variant of Montgomery multiplication, which we call the Coarsely Integrated Cascade Operand Scanning (CICOS) method. Due to our sophisticated instruction scheduling, the CICOS method reaches record-setting execution times for Montgomery modular multiplication on ARM-NEON platforms. Detailed benchmarking results obtained on ARM Cortex-A9 and Cortex-A15 processors show that the proposed CICOS method outperforms Bos et al.'s implementation from SAC 2013 by up to 57% (A9) and 40% (A15), respectively.
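
For orientation, here is a minimal sketch of the plain row-wise operand-scanning multiplication on 32-bit words that the abstract uses as its point of comparison. It is the textbook baseline, not the COS method itself, and the function names are hypothetical. The comment marks the carry chain whose read-after-write dependency the abstract says COS is designed to reduce.

```python
# Textbook operand-scanning multiplication on 32-bit words (not the COS method).
W = 32
MASK = (1 << W) - 1

def to_words(x, n):
    """Split a non-negative integer into n little-endian 32-bit words."""
    return [(x >> (W * i)) & MASK for i in range(n)]

def from_words(words):
    """Recombine little-endian 32-bit words into an integer."""
    return sum(w << (W * i) for i, w in enumerate(words))

def operand_scanning_mul(a, b):
    """Multiply two n-word operands row by row, one word of a per row."""
    n = len(a)
    r = [0] * (2 * n)
    for i in range(n):
        carry = 0
        for j in range(n):
            t = r[i + j] + a[i] * b[j] + carry   # at most 2^64 - 1
            r[i + j] = t & MASK
            carry = t >> W    # needed by the very next iteration (RAW dependency)
        r[i + n] = carry
    return r

x, y = 2**256 - 189, 2**255 - 19                  # arbitrary 256-bit test values
assert from_words(operand_scanning_mul(to_words(x, 8), to_words(y, 8))) == x * y
```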

“…A parallel software approach describing systolic (a specific arrangement of processing units used in parallel computations) Montgomery multiplication is described in [10,23]. An approach using the vector instructions on the Cell microprocessor is considered in [8]. Exploiting much larger parallelism with the single instruction, multiple threads paradigm is realized by using a residue number system [14,29] as described in [4].…”

Section: Related Work (mentioning)

confidence: 99%

“…The research community has studied ways to reduce the latency of Montgomery multiplication by parallelizing this computation. These approaches vary from using the SIMD paradigm [8,10,18,23] to the single instruction, multiple threads paradigm using a residue number system [14,29] as described in [4,19] (see Sect. 2.3 for a more detailed overview).…”

Section: Introduction (mentioning)

confidence: 99%

Montgomery Multiplication Using Vector Instructions

Bos, Montgomery, Shumow, et al. 2014

Selected Areas in Cryptography -- SAC 2013

Self Cite

Abstract. In this paper we present a parallel approach to compute interleaved Montgomery multiplication. This approach is particularly suitable to be computed on 2-way single instruction, multiple data platforms as can be found on most modern computer architectures in the form of vector instruction set extensions. We have implemented this approach for tablet devices which run the x86 architecture (Intel Atom Z2760) using SSE2 instructions as well as devices which run on the ARM platform (Qualcomm MSM8960, NVIDIA Tegra 3 and 4) using NEON instructions. When instantiating modular exponentiation with this parallel version of Montgomery multiplication we observed a performance increase of more than a factor of 1.5 compared to the sequential implementation in OpenSSL for the classical arithmetic logic unit on the Atom platform for 2048-bit moduli.
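
As background for the abstract above, the sketch below shows the sequential, textbook form of interleaved Montgomery multiplication in radix 2^32, in which each product row is followed immediately by one reduction step. It is not the 2-way parallel variant the paper proposes; `montgomery_mul` and its parameters are hypothetical names, and Python integers replace the word arrays a real implementation would use.

```python
# Textbook interleaved Montgomery multiplication, radix 2^32 (sequential sketch).
W = 32
BASE = 1 << W

def montgomery_mul(a, b, m, n):
    """Return a * b * 2^(-32n) mod m, for odd m < 2^(32n) and 0 <= a, b < m."""
    mu = (-pow(m, -1, BASE)) % BASE        # -m^(-1) mod 2^32 (Python 3.8+)
    c = 0
    for i in range(n):
        a_i = (a >> (W * i)) & (BASE - 1)  # i-th 32-bit word of a
        c += a_i * b                       # add one row of the product
        q = (mu * (c & (BASE - 1))) % BASE # makes c + q*m divisible by 2^32
        c = (c + q * m) >> W               # exact division by 2^32
    return c - m if c >= m else c          # c < 2m, so one subtraction suffices

m, n = 2**255 - 19, 8                      # arbitrary odd modulus, 8 words
R = 1 << (W * n)
a, b = 123456789 % m, (2**200 + 987654321) % m
assert montgomery_mul(a, b, m, n) == a * b * pow(R, -1, m) % m
```

Operands are usually kept in Montgomery form (x·R mod m) so that the factor R^(-1) introduced by each multiplication cancels.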

Efficient arithmetic on ARM‐NEON and its application for high‐speed RSA implementation

Seo, Liu, Großschädl, et al. 2016

Security Comm Networks

Advanced modern processors support single instruction, multiple data instructions (e.g., Intel‐AVX and ARM‐NEON), and a massive body of research has been conducted on vector‐parallel implementations of modular arithmetic, a crucial component of modern public‐key cryptography ranging from Rivest, Shamir, and Adleman (RSA), ElGamal, and the Digital Signature Algorithm (DSA) to elliptic curve cryptography. In this paper, we introduce a novel double operand scanning method to speed up multi‐precision squaring with non‐redundant representations on single instruction, multiple data architectures, where part of the operands is doubled to compute the squaring operation without read‐after‐write dependencies between source and destination variables. Afterwards, the Karatsuba algorithm is applied to both multiplication and squaring operations. For modular multiplication, the separated Montgomery algorithm is chosen. Finally, the Rivest, Shamir, and Adleman (RSA) implementations outperform the best‐known results on the ARM‐NEON platforms. Copyright © 2017 John Wiley & Sons, Ltd.
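
The "separated" Montgomery approach mentioned in the abstract computes the full double-length product first and reduces it afterwards. A minimal textbook sketch follows, assuming 32-bit words; the function names are hypothetical, and Python's built-in multiplication stands in for the Karatsuba-based multiplication and squaring routines the paper describes.

```python
# Textbook separated Montgomery multiplication: multiply first, then reduce.
def montgomery_reduce(t, m, n, w=32):
    """Return t * 2^(-w*n) mod m, for odd m < 2^(w*n) and 0 <= t < m * 2^(w*n)."""
    base = 1 << w
    mu = (-pow(m, -1, base)) % base            # -m^(-1) mod 2^w (Python 3.8+)
    for i in range(n):
        word = (t >> (w * i)) & (base - 1)     # current lowest non-zero word
        q = (mu * word) % base
        t += (q * m) << (w * i)                # clears word i of t
    t >>= w * n                                # low n words are now zero
    return t - m if t >= m else t

def separated_montgomery_mul(a, b, m, n):
    """Full product (any multiplication routine), followed by one reduction."""
    return montgomery_reduce(a * b, m, n)

m, n = 2**255 - 19, 8                          # arbitrary odd modulus, 8 words
R = 1 << (32 * n)
a, b = 2**254 - 3, 2**131 + 7
assert separated_montgomery_mul(a, b, m, n) == a * b * pow(R, -1, m) % m
```

Keeping multiplication and reduction apart is what lets a faster product routine, such as Karatsuba or a dedicated squaring, be swapped in without touching the reduction.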
