Efficient Acceleration of Asymmetric Cryptography on Graphics Hardware

Harrison, Owen; Waldron, John

doi:10.1007/978-3-642-02384-2_22

Cited by 56 publications

(29 citation statements)

References 13 publications


“…Exploiting much larger parallelism using the single instruction multiple threads paradigm, is realized by using a residue number system [14,29] as described in [4]. This approach is implemented for the massively parallel graphics processing units in [19]. An approach based on Montgomery multiplication which allows one to split the operand into two parts, which can be processed in parallel, is called bipartite modular multiplication and is introduced in [24].…”

Section: Related Work (mentioning)

confidence: 99%
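
The residue number system (RNS) idea referred to in the statement above can be illustrated with a small sketch. This is not the implementation from any of the cited papers: the moduli below are hypothetical toy values, and the helper names `to_rns`, `rns_mul`, and `from_rns` are made up for this example. It only shows why the approach parallelizes well: each residue channel is multiplied independently, and the integer result is recovered afterwards with the Chinese Remainder Theorem.

```python
from math import prod

# Hypothetical toy moduli, pairwise coprime (251 is prime, 253 = 11*23,
# 255 = 3*5*17, 256 = 2**8). Real implementations use many word-sized moduli.
MODULI = (251, 253, 255, 256)


def to_rns(x, moduli=MODULI):
    """Represent x by its residues modulo each channel modulus."""
    return tuple(x % m for m in moduli)


def rns_mul(a, b, moduli=MODULI):
    """Multiply channel by channel; no carries cross between channels,
    so each channel could be handled by its own thread."""
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, moduli))


def from_rns(residues, moduli=MODULI):
    """Reconstruct the integer in [0, prod(moduli)) via the CRT."""
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        m_i = big_m // m
        total += r * m_i * pow(m_i, -1, m)  # pow(..., -1, m) needs Python 3.8+
    return total % big_m


x, y = 12345, 6789
product = from_rns(rns_mul(to_rns(x), to_rns(y)))
print(product == x * y)
```

The reconstruction is exact only while the true product stays below the product of the moduli, which is why practical RNS schemes size the moduli set to the operand length.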

“…The research community has studied ways to reduce the latency of Montgomery multiplication by parallelizing this computation. These approaches vary from using the SIMD paradigm [8,10,18,23] to the single instruction, multiple threads paradigm using a residue number system [14,29] as described in [4,19] (see Sect. 2.3 for a more detailed overview).…”

Section: Introduction (mentioning)

confidence: 99%


Montgomery Multiplication Using Vector Instructions

Bos, Montgomery, Shumow, et al. 2014

Selected Areas in Cryptography -- SAC 2013


Abstract. In this paper we present a parallel approach to compute interleaved Montgomery multiplication. This approach is particularly suitable to be computed on 2-way single instruction, multiple data platforms as can be found on most modern computer architectures in the form of vector instruction set extensions. We have implemented this approach for tablet devices which run the x86 architecture (Intel Atom Z2760) using SSE2 instructions as well as devices which run on the ARM platform (Qualcomm MSM8960, NVIDIA Tegra 3 and 4) using NEON instructions. When instantiating modular exponentiation with this parallel version of Montgomery multiplication we observed a performance increase of more than a factor of 1.5 compared to the sequential implementation in OpenSSL for the classical arithmetic logic unit on the Atom platform for 2048-bit moduli.
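
For readers unfamiliar with the underlying operation, the following is a minimal sketch of plain, sequential Montgomery multiplication using Python integers. It is not the interleaved 2-way SIMD variant the abstract describes, and the function names (`montgomery_setup`, `redc`, `mod_mul`) and the choice of R = 2**bits are assumptions made for illustration.

```python
def montgomery_setup(n, bits):
    """Return n' with n * n' ≡ -1 (mod R), where R = 2**bits.
    Assumes n is odd and n < R."""
    r = 1 << bits
    return (-pow(n, -1, r)) % r  # modular inverse via pow needs Python 3.8+


def redc(t, n, bits, n_prime):
    """Montgomery reduction: return t * R^-1 mod n for 0 <= t < R * n."""
    mask = (1 << bits) - 1
    m = ((t & mask) * n_prime) & mask  # m = (t mod R) * n' mod R
    u = (t + m * n) >> bits            # t + m*n is divisible by R
    return u - n if u >= n else u      # single conditional subtraction


def mod_mul(a, b, n, bits):
    """Compute a * b mod n by way of the Montgomery domain."""
    n_prime = montgomery_setup(n, bits)
    r = 1 << bits
    a_bar = (a * r) % n                # convert operands into Montgomery form
    b_bar = (b * r) % n
    c_bar = redc(a_bar * b_bar, n, bits, n_prime)
    return redc(c_bar, n, bits, n_prime)  # convert the result back


n = (1 << 61) - 1  # an odd modulus chosen only for this example
print(mod_mul(123456789, 987654321, n, 64) == (123456789 * 987654321) % n)
```

In a modular exponentiation the conversions into and out of Montgomery form happen once, so the per-multiplication cost is dominated by `redc`, which replaces division by the modulus with shifts and masks.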


“…However, the latter instruction is not exposed by CUDA API. To overcome this limitation, the authors of [10] propose to use slow 32-bit multiplication, while the tests from [11] show that 12-bit arithmetic is faster because modular reduction can be done in floating-point without overflow concerns.…”

Section: -Bit Modular Arithmetic on the GPU (mentioning)

confidence: 99%

“…As of now, the research is carried out to port the remaining algorithm stages (polynomial interpolation and the CRA) to the GPU. Modular computations still constitute a big challenge on the GPU, see [10,11]. Our algorithm uses the fast modular arithmetic developed in [1] which is based on mixing floating-point and integer computations, and is supported by the modified CUDA [12] compiler.…”

mentioning

confidence: 99%

Modular Resultant Algorithm for Graphics Processors

Emeliyanenko 2010

Algorithms and Architectures for Parallel Processing


Abstract. In this paper we report on the recent progress in computing bivariate polynomial resultants on Graphics Processing Units (GPU). Given two polynomials in Z[x, y], our algorithm first maps the polynomials to a prime field. Then, each modular image is processed individually. The GPU evaluates the polynomials at a number of points and computes univariate modular resultants in parallel. The remaining "combine" stage of the algorithm is executed sequentially on the host machine. Porting this stage to the graphics hardware is an object of ongoing research. Our algorithm is based on an efficient modular arithmetic from [1]. With the theory of displacement structure we have been able to parallelize the resultant algorithm up to a very fine scale suitable for realization on the GPU. Our benchmarks show a substantial speed-up over a host-based resultant algorithm [2] from CGAL (www.cgal.org).

Keywords: polynomial resultants, modular algorithm, parallel computations, graphics hardware, GPU, CUDA.

Overview. Polynomial resultants play an important role in quantifier elimination theory. They have a broad range of applications, including but not limited to the topological study of algebraic curves, curve implicitization, and geometric modelling. The original modular resultant algorithm was introduced by Collins [3]. It exploits the "divide-conquer-combine" strategy: two polynomials are reduced modulo sufficiently many primes and mapped to homomorphic images by evaluating them at certain points. Then, a set of univariate resultants is computed independently for each prime, and the result is reconstructed by means of polynomial interpolation and the Chinese Remainder Algorithm (CRA). A number of parallel algorithms have been developed following this idea: those specialized for workstation networks [4] and shared memory machines [5,6]. In essence, they differ in how the "combine" stage of the algorithm (polynomial interpolation) is realized.

Unfortunately, these algorithms employ polynomial remainder sequences [7] (PRS) to compute univariate resultants. The PRS algorithm, though asymptotically quite fast, is sequential in nature. As a result, Collins' algorithm in its original form admits only a coarse-grained parallelization, which is suitable for traditional parallel platforms but not for systems with a massively threaded architecture such as GPUs (Graphics Processing Units).


“…Cryptologic applications of GPUs have been considered before: symmetric cryptography in [33,20,56,21,44,11,18], asymmetric cryptography in [39,54,22] for RSA and in [54,1,9] for ECC, and enhancing symmetric [8] and asymmetric [7,5,6,10] cryptanalysis.…”

Section: Introduction (mentioning)

confidence: 99%

Cofactorization on Graphics Processing Units

Miele, Bos, Kleinjung, et al. 2014

Advanced Information Systems Engineering


Abstract. We show how the cofactorization step, a compute-intensive part of the relation collection phase of the number field sieve (NFS), can be farmed out to a graphics processing unit. Our implementation on a GTX 580 GPU, which is integrated with a state-of-the-art NFS implementation, can serve as a cryptanalytic co-processor for several Intel i7-3770K quad-core CPUs simultaneously. This allows those processors to focus on the memory-intensive sieving and results in more useful NFS-relations found in less time.

