Modular SIMD arithmetic in Mathemagix

Hoeven, Joris van der; Lecerf, Grégoire; Quintin, Guillaume

doi:10.1145/2876503

Cited by 6 publications

(3 citation statements)

References 36 publications


“…When we turn pseudo-reductions into reductions, we are dealing with integers that fit inside one machine-word. Therefore we have optimized our implementation by coding an SIMD version of the Barrett reduction [Barrett 1986] as explained in [Hoeven et al 2016]. Finally, to minimize cache misses we store modular matrices contiguously to each other, as ((A rem m_1), (A rem m_2), .…”

Section: Methods (mentioning)

confidence: 99%
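
The Barrett reduction mentioned in this statement can be sketched as follows. This is a minimal scalar illustration, not the vectorized implementation of the cited papers: the function names `barrett_setup` and `barrett_reduce` and the word size `k = 64` are assumptions made for the example, and Python integers stand in for machine words.

```python
def barrett_setup(m: int, k: int = 64) -> int:
    """Precompute mu = floor(2^k / m) once per modulus m (hypothetical helper)."""
    return (1 << k) // m


def barrett_reduce(x: int, m: int, mu: int, k: int = 64) -> int:
    """Return x mod m for 0 <= x < 2^k without a division by m.

    q underestimates floor(x / m) by at most 1, so a single
    conditional subtraction is enough to finish the reduction.
    """
    q = (x * mu) >> k      # high part of the product: quotient estimate
    r = x - q * m          # candidate remainder, lies in [0, 2m)
    if r >= m:             # at most one correction step
        r -= m
    return r


if __name__ == "__main__":
    m = 2147483629         # an arbitrary modulus below 2^31
    mu = barrett_setup(m)
    for x in (0, 1, m - 1, m, m + 1, 2**63 + 12345, 2**64 - 1):
        assert barrett_reduce(x, m, mu) == x % m
```

In a SIMD setting the same three steps (high product, multiply-subtract, conditional correction) would be applied lane by lane, with the comparison typically replaced by a branch-free select.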

“…Note that already slightly larger moduli will allow one to substantially increase the possible bitsize of coefficients: with our forthcoming technique, we will be able to multiply polynomials up to degree 2^22 and of coefficient bitsize 2^20 using primes of 42-bits. Note that this is particularly interesting since FFT performances are almost not penalized when one uses primes up to 53 bits instead of primes of 32 bits as demonstrated in [Hoeven et al 2016].…”

Section: A:15 (mentioning)

confidence: 99%


Simultaneous Conversions with the Residue Number System Using Linear Algebra

Doliskani, Giorgi, Lebreton et al., 2018

ACM Trans. Math. Softw.


We present an algorithm for simultaneous conversion between a given set of integers and their Residue Number System representations based on linear algebra. We provide a highly optimized implementation of the algorithm that exploits the computational features of modern processors. The main application of our algorithm is matrix multiplication over integers. Our speed-up of the conversions to and from the Residue Number System significantly improves the overall running time of matrix multiplication.
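
To make the notion of Residue Number System conversion concrete, here is a minimal sketch of the two directions using the textbook Chinese Remainder Theorem. It is deliberately naive and is not the linear-algebra algorithm of the paper; the names `to_rns` and `from_rns` and the small moduli are assumptions chosen for illustration.

```python
from math import prod


def to_rns(x: int, moduli: list[int]) -> list[int]:
    """Convert an integer to its residues modulo each m_i."""
    return [x % m for m in moduli]


def from_rns(residues: list[int], moduli: list[int]) -> int:
    """Rebuild the integer in [0, M) with M = prod(moduli), via the CRT.

    Assumes the moduli are pairwise coprime.
    """
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m                        # product of the other moduli
        total += r * Mi * pow(Mi, -1, m)   # pow(..., -1, m): inverse mod m
    return total % M


if __name__ == "__main__":
    moduli = [251, 241, 239]               # pairwise coprime primes
    a, b = 1234, 5678
    ra, rb = to_rns(a, moduli), to_rns(b, moduli)
    # Multiplication is done independently in each residue channel.
    rc = [(u * v) % m for u, v, m in zip(ra, rb, moduli)]
    assert from_rns(rc, moduli) == a * b   # valid because a * b < 251*241*239
```

Integer matrix multiplication follows the same pattern: every entry is converted, the products are computed modulo each m_i, and the results are converted back, which is why the conversion cost matters for the overall running time.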


“…Whilst there has been extensive research on the optimization of modular multiplications for GPUs [17‐20], efforts to optimize modular operations for SIMD instruction sets have been somewhat more scarce and focused on the x86 architecture. Examples of these efforts include optimizations for several x86 CPUs supporting SSE2 (Streaming SIMD Extensions 2) [17], efficient implementations of modular operations for the SSE and AVX instruction sets (in particular SSE4.2 and AVX2 for the Barrett and Montgomery methods) [21], and the implementation of an efficient modular multiplication algorithm for AVX‐512 [22]. To the best of our knowledge, there has been no similar work concerning the implementation of efficient modular arithmetic methods for Arm SVE.…”

Section: Related Work (mentioning)

confidence: 99%

Vectorizing and distributing number‐theoretic transform to count Goldbach partitions on Arm‐based supercomputers

Jesus, Oliveira e Silva, Weiland, 2023

Concurrency and Computation


Summary: In this article, we explore the usage of scalable vector extension (SVE) to vectorize number‐theoretic transforms (NTTs). In particular, we show that 64‐bit modular arithmetic operations, including modular multiplication, can be efficiently implemented with SVE instructions. The vectorization of NTT loops and kernels involving 64‐bit modular operations was not possible in previous Arm‐based single instruction multiple data architectures since these architectures lacked crucial instructions to efficiently implement modular multiplication. We test and evaluate our SVE implementation on the A64FX processor in an HPE Apollo 80 system. Furthermore, we implement a distributed NTT for the computation of large‐scale exact integer convolutions. We evaluate this transform on HPE Apollo 70, Cray XC50, HPE Apollo 80, and HPE Cray EX systems, where we demonstrate good scalability to thousands of cores. Finally, we describe how these methods can be utilized to count the number of Goldbach partitions of all even numbers to large limits. We present some preliminary results concerning this problem, in particular a histogram of the number of Goldbach partitions of the even numbers up to 2^40.
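
The exact integer convolution via an NTT described in this summary can be illustrated with a small scalar sketch. It is not the SVE or distributed implementation of the article: the prime `P = 998244353` with primitive root `G = 3` is a common textbook choice assumed here, and the function names `ntt` and `convolve` are hypothetical.

```python
P = 998244353   # 119 * 2^23 + 1, an NTT-friendly prime (assumed for the sketch)
G = 3           # a primitive root modulo P


def ntt(a, invert=False):
    """Iterative radix-2 NTT modulo P; len(a) must be a power of two."""
    a = list(a)
    n = len(a)
    # Bit-reversal permutation so that butterflies can run in place.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:
        w_len = pow(G, (P - 1) // length, P)   # primitive length-th root of unity
        if invert:
            w_len = pow(w_len, P - 2, P)       # its inverse, by Fermat's theorem
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + half):
                u, v = a[k], a[k + half] * w % P   # modular multiplication
                a[k] = (u + v) % P
                a[k + half] = (u - v) % P
                w = w * w_len % P
        length *= 2
    if invert:
        n_inv = pow(n, P - 2, P)
        a = [x * n_inv % P for x in a]
    return a


def convolve(x, y):
    """Convolution of two integer lists; exact while every result is below P."""
    size = len(x) + len(y) - 1
    n = 1
    while n < size:
        n *= 2
    fx = ntt(list(x) + [0] * (n - len(x)))
    fy = ntt(list(y) + [0] * (n - len(y)))
    return ntt([u * v % P for u, v in zip(fx, fy)], invert=True)[:size]


if __name__ == "__main__":
    assert convolve([1, 2, 3], [4, 5, 6]) == [4, 13, 28, 27, 18]
```

Convolving the indicator sequence of the primes with itself gives, at index n, the number of ordered pairs of primes summing to n, which is how such a convolution relates to counting Goldbach partitions.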


Fast interpolation of multivariate polynomials with sparse exponents

van der Hoeven, Lecerf, 2024

Journal of Complexity
