“…Whilst there has been extensive research on the optimization of modular multiplications for GPUs [17‐20], efforts to optimize modular operations for SIMD instruction sets have been somewhat scarcer and have focused on the x86 architecture. Examples of these efforts include optimizations for several x86 CPUs supporting SSE2 (Streaming SIMD Extensions 2) [17], efficient implementations of modular operations for the SSE and AVX instruction sets (in particular SSE4.2 and AVX2, for the Barrett and Montgomery methods) [21], and the implementation of an efficient modular multiplication algorithm for AVX‐512 [22]. To the best of our knowledge, there has been no similar work concerning the implementation of efficient modular arithmetic methods for Arm SVE.…”