“…GEMM employs a series of architecture-aware optimization strategies, such as cache- and register-level data reuse, prefetching, and vectorization, that improve the hardware utilization of a program from a marginal <1% to a near-optimal efficiency (>90%) [83,230,207]. To leverage the highly optimized GEMM subroutine, the order in memory of the spin configurations $S^{\alpha}_{r}(t)$ must match the data layout the kernel expects. Moreover, a fusion strategy that merges the memory footprint of the element-wise operation with the compute-bound GEMM operation to hide the memory latency is a sound solution, one that benefits a range of GEMM-based scientific-computing and machine-learning applications [263,264]. Therefore, we delve into the black box of GEMM kernels, enabling memory-bandwidth-efficient computations for "Daxpy"…”
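To make the kernel-fusion idea concrete, the following C listing is a minimal illustrative sketch, not the paper's actual kernel: a 4x4 register-blocked GEMM micro-kernel whose element-wise epilogue is applied while the output tile still resides in registers, so the result is written to memory once instead of requiring a second, memory-bound pass. The names fused_microkernel_4x4 and elementwise, and the choice of tanh as the element-wise operation, are assumptions made purely for illustration.

/* Illustrative sketch of fusing an element-wise operation into a
 * register-blocked GEMM micro-kernel (names and the tanh epilogue are
 * hypothetical, chosen only to demonstrate the technique). */
#include <stddef.h>
#include <math.h>

static inline double elementwise(double x) {
    /* placeholder element-wise operation applied to each output entry */
    return tanh(x);
}

/* C is a 4x4 tile, row-major with leading dimension ldc;
 * A is packed 4 x K (stride 4 per k-step);
 * B is packed K x 4 (stride 4 per k-step). Computes
 * C = elementwise(C + A*B) with a single store per element. */
static void fused_microkernel_4x4(size_t K,
                                  const double *A, const double *B,
                                  double *C, size_t ldc) {
    double acc[4][4] = {{0.0}};              /* accumulators kept in registers */
    for (size_t k = 0; k < K; ++k) {
        for (int i = 0; i < 4; ++i)          /* rank-1 update of the 4x4 tile */
            for (int j = 0; j < 4; ++j)
                acc[i][j] += A[k * 4 + i] * B[k * 4 + j];
    }
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)          /* fused epilogue: no second pass over C */
            C[i * ldc + j] = elementwise(C[i * ldc + j] + acc[i][j]);
}

Because the epilogue reuses the tile already held in registers, the element-wise step adds no extra memory traffic beyond the single write of C; an unfused version would stream C through memory a second time and be limited by bandwidth rather than compute.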