“…The third approach is based on using parallel computing under different architectures. On multi-core architecture, the authors of [7] proposed a parallel implementation for GMM using the OpenMP framework. On GPU architecture, there are many implementations to parallelize BS using different optimization techniques (such as memory coalescing, data transfer, kernel overlapping, divergent branch elimination, and efficient register usage) [8][9][10].…”