“…Consequently, the performance of common stencil computations used in GMG is typically limited by the memory bandwidth of modern architectures, as the ratio of floating-point operations to data movement (i.e., the flop-to-byte ratio) is usually well below the machine balance. For this reason, much research has been devoted to reducing data movement for stencil computations using techniques such as cache-oblivious algorithms, time skewing, wavefront optimizations, and overlapped tiling [30,22,6,7,27,35,18,29,36,8].…”
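
The comparison between flop-to-byte ratio and machine balance can be illustrated with a small worked example. The sketch below is not taken from the cited work. It assumes a double-precision 7-point constant-coefficient stencil (8 flops per grid point), ideal cache reuse so that each point moves 16 bytes to and from main memory (one read, one write), and a hypothetical machine with 1000 GFLOP/s peak and 100 GB/s memory bandwidth. All function names and numbers are illustrative assumptions.

```python
# Illustrative sketch: every number below is an assumption, not a measurement.

def flop_to_byte_ratio(flops_per_point: float, bytes_per_point: float) -> float:
    """Arithmetic intensity of one stencil sweep, in flop/byte."""
    return flops_per_point / bytes_per_point


def machine_balance(peak_gflops: float, bandwidth_gbs: float) -> float:
    """Flops the machine can execute per byte moved from memory, in flop/byte."""
    return peak_gflops / bandwidth_gbs


def attainable_gflops(intensity: float, peak_gflops: float, bandwidth_gbs: float) -> float:
    """Simple roofline bound: the lower of the compute and memory ceilings."""
    return min(peak_gflops, intensity * bandwidth_gbs)


# Assumed 7-point stencil: out = a*u[c] + b*(sum of 6 neighbours)
# -> 6 additions + 2 multiplications = 8 flops per point.
FLOPS_PER_POINT = 8
# Assumed ideal caching: read u once (8 B) and write out once (8 B).
BYTES_PER_POINT = 16

# Hypothetical machine parameters.
PEAK_GFLOPS = 1000.0
BANDWIDTH_GBS = 100.0

intensity = flop_to_byte_ratio(FLOPS_PER_POINT, BYTES_PER_POINT)
balance = machine_balance(PEAK_GFLOPS, BANDWIDTH_GBS)
bound = attainable_gflops(intensity, PEAK_GFLOPS, BANDWIDTH_GBS)
memory_bound = intensity < balance

print(f"stencil intensity : {intensity} flop/byte")
print(f"machine balance   : {balance} flop/byte")
print(f"attainable        : {bound} GFLOP/s")
print(f"memory bound      : {memory_bound}")
```

Under these assumptions the stencil's intensity sits far below the machine balance, so the sweep is bounded by bandwidth rather than by peak compute. Techniques such as those listed in the excerpt raise the effective intensity by reusing data across several sweeps before it returns to main memory.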