Figure 1: Schematic representation of global reduction pipelining in Krylov subspace methods (e.g. Conjugate Gradients) for pipeline length two (l = 2). Each global reduction is initiated by a non-blocking MPI_Iallreduce call. It then overlaps with the global communication and computational kernels of the next two iterations and is finalized by a matching MPI_Wait call. In the ideal case, this yields a theoretical O(l) speedup over classic Krylov subspace methods.
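
The O(l) claim in the caption can be illustrated with a small timing model. The sketch below is not taken from the paper and makes no MPI calls. It assumes a fixed compute time and a fixed reduction latency per iteration, perfect overlap of in-flight reductions with each other and with computation, and no network contention. The names classic_time, pipelined_time, t_compute and t_reduce are hypothetical and introduced only for this illustration.

```python
def classic_time(n_iters, t_compute, t_reduce):
    """Blocking reductions: each iteration pays compute plus the full reduction latency."""
    return n_iters * (t_compute + t_reduce)


def pipelined_time(n_iters, t_compute, t_reduce, l):
    """Idealized pipelined schedule (assumed model, not the paper's).

    The reduction started at the end of iteration i is awaited at the end of
    iteration i + l, so it overlaps with the kernels of the next l iterations.
    """
    red_done = []  # red_done[i]: completion time of the reduction started in iteration i
    now = 0.0
    for i in range(n_iters):
        now += t_compute                   # local kernels of iteration i
        red_done.append(now + t_reduce)    # stands in for MPI_Iallreduce: start, do not block
        if i >= l:
            # stands in for MPI_Wait on the reduction started l iterations earlier
            now = max(now, red_done[i - l])
    # Drain reductions that are still in flight after the last iteration.
    return max([now] + red_done)


if __name__ == "__main__":
    n, t_c, t_r, l = 100, 1.0, 2.0, 2
    speedup = classic_time(n, t_c, t_r) / pipelined_time(n, t_c, t_r, l)
    print(f"modelled speedup for l = {l}: {speedup:.2f}")
```

In this model the largest gain occurs when the reduction latency is about l times the compute time per iteration. The classic method then spends roughly (l + 1) compute units per iteration and the pipelined method roughly one, which is consistent with an O(l) speedup under the stated assumptions.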