Based on the efficient inverse Cholesky factorization, we propose an implementation of OMP (called as version 0, i.e., v0) and its four memory-saving versions (i.e., the proposed v1, v2, v3 and v4). In the simulations, the proposed five versions and the existing OMP implementations have nearly the same numerical errors. Among all the OMP implementations, the proposed v0 needs the least computational complexity, and is the fastest in the simulations for almost all problem sizes. As a tradeoff between computational complexities/time and memory requirements, the proposed v1 seems to be better than all the existing ones when only considering the efficient OMP implementations storing G (i.e., the Gram matrix of the dictionary), the proposed v2 and v3 seem to be better than the only existing one when only considering the efficient implementations not storing G, and the proposed v4 seems to be better than the naive implementation that has the (known) minimum memory requirements. Moreover, all the proposed five versions only include parallelizable matrix-vector products in each iteration, and do not need any back-substitutions that are necessary in some existing efficient implementations (e.g., those utilizing the Cholesky factorization).