“…ification of a computation expressed as a set of tensor contraction expressions and transforms it into efficient parallel code. Several compile-time optimizations are incorporated into the TCE: algebraic transformations to minimize operation counts [31,32], loop fusion to reduce memory requirements [28,30,29], space-time trade-off optimization [10], communication minimization [11], and data locality optimization [12,13] of memory-to-cache traffic.…”
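
To make the first optimization in that list concrete, the following is a minimal sketch of what minimizing the operation count by algebraic transformation can mean for a single contraction. It is not the TCE's actual algorithm: the three-operand contraction Y[i][l] = sum over j,k of A[i][j] * B[j][k] * C[k][l], the square n-by-n operands, and the function names `contract_naive` and `contract_factored` are all assumptions chosen for illustration. The sketch evaluates the same expression in one four-deep loop nest and then as two pairwise contractions through an intermediate, counting multiplications in each case.

```python
def contract_naive(A, B, C):
    """Evaluate Y[i][l] = sum_{j,k} A[i][j]*B[j][k]*C[k][l] in one loop nest.

    Hypothetical helper for illustration; assumes square n x n operands.
    Returns the result and the number of multiplications performed.
    """
    n = len(A)
    Y = [[0] * n for _ in range(n)]
    mults = 0
    for i in range(n):
        for l in range(n):
            for j in range(n):
                for k in range(n):
                    Y[i][l] += A[i][j] * B[j][k] * C[k][l]
                    mults += 2  # two multiplications per innermost iteration
    return Y, mults


def contract_factored(A, B, C):
    """Evaluate the same expression as two pairwise contractions.

    The intermediate T[i][k] = sum_j A[i][j]*B[j][k] is formed first, then
    Y[i][l] = sum_k T[i][k]*C[k][l]. Same assumptions as contract_naive.
    """
    n = len(A)
    mults = 0
    T = [[0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            for j in range(n):
                T[i][k] += A[i][j] * B[j][k]
                mults += 1
    Y = [[0] * n for _ in range(n)]
    for i in range(n):
        for l in range(n):
            for k in range(n):
                Y[i][l] += T[i][k] * C[k][l]
                mults += 1
    return Y, mults


if __name__ == "__main__":
    n = 4
    # Small integer operands so the two results can be compared exactly.
    A = [[i + j for j in range(n)] for i in range(n)]
    B = [[i - j for j in range(n)] for i in range(n)]
    C = [[i * j + 1 for j in range(n)] for i in range(n)]
    y_naive, m_naive = contract_naive(A, B, C)
    y_fact, m_fact = contract_factored(A, B, C)
    print(y_naive == y_fact, m_naive, m_fact)
```

Under these assumptions the single loop nest performs 2n^4 multiplications and the factored form 2n^3, at the cost of storing the n-by-n intermediate `T`. That extra storage is the kind of memory requirement the loop fusion and space-time trade-off optimizations named in the passage are described as addressing.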