Many applications perform frequent scatter update operations to large data structures. For example, in push-style graph algorithms, processing each vertex requires updating the data of all its neighbors. Neighbors are often scattered over the whole graph, so these scatter updates have poor spatial and temporal locality. In current systems, scatter updates suffer high synchronization costs and high memory traffic. These drawbacks make push-style execution unattractive, and, when algorithms allow it, programmers gravitate towards pull-style implementations based on gather reads instead. We present PHI, a push cache hierarchy that makes scatter updates synchronization- and bandwidth-efficient. PHI adds support for pushing sparse, commutative updates from cores towards main memory, with simple compute logic at each cache level to buffer and coalesce these commutative updates throughout the hierarchy. This avoids synchronization, exploits temporal locality, and produces a load-balanced execution. Moreover, PHI exploits spatial locality by selectively deferring updates with poor spatial locality, batching them to achieve sequential main memory transfers. PHI is the first system to leverage both the temporal and spatial locality benefits of commutative scatter updates, some of which do not apply to gather reads. As a result, PHI not only makes push algorithms efficient, but makes them consistently faster than pull ones. We evaluate PHI on graph algorithms and other sparse applications processing large inputs. PHI improves performance by 4.7× on average (and by up to 11×), and reduces memory traffic by 2× (and by up to 5×).

CCS CONCEPTS
• Computer systems organization → Multicore architectures.
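
To make the push/pull contrast concrete, the sketch below (not from the paper) shows the two patterns for a PageRank-style contribution sum over a graph in compressed sparse row (CSR) form. The data layout, function names, and the CAS-based atomic add are illustrative assumptions, not PHI's interface; the push loop performs the kind of sparse, commutative scatter updates PHI targets, while the pull loop trades atomics for reading every in-edge.

```cpp
// Minimal sketch (not from the paper): push-style scatter updates vs.
// pull-style gather reads for a PageRank-like contribution sum over a
// graph in CSR form. The CSR layout, function names, and the CAS-based
// atomic add are illustrative assumptions.
#include <atomic>
#include <cstdint>
#include <vector>

struct CSRGraph {
    // offsets[v] .. offsets[v+1] delimit vertex v's edges in neighbors[].
    std::vector<uint32_t> offsets;
    std::vector<uint32_t> neighbors;
};

// Portable atomic add for doubles (fetch_add on floating-point atomics is
// only standard since C++20), implemented as a compare-and-swap retry loop.
static void atomic_add(std::atomic<double>& target, double value) {
    double expected = target.load(std::memory_order_relaxed);
    while (!target.compare_exchange_weak(expected, expected + value,
                                         std::memory_order_relaxed)) {
        // On failure, expected holds the current value; retry the add.
    }
}

// Push style: each vertex scatters its contribution to all out-neighbors.
// Destinations are spread across memory (poor locality), and when vertices
// are processed in parallel, concurrent writers to the same destination
// must synchronize the read-modify-write.
void push_step(const CSRGraph& out_edges, const std::vector<double>& contrib,
               std::vector<std::atomic<double>>& next) {
    for (uint32_t v = 0; v + 1 < out_edges.offsets.size(); ++v)
        for (uint32_t e = out_edges.offsets[v]; e < out_edges.offsets[v + 1]; ++e)
            atomic_add(next[out_edges.neighbors[e]], contrib[v]);  // commutative update
}

// Pull style: each vertex gathers contributions from its in-neighbors.
// The destination is private to the vertex, so no synchronization is
// needed, but every in-edge must be read, even from inactive sources.
void pull_step(const CSRGraph& in_edges, const std::vector<double>& contrib,
               std::vector<double>& next) {
    for (uint32_t v = 0; v + 1 < in_edges.offsets.size(); ++v) {
        double sum = 0.0;
        for (uint32_t e = in_edges.offsets[v]; e < in_edges.offsets[v + 1]; ++e)
            sum += contrib[in_edges.neighbors[e]];
        next[v] = sum;
    }
}
```

Per the abstract, PHI's goal is to make updates like the one in `push_step` synchronization- and bandwidth-efficient by buffering and coalescing them throughout the cache hierarchy, rather than forcing programmers into the pull-style alternative.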