Tensor Processing Units (TPUs) are specialized hardware accelerators developed by Google to support large-scale machine-learning tasks, but they can also be leveraged to accelerate and scale other linear-algebra-intensive computations. In this paper, we demonstrate the use of TPUs for massively parallel, classical simulations of quantum many-body dynamics on very long timescales. We apply our methods to study the phenomenon of Floquet prethermalization, i.e., exponentially slow heating in quantum spin chains subject to high-frequency periodic driving. We simulate the dynamics of $L = 34$ qubits for over $10^5$ Floquet periods, corresponding to $4 \times 10^6$ nearest-neighbor two-qubit gates. This is achieved by distributing the computation over 128 TPU cores. The simulated circuits have no additional symmetries and represent a pure-state evolution in the full $2^L$-dimensional Hilbert space. We study the computational cost of the simulations as a function of both the number of qubits and the number of TPU cores used, up to our maximum capacity of $L = 40$ qubits, which requires 2048 TPU cores. For a 30-qubit benchmark simulation on 128 TPU cores, we find a $230\times$ speedup in wall-clock runtime compared to a reference multi-core CPU simulation that we take to be representative of the current standard in quantum many-body dynamics research. We also study the accumulation of errors as a function of circuit depth. Our work demonstrates that TPUs can offer significant advantages for state-of-the-art simulations of quantum many-body dynamics.
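
To give a sense of why a full state-vector simulation at these sizes has to be distributed, the following back-of-the-envelope sketch estimates the memory needed to hold all $2^L$ amplitudes and the share per core for the qubit and core counts quoted above. It assumes single-precision complex amplitudes (8 bytes each) and an even split of the state vector across cores; neither assumption is stated in the text, and the helper names are hypothetical. Scratch buffers and communication overhead are ignored.

```python
# Rough memory estimate for a full state-vector simulation.
# Assumption (not from the text): each amplitude is a complex64 value, 8 bytes.
BYTES_PER_AMPLITUDE = 8
GIB = 2 ** 30


def state_vector_bytes(num_qubits: int) -> int:
    """Bytes needed to store all 2**num_qubits amplitudes of a pure state."""
    return (2 ** num_qubits) * BYTES_PER_AMPLITUDE


def bytes_per_core(num_qubits: int, num_cores: int) -> float:
    """Bytes per core, assuming the state vector is split evenly across cores."""
    return state_vector_bytes(num_qubits) / num_cores


if __name__ == "__main__":
    # (qubits, cores) pairs quoted in the abstract.
    for num_qubits, num_cores in [(30, 128), (34, 128), (40, 2048)]:
        total_gib = state_vector_bytes(num_qubits) / GIB
        per_core_gib = bytes_per_core(num_qubits, num_cores) / GIB
        print(
            f"L={num_qubits}: {total_gib:.0f} GiB total, "
            f"{per_core_gib:.4g} GiB per core on {num_cores} cores"
        )
```

Each additional qubit doubles the storage, so under these assumptions going from $L = 34$ to $L = 40$ multiplies the total memory by $2^6 = 64$ while the core count grows only by a factor of 16.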