In high-mobility scenarios, a user's media experience is severely constrained by hard-to-predict network channels and unstable network quality, which result from fast user movement, frequent base station handovers, the Doppler effect, and related factors. To address this, this paper proposes an adaptive video transmission architecture based on three-dimensional caching. In the temporal dimension, video data are cached at different base stations; in the spatial dimension, video data are cached at base stations, high-speed trains, and clients, forming a multilevel caching architecture based on spatio-temporal attributes. The paper then mathematically models the media stream transmission process and formulates the optimization problems to be solved. To solve these problems, it proposes three algorithms: a placement algorithm based on three-dimensional caching, a video content selection algorithm for caching, and a bitrate selection algorithm. Finally, a simulation system is built, and its results show that the proposed scheme is better suited to high-speed mobile networks, with better and more stable performance.