Strong interference, such as high noise and random impacts, limits a model's ability to capture long-term dependencies and thereby lowers the accuracy of Remaining Useful Life (RUL) prediction for bearings. To address this issue, a spatiotemporal fusion network capable of ultra-long-term feature analysis is proposed to improve bearing RUL prediction under strong interference. The network uses a lightweight vision transformer encoder based on dilated convolution to extract spatial features that reflect the short-term degradation state of the bearing. These features are then fed sequentially into an adaptive tiered memory unit, built on a multiple attention mechanism and a neuron layering mechanism, which analyzes temporal features indicative of long-term degradation. The short-term spatial and long-term temporal features are then fused for RUL prediction. To validate the robustness and predictive accuracy of the proposed approach, a gearbox-rolling bearing accelerated test platform is constructed that simulates high noise and random impact conditions. Experiments confirm that the proposed method remains robust and accurate under strong interference.
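
The data flow described above (short-term feature extraction per window, sequential long-term memory, then fusion) can be illustrated with a minimal pure-Python sketch. It is not the proposed network: the fixed difference kernel, the exponential moving average standing in for the adaptive tiered memory unit, the sigmoid output, and all function names are hypothetical simplifications chosen only to make the three stages concrete.

```python
import math

def dilated_conv1d(signal, kernel, dilation):
    """Valid-mode 1D dilated convolution (cross-correlation form)."""
    span = (len(kernel) - 1) * dilation  # samples covered beyond the first tap
    return [
        sum(k * signal[i + j * dilation] for j, k in enumerate(kernel))
        for i in range(len(signal) - span)
    ]

def short_term_features(window):
    """Hypothetical stand-in for the spatial encoder:
    mean absolute filter response at two dilation rates."""
    feats = []
    for d in (1, 2):
        out = dilated_conv1d(window, [1.0, 0.0, -1.0], d)
        feats.append(sum(abs(v) for v in out) / len(out))
    return feats

def update_memory(memory, feats, gate=0.9):
    """Hypothetical stand-in for the temporal unit:
    exponential moving average over successive windows."""
    return [gate * m + (1.0 - gate) * f for m, f in zip(memory, feats)]

def predict_rul(windows, weights, bias):
    """Fuse the latest short-term features with the long-term memory
    and map them to a normalized RUL value in (0, 1)."""
    memory = [0.0, 0.0]
    feats = [0.0, 0.0]
    for w in windows:
        feats = short_term_features(w)
        memory = update_memory(memory, feats)
    fused = feats + memory  # concatenation of short- and long-term features
    score = sum(wt * x for wt, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-score))
```

Each window must contain at least five samples so that the dilation-2 filter produces output. In a trained model the weights and bias would be learned rather than supplied by hand.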