A long-term complex action is typically composed of multiple short-term actions, and the speed and importance of these short-term actions directly affect recognition results. Two-stream neural networks already achieve good results on action recognition datasets. However, previous two-stream networks have focused mainly on action modeling and have neglected how the speed and importance of different short-term actions affect recognition. This limits the model's ability to represent different short-term actions and, in turn, reduces recognition performance. To address this issue, this paper proposes a Short-term Action Spatio-Temporal Attention (STASTA) module built on the two-stream network structure. The STASTA module attends to the differences in importance and speed among short-term actions. It extracts these differences from the video and then fuses the resulting features, with the aim of enriching the spatio-temporal features and improving action recognition performance. The proposed method is evaluated on the Something-Something v1 & v2 and Charades datasets. Extensive experiments indicate that the proposed method achieves state-of-the-art results among video action recognition methods.
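
To make the idea of weighting short-term actions by importance and speed more concrete, the following is a minimal illustrative sketch and not the STASTA module itself. It assumes each short-term action has already been summarized as one feature vector, uses a hypothetical linear scoring vector `w_importance` as the importance cue, and uses the feature change between adjacent segments as a stand-in speed cue. The function name, the equal-weight fusion, and all shapes are assumptions made for illustration only.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def short_term_attention(segment_feats, w_importance):
    """Pool per-segment features with importance and speed weights.

    segment_feats: (T, C) array, one feature vector per short-term segment.
    w_importance:  (C,) hypothetical scoring vector (an assumption, not
                   taken from the paper).
    Returns the fused (C,) feature and the (T,) attention weights.
    """
    # Importance cue: one scalar score per segment, normalized over time.
    importance = softmax(segment_feats @ w_importance)

    # Speed cue (assumed proxy): how much the features change between
    # adjacent segments. The first segment has no predecessor, so it gets 0.
    change = np.linalg.norm(np.diff(segment_feats, axis=0), axis=1)
    speed = softmax(np.concatenate([[0.0], change]))

    # Fusion (assumed): average the two weightings, then pool over time.
    weights = 0.5 * (importance + speed)
    fused = weights @ segment_feats
    return fused, weights


rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))   # 8 short-term segments, 16 channels
w = rng.normal(size=16)
fused, weights = short_term_attention(feats, w)
print(fused.shape, weights.shape)
```

In a two-stream setting, a pooling step of this kind would presumably be applied to learned features from each stream, with the scoring parameters trained end to end rather than sampled at random as in this toy usage.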