Action recognition is crucial to public security and plays an important role in pedestrian protection. Numerous deep-learning approaches to action recognition have been reported recently, but they tend to generalize poorly, lack robustness, and recognize actions badly in low-resolution or occluded footage from outdoor environments. This paper addresses these issues and develops an approach for multi-person action recognition in a park environment. First, frames are extracted from surveillance video in real time. Second, YOLOv5 is used to detect and segment human targets in each frame. Third, the AlphaPose algorithm estimates keypoint heatmaps of the human body for each detected target. Finally, the heatmaps are preprocessed, fused, and classified by deep learning networks to complete the multi-person action recognition. In addition, to overcome the difficulty of extracting information from heatmaps, we propose the Heatmap-based Action Recognition Network (HARNet), an optimized variant of ResNet that incorporates an attention mechanism. The approach was tested on park surveillance video covering multiple scenarios, and the results verify its performance.
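The four stages described above can be outlined as a minimal Python sketch. This illustrates the data flow only and is not the authors' implementation: the `detect`, `estimate_heatmaps`, and `classify` callables are hypothetical stand-ins for YOLOv5, AlphaPose, and the classification network, and the sampling step, min-max normalization, and pixel-wise maximum fusion are assumptions, since the abstract does not specify how frames are sampled or how heatmaps are preprocessed and fused.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Tuple

Box = Tuple[int, int, int, int]   # (x1, y1, x2, y2) in pixel coordinates
Image = List[list]                # rows of pixels (nested lists keep the sketch dependency-free)
Heatmap = List[List[float]]       # one 2D map per body keypoint


@dataclass
class PersonAction:
    frame_index: int
    box: Box
    label: str


def sample_frames(frames: Iterable[Image], step: int) -> Iterator[Tuple[int, Image]]:
    """Stage 1: keep every `step`-th frame of the video stream (assumed sampling rule)."""
    for index, frame in enumerate(frames):
        if index % step == 0:
            yield index, frame


def fuse_heatmaps(heatmaps: List[Heatmap]) -> Heatmap:
    """Stage 4a (assumed): min-max normalize each keypoint map, then take the pixel-wise maximum."""
    normalized = []
    for channel in heatmaps:
        values = [v for row in channel for v in row]
        lo, hi = min(values), max(values)
        scale = (hi - lo) or 1.0  # a constant map normalizes to all zeros
        normalized.append([[(v - lo) / scale for v in row] for row in channel])
    height, width = len(normalized[0]), len(normalized[0][0])
    return [[max(ch[y][x] for ch in normalized) for x in range(width)]
            for y in range(height)]


def recognize_actions(
    frames: Iterable[Image],
    detect: Callable[[Image], List[Box]],                # hypothetical stand-in for YOLOv5
    estimate_heatmaps: Callable[[Image], List[Heatmap]],  # hypothetical stand-in for AlphaPose
    classify: Callable[[Heatmap], str],                  # hypothetical stand-in for the classifier
    step: int = 5,
) -> List[PersonAction]:
    """Run the four stages and return one action label per detected person per sampled frame."""
    results = []
    for index, frame in sample_frames(frames, step):
        for box in detect(frame):                             # Stage 2: person detection
            x1, y1, x2, y2 = box
            crop = [row[x1:x2] for row in frame[y1:y2]]
            fused = fuse_heatmaps(estimate_heatmaps(crop))    # Stages 3 and 4a
            results.append(PersonAction(index, box, classify(fused)))  # Stage 4b
    return results
```

Passing the models in as callables keeps the sketch independent of any particular detector or pose-estimator API; in practice each callable would wrap the corresponding trained model.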