Visual attention mechanisms have been widely used in computer vision and play a vital role in research on human action recognition. In this paper, we explore a novel moving-target detection mechanism for human action recognition and propose a new 3D CNN (3D Convolutional Neural Network) model, dubbed LaM-2SRN (Local Features Enhanced and Moving target detected 2Stream-ResNet), for extracting and learning attention-enhanced spatiotemporal features. The contributions of this paper are as follows. First, a traditional CAM (Class Activation Map)-based visual attention algorithm is used to obtain the optical flow information of the human region, eliminating the influence of irrelevant optical flow information (such as background clutter). Second, the ViBe algorithm is used to identify the moving targets across consecutive frames and retain their optical flow information, which complements the optical flow information of the human region to yield a complete motion descriptor. After the motion information is marked in the video frames, the marked frames and the original frames are fed into 3D CNN and 2D CNN models, respectively, to acquire a mixed descriptor that serves as the basis for video classification. Unlike most previous target detection algorithms, our method detects not only the human target but also the moving foreground targets in the video, so that the pre-trained CNN models can obtain more complete motion information. We evaluate on two benchmarks, UCF101 and HMDB51, using RGB-only video data; the experimental results demonstrate that our LaM-2SRN is comparable to previous state-of-the-art algorithms.

INDEX TERMS Action recognition, video understanding, visual attention, background detection, CNN.
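The abstract describes fusing two masks before extracting optical flow: a CAM-derived human-region mask and a moving-foreground mask. The sketch below illustrates that fusion idea only; it is not the authors' implementation. Since ViBe is not bundled with stock OpenCV, MOG2 is used here as a stand-in foreground detector, and `video.avi` plus the constant CAM mask are hypothetical placeholders.

```python
# Minimal sketch of the mask-fusion idea: keep dense optical flow only where
# the human-region mask OR the moving-foreground mask is active.
import cv2
import numpy as np

def fuse_motion_masks(cam_mask: np.ndarray, fg_mask: np.ndarray) -> np.ndarray:
    """Union of the human-region mask and the moving-foreground mask (HxW, {0,1})."""
    return cam_mask.astype(np.uint8) | fg_mask.astype(np.uint8)

def masked_flow(prev_gray, next_gray, mask):
    """Dense Farneback optical flow, zeroed outside the fused motion mask."""
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    return flow * mask[..., None]  # broadcast mask over the two flow channels

bg = cv2.createBackgroundSubtractorMOG2(detectShadows=False)  # stand-in for ViBe
cap = cv2.VideoCapture("video.avi")                           # placeholder clip
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    fg = (bg.apply(frame) > 0).astype(np.uint8)   # moving-target mask
    cam = np.ones_like(fg)                        # placeholder: CAM heatmap threshold
    flow = masked_flow(prev_gray, gray, fuse_motion_masks(cam, fg))
    prev_gray = gray
```

In practice the `cam` mask would come from thresholding a class activation heatmap rather than the all-ones placeholder used here; the union of the two masks is what lets background motion outside the human region survive the filtering, matching the "complete motion descriptor" goal stated above.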