User authentication and verification based on gait data from smartphones’ inertial sensors has attracted increasing attention because these sensors are compact, portable and affordable. However, existing approaches often require users to walk along a specific route at a normal walking speed in order to reach acceptable recognition accuracy. To recognize gait without constraints on where and how users walk, we proposed a Hybrid Deep Learning Network (HDLN), which combined the advantages of a long short-term memory (LSTM) network and a convolutional neural network (CNN) to reliably extract discriminative features from complex smartphone inertial data. The convergence layer of HDLN was optimized with spatial pyramid pooling and an attention mechanism. The former allowed gait features to be extracted at multiple scales, and the latter focused processing on important gait information while suppressing unimportant data. We also developed a smartphone app that performs gait recognition in real time. The experimental results showed that HDLN outperformed CNN, LSTM, DeepConvLSTM and CNN+LSTM by 1.9%, 2.8%, 2.0% and 1.3%, respectively. The results further indicated that our model is highly scalable and well suited to real-world application scenarios.
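As a rough illustration of the two pooling ideas named above, the following sketch applies a one-dimensional form of spatial pyramid pooling and a softmax-weighted attention pooling to a sequence of per-time-step feature vectors. It is not the HDLN implementation: the function names `spp_1d` and `attention_pool`, the pyramid levels `(1, 2, 4)`, the use of max pooling, and the externally supplied attention scores are all assumptions made for illustration. In a trained network the scores would come from a learned layer.

```python
import math


def spp_1d(features, levels=(1, 2, 4)):
    """Max-pool a T x C feature sequence into a fixed-length vector.

    For each pyramid level with `bins` bins, the time axis is split into
    `bins` windows and each channel is max-pooled per window. The output
    length is C * sum(levels), independent of T. (Illustrative sketch only.)
    """
    t = len(features)
    if t == 0:
        raise ValueError("features must contain at least one time step")
    c = len(features[0])
    pooled = []
    for bins in levels:
        for b in range(bins):
            start = (b * t) // bins                    # floor(b * T / bins)
            end = ((b + 1) * t + bins - 1) // bins     # ceil((b + 1) * T / bins)
            window = features[start:end]               # never empty for T >= 1
            pooled.extend(max(row[ch] for row in window) for ch in range(c))
    return pooled


def attention_pool(features, scores):
    """Weighted sum over time steps, with softmax(scores) as the weights.

    `scores` holds one relevance value per time step; here it is passed in
    directly as a stand-in for the output of a learned scoring layer.
    """
    if len(features) != len(scores):
        raise ValueError("need exactly one score per time step")
    peak = max(scores)                                 # subtract max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    c = len(features[0])
    return [sum(w * row[ch] for w, row in zip(weights, features))
            for ch in range(c)]
```

Because the output length of `spp_1d` depends only on the number of channels and the pyramid levels, windows of different durations (for example, from different walking speeds) map to vectors of the same size, which is one reason such pooling is useful ahead of a fixed-size classifier.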