With the rapid growth in demand for location-based services and the proliferation of smartphones, indoor localization is attracting great interest. In indoor environments, the activities users perform carry useful semantic information, which indoor localization systems can use to confirm users’ current relative locations in a building. In this paper, we propose a deep-learning model based on a Convolutional Long Short-Term Memory (ConvLSTM) network to classify human activities in the indoor localization scenario using smartphone inertial sensor data. Results show that the proposed human activity recognition (HAR) model accurately identifies nine types of activities: not moving, walking, running, going up in an elevator, going down in an elevator, walking upstairs, walking downstairs, going up a ramp, and going down a ramp. Moreover, the predicted activities were integrated into an existing indoor positioning system and evaluated in a multi-story building across several testing routes, yielding an average positioning error of 2.4 m. The results show that including human activity information can reduce the overall localization error of the system and improve the identification of floor transitions within a building. The experiments demonstrate promising results and verify the effectiveness of using human activity-related information for indoor localization.