Human Pose Estimation (HPE) is the task that aims to predict the location of human joints from images and videos. This task is used in many applications, such as sports analysis and surveillance systems. Recently, several studies have embraced deep learning to enhance the performance of HPE tasks. However, building an efficient HPE model is difficult; many challenges, like crowded scenes and occlusion, must be handled. This paper followed a systematic procedure to review different HPE models comprehensively. About 100 articles published since 2014 on HPE using deep learning were selected using several selection criteria. Both image and video data types of methods were investigated. Furthermore, both single and multiple HPE methods were reviewed. In addition, the available datasets, different loss functions used in HPE, and pretrained feature extraction models were all covered. Our analysis revealed that Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are the most used in HPE. Moreover, occlusion and crowd scenes remain the main problems affecting models’ performance. Therefore, the paper presented various solutions to address these issues. Finally, this paper highlighted the potential opportunities for future work in this task.