Target tracking in unmanned aerial vehicle (UAV) video is an important technique in intelligent urban surveillance systems for smart city applications such as smart transportation, road traffic monitoring, and inspection of stolen vehicles. In this paper, we propose a vision-based target tracking algorithm, built on sparse representation theory, for locating UAV-captured targets such as pedestrians and vehicles. First, each target candidate is sparsely represented in the subspace spanned by a joint dictionary. Then, the sparse representation coefficient is further constrained by an L2 regularization term based on temporal consistency. To cope with the partial occlusion that appears in UAV videos, a Markov Random Field (MRF)-based binary support vector with a contiguous occlusion constraint is introduced into the sparse representation model. For long-term tracking, a particle filter framework with a dynamic template update scheme is designed. Both qualitative and quantitative experiments on visible (Vis) and infrared (IR) UAV videos show that the proposed tracker achieves better precision and success rates than other state-of-the-art trackers.
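To make the first two steps concrete, the following is a minimal sketch of sparse coding with a temporal-consistency term, not the formulation used in the paper. It assumes an objective of the form 0.5*||y - D c||^2 + lam1*||c||_1 + 0.5*lam2*||c - c_prev||^2, where `y` is a candidate's feature vector, `D` is the dictionary, and `c_prev` is the coefficient vector from the previous frame. The L1 sparsity penalty, the ISTA (proximal gradient) solver, and all names (`sparse_code`, `lam1`, `lam2`, `c_prev`) are illustrative assumptions. The MRF occlusion model and the particle filter are omitted.

```python
import math


def matvec(A, x):
    """Multiply matrix A (list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1, applied element-wise."""
    return [math.copysign(max(abs(x) - t, 0.0), x) for x in v]


def objective(D, y, c, c_prev, lam1, lam2):
    """Assumed cost: data fit + L1 sparsity + L2 temporal consistency."""
    residual = [yi - pi for yi, pi in zip(y, matvec(D, c))]
    fit = 0.5 * sum(r * r for r in residual)
    sparsity = lam1 * sum(abs(ci) for ci in c)
    temporal = 0.5 * lam2 * sum((ci - pi) ** 2 for ci, pi in zip(c, c_prev))
    return fit + sparsity + temporal


def sparse_code(D, y, c_prev, lam1=0.1, lam2=0.1, iters=200):
    """Hypothetical helper: minimize the assumed cost with ISTA."""
    Dt = [list(col) for col in zip(*D)]  # transpose of D
    # The squared Frobenius norm upper-bounds the squared spectral norm,
    # so this step size is safe for the smooth part of the cost.
    step = 1.0 / (sum(v * v for row in D for v in row) + lam2)
    c = list(c_prev)  # warm start from the previous frame
    for _ in range(iters):
        residual = [pi - yi for pi, yi in zip(matvec(D, c), y)]
        grad = matvec(Dt, residual)
        grad = [g + lam2 * (ci - pi) for g, ci, pi in zip(grad, c, c_prev)]
        c = soft_threshold([ci - step * g for ci, g in zip(c, grad)],
                           step * lam1)
    return c


if __name__ == "__main__":
    D = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    y = [1.0, 0.0, 0.0]
    c_prev = [0.0, 0.0, 0.0]
    print(sparse_code(D, y, c_prev, lam1=0.1, lam2=1.0))
```

In this sketch, a larger `lam2` pulls the solution toward the previous frame's coefficients, which is one way to read the temporal-consistency constraint. In a full tracker the candidates would be scored by their reconstruction error, which is not shown here.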