Structural displacement is an important quantity for assessing the health of civil infrastructure. Vision‐based approaches using unmanned aerial vehicles (UAVs) equipped with high‐resolution cameras have been proposed for this purpose. However, because the camera moves with the UAV, any video obtained contains both the motion of the structure and the motion of the camera. Planar homography can eliminate the errors induced by camera movement without requiring the camera parameters. Its direct application to large structures remains limited, however, because it is seldom feasible to capture the undeformed regions together with the measurement points on the structure in a single image of sufficient resolution. In this study, a new framework is presented to address these issues and to facilitate the extraction of structural displacement from videos taken by a UAV‐mounted camera. First, a two‐layer feedforward neural network (FNN) is adopted to obtain the image coordinates of the selected structural features at their stationary positions, which then serve as homography features. Next, the structural displacement is estimated using the homography transformation matrix determined from these homography features. Finally, the proposed approach is validated on both a six‐story shear‐building model in the laboratory and an elevator tower located in Zhongshan City, China. The results demonstrate the efficacy of the proposed approach.
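As a rough illustration of the homography step only, the following Python sketch shows how a planar homography estimated from four stationary features could remove camera motion from a tracked measurement point. It is a minimal sketch under stated assumptions, not the implementation used in this study: the stationary-position coordinates in the current frame are assumed to be already available (the FNN that supplies them is not reproduced), exactly four homography features are used, and the function names `solve_linear`, `estimate_homography`, `apply_homography`, and `displacement` are hypothetical.

```python
def solve_linear(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("degenerate feature configuration")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def estimate_homography(src, dst):
    """Return the 3x3 homography H (with h33 fixed to 1) mapping src to dst.

    Assumption: exactly four point pairs, no three of them collinear.
    """
    if len(src) != 4 or len(dst) != 4:
        raise ValueError("this sketch expects exactly four point pairs")
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h11 x + h12 y + h13) / (h31 x + h32 y + 1), and likewise for v
        rows.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        rhs.append(v)
    h = solve_linear(rows, rhs) + [1.0]
    return [h[0:3], h[3:6], h[6:9]]


def apply_homography(H, pt):
    """Map an image point through H using homogeneous coordinates."""
    x, y = pt
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    u = (H[0][0] * x + H[0][1] * y + H[0][2]) / w
    v = (H[1][0] * x + H[1][1] * y + H[1][2]) / w
    return (u, v)


def displacement(H_cur_to_ref, observed_cur, stationary_ref):
    """Displacement of a measurement point, in reference-frame image units."""
    u, v = apply_homography(H_cur_to_ref, observed_cur)
    return (u - stationary_ref[0], v - stationary_ref[1])


if __name__ == "__main__":
    # Synthetic values for illustration only; they are not data from the study.
    ref_features = [(0.0, 0.0), (200.0, 0.0), (200.0, 150.0), (0.0, 150.0)]
    cur_features = [(5.0, -3.0), (208.0, -4.5), (207.0, 142.0), (6.0, 140.0)]
    H = estimate_homography(cur_features, ref_features)
    print(displacement(H, (48.0, 55.0), (40.0, 60.0)))
```

The homography here maps the current frame back to the reference frame, so the returned displacement is expressed in reference-image units and still needs a scale factor to convert it to physical units. With more than four features, a least-squares or robust estimate with coordinate normalization would be the usual choice.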