Machine vision-based fault detection and diagnosis technology for robotic manipulators is now widely used. However, traditional machine vision struggles to identify manipulator failures in complex environments with dim lighting and large texture differences, and under the influence of factors such as image motion blur caused by manipulator movement. This article discusses the failure factors of robotic manipulators and systematically analyzes the stages that lead to failure as well as the limitations of current technology. First, a gradient-based semantic segmentation method is proposed to quickly and accurately extract the grasped object from its complex surrounding environment. Second, for cases where the vision system and the grasped object move relative to each other in a dim environment, a multiframe image registration and fusion method is proposed to obtain high-quality, clear image data. Then, a machine learning-based fault detection and diagnosis method that fuses internal and external sensor data is adopted. Finally, a physical system is built to verify the approach in terms of target extraction, image clarity, fault detection speed, and diagnosis accuracy, demonstrating the superiority of the proposed algorithm.
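
As an illustrative sketch only (not the authors' exact method), the multiframe registration and fusion step could be approximated with OpenCV in Python: several frames captured under dim lighting are aligned to a reference frame using ECC-based affine registration and then averaged to suppress noise and motion blur. The function and variable names below are assumptions introduced for illustration.

    import cv2
    import numpy as np

    def register_and_fuse(frames):
        # Align every frame to the first one, then fuse by averaging.
        ref_gray = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
        fused = frames[0].astype(np.float32)
        criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 100, 1e-6)
        for frame in frames[1:]:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            warp = np.eye(2, 3, dtype=np.float32)
            # ECC estimates an affine warp between the reference and this frame.
            _, warp = cv2.findTransformECC(ref_gray, gray, warp,
                                           cv2.MOTION_AFFINE, criteria)
            # Warp the frame back onto the reference grid before accumulation.
            aligned = cv2.warpAffine(frame, warp,
                                     (frame.shape[1], frame.shape[0]),
                                     flags=cv2.INTER_LINEAR + cv2.WARP_INVERSE_MAP)
            fused += aligned.astype(np.float32)
        return (fused / len(frames)).astype(np.uint8)

    # Example usage: fuse five consecutive frames from a camera stream.
    # clear_image = register_and_fuse([frame0, frame1, frame2, frame3, frame4])

Averaging registered frames is one simple fusion choice; the paper's actual fusion strategy may differ, but the sketch conveys how multiframe alignment can recover a clearer image when the camera and grasped object move relative to each other.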