Vision‐based autonomous inspection of concrete surface defects is crucial for the efficient maintenance and rehabilitation of infrastructure and has become a research hot spot. However, most existing vision‐based inspection methods focus on detecting a single kind of defect against a nearly uniform background, where defects are relatively large and easily recognizable. In real‐world scenarios, multiple types of defects often occur simultaneously, and most of them occupy only small fractions of the inspection images and are swamped by cluttered backgrounds, which easily leads to missed and false detections. In addition, most previous studies focus only on detecting defects; few address the geolocalization problem, which is indispensable for the timely performance of repair, protection, or reinforcement work, and most of those that do rely heavily on GPS to track defect locations. GPS, however, is sometimes unreliable inside infrastructure, where signals are easily blocked, which dramatically increases search costs. To address these limitations, we present a unified and purely vision‐based method, denoted the defects detection and localization network, which detects and classifies various typical types of defects under challenging conditions while simultaneously geolocating them without requiring external localization sensors. We design a supervised deep convolutional neural network and propose novel training methods to optimize its performance on specific tasks. Extensive experiments show that the proposed method is effective, achieving a detection accuracy of 80.7% and a localization accuracy of 86% at 0.41 s per image (at a scale of 1,200 pixels in the field test experiment), which makes it well suited for integration into intelligent autonomous inspection systems in support of practical applications.