Concrete structures are ubiquitous in modern societies. In recent years, the need to inspect these structures has become a growing concern, and automated inspection methods are in high demand. Acoustic methods such as the hammering test are among the most popular non-destructive testing methods for this task. In this paper, an approach to defect detection in concrete structures using active weak supervision and visual information is proposed. Based on audio and position information, pairs of samples are actively selected and a user is queried about their similarity. The user's answers are then used, in a weakly supervised fashion, to transform the feature space into one that favors the clustering of defect and non-defect samples, with the clustering further reinforced by position information. Experiments conducted in both laboratory and field conditions demonstrated the effectiveness of the proposed method.