Zero‐shot learning with vision‐language pretrained (VLP) models is expected to offer an alternative to existing deep learning models for defect detection when training data are insufficient. However, the performance of VLP models, including contrastive language‐image pretraining (CLIP), fluctuates with the prompt (input), which has motivated research on prompt engineering, that is, the optimization of prompts to improve performance. This study therefore aims to identify the features of a prompt that yield the best performance in classifying and detecting building defects using the zero‐shot and few‐shot capabilities of CLIP. The results reveal the following: (1) domain‐specific definitions perform better than general definitions and images; (2) a complete sentence performs better than a set of core terms; and (3) multimodal information performs better than single‐modal information. With the proposed prompting method, the resulting detection performance exceeded that of existing supervised models.
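As a rough illustration of why prompt wording matters, the sketch below shows the scoring step generally used for CLIP‐style zero‐shot classification: an image embedding is compared with one text embedding per prompt by cosine similarity, and a softmax turns the scaled similarities into class probabilities. Everything specific here is an assumption for illustration only: the two prompt sentences, the three‐dimensional toy embeddings (real encoders produce much longer vectors), the function names, and the scale factor of 100. It is not the prompting method evaluated in this study.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_probs(image_emb, prompt_embs, logit_scale=100.0):
    """Softmax over scaled image-prompt similarities.

    logit_scale=100.0 is an assumed value, not one taken from the study.
    """
    logits = [logit_scale * cosine(image_emb, p) for p in prompt_embs]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical full-sentence, domain-specific prompts (illustrative only).
prompts = [
    "A crack is a narrow, line-shaped fracture on a concrete surface.",
    "Efflorescence is a white, powdery salt deposit on a concrete surface.",
]

# Toy embeddings standing in for the outputs of the image and text encoders.
image_emb = [0.9, 0.1, 0.2]
prompt_embs = [
    [1.0, 0.0, 0.1],  # placeholder embedding of prompts[0]
    [0.1, 1.0, 0.0],  # placeholder embedding of prompts[1]
]

probs = zero_shot_probs(image_emb, prompt_embs)
best = max(range(len(prompts)), key=lambda i: probs[i])
print(prompts[best], probs)
```

Because the predicted class depends only on how close each prompt embedding lies to the image embedding, rewording a prompt moves its embedding and can change the prediction, which is the sensitivity that prompt engineering tries to control.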