In a globally competitive landscape, businesses must continuously maintain and manage their facilities to sustain efficiency and productivity. Hence, a new hybrid model called the contrast enhancement convolutional vision transformer (CECvT) was developed in this study. It diagnoses faults without physical contact with factory equipment, enabling accurate initial fault detection without the risk of machine damage or interference. The model uses thermal imaging, which is well suited to early anomaly detection in equipment. A new contrast enhancement module was integrated to address the loss of edge information that occurs when thermal images are used. Moreover, network performance was improved by combining the advantages of convolutional neural network (CNN) and Transformer models. Notably, the model extracts and concatenates features at multiple scales, which yields the detailed feature information needed for initial diagnosis. The proposed method was evaluated on the thermal imaging dataset provided by AI Hub. Compared with CNN, Transformer, and hybrid CNN-Transformer models, the proposed model achieved a superior accuracy of 96.17%. Furthermore, it diagnosed abnormalities at their onset more accurately than the other networks. The proposed model is thus a promising and preferable choice for various thermal-imaging-based fault diagnosis applications owing to its excellent performance and its precision during initial diagnosis.
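The abstract does not state which contrast enhancement technique the module uses. As an illustration only, the following minimal sketch assumes global histogram equalization applied to an 8-bit grayscale thermal frame. The function name `equalize_histogram` and the synthetic `frame` are hypothetical, and this is a stand-in for the general idea of stretching low-contrast thermal intensities, not the CECvT module itself.

```python
import numpy as np


def equalize_histogram(image: np.ndarray) -> np.ndarray:
    """Global histogram equalization for an 8-bit grayscale image.

    Assumption: this generic technique stands in for the paper's
    contrast enhancement module, whose details the abstract omits.
    """
    if image.dtype != np.uint8:
        raise ValueError("expected an 8-bit grayscale image")

    # Count pixels per intensity level, then accumulate into a CDF.
    hist = np.bincount(image.ravel(), minlength=256)
    cdf = hist.cumsum()

    # CDF value at the darkest intensity level that actually occurs.
    cdf_min = cdf[cdf > 0][0]
    total = image.size
    if total == cdf_min:
        # Constant image: nothing to stretch.
        return image.copy()

    # Map each level so the occupied range spreads over 0..255.
    lut = np.round((cdf - cdf_min) / (total - cdf_min) * 255.0)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[image]


# Hypothetical low-contrast frame: intensities confined to 100..139.
rng = np.random.default_rng(0)
frame = rng.integers(100, 140, size=(64, 64), dtype=np.uint8)
enhanced = equalize_histogram(frame)
```

After equalization, the darkest occupied level maps to 0 and the brightest to 255, so intensity differences near edges span a wider range before feature extraction.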