Recent studies have shown the potential of the Data-Efficient Image Transformer (DeiT)based transfer learning method in speech/image recognition and classification utilizing models pre-trained on image datasets. However, the use of DeiT models, especially those pre-trained on image datasets, has not yet been explored for Valvular Heart Disease (VHD) detection. This paper proposes a transfer learning methodology using the DeiT model pre-trained on image datasets for VHD classification. Additionally, we introduce a hybrid Convolution-DeiT (Conv-DeiT) architecture to further improve classification performance. The Conv-DeiT framework integrates a convolutional block with a Squeeze-and-Excitation (SE) attention mechanism to enhance the channel and spatial information within the input features before processing by the DeiT model. The proposed models were assessed using the Heart Sound Murmur (HSM) database, accessible on GitHub. Experimental results show that the DeiT-based transfer learning approach achieved an overall accuracy of 97.44%. Moreover, our Conv-DeiT method outperformed the DeiT-based transfer learning with an impressive overall accuracy of 99.44%. This study indicates the effectiveness of transfer learning using DeiT models pre-trained on image datasets for heart sound classification. Specifically, our hybrid Conv-DeiT method, which combines the convolutional block and the SE-attention mechanism, demonstrates significant advantages in this context. INDEX TERMS Valvular heart diseases detection, transfer learning, DeiT, hybrid model.