Humans possess an intrinsic ability to hide their true emotions. Micro-expressions are subtle changes in facial muscles that are involuntary by nature and easy to hide. To recognize micro-expressions, several machine learning and deep learning models have been proposed in the past few years. The convolutional neural network (CNN) is a deep learning method that has been widely adopted in vision-related tasks due to its remarkable performance. However, a CNN is prone to overfitting because of its large number of trainable parameters. Additionally, a CNN cannot capture global information about an input image. Furthermore, identifying the regions that are important for classifying micro-expressions is a challenging task. The self-attention mechanism addresses these issues by focusing on key areas. Transformers specialized for images, known as vision transformers, are widely explored in vision-related applications. However, existing vision transformers divide an input image into a fixed number of patches, due to which the local correlation of image pixels is lost. Further, a vision transformer relies on the self-attention mechanism, which effectively captures global dependencies but does not exploit the local spatial relationships in an image. In this work, we propose a vision transformer based on convolution patches to overcome this problem. The proposed algorithm generates c feature maps from each input image by applying c convolution filters. These feature maps are then fed to a transformer model as fixed-size image patches to perform classification. Thus, the proposed architecture leverages the advantages of both convolutional layers and transformers, capturing local spatial information and global dependencies, respectively, leading to improved performance.
The performance of the proposed model is evaluated on three benchmark datasets, CASME-I, CASME-II, and SAMM, and compared with state-of-the-art machine learning and deep learning models. The proposed model achieves classification accuracies of 95.97%, 98.59%, and 100% on these datasets, respectively.

INDEX TERMS Facial expression recognition, deep learning, micro-expression recognition, self-attention, vision transformer.
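
The convolution-patch idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the 3x3 kernel size, the random (untrained) filters, the 8x8 input, the identity query/key/value projections, and the helper names `conv2d`, `conv_patches`, and `self_attention` are all assumptions made for illustration, and positional embeddings and the classification head are omitted. The sketch only shows how c filters can yield c feature maps that are then treated as the token sequence of a self-attention layer.

```python
import math
import random


def conv2d(image, kernel):
    """Valid 2-D cross-correlation of a 2-D list with a k x k kernel."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(k) for b in range(k))
             for j in range(w - k + 1)]
            for i in range(h - k + 1)]


def conv_patches(image, kernels):
    """Apply c filters and flatten each resulting feature map into one token."""
    tokens = []
    for kernel in kernels:
        fmap = conv2d(image, kernel)
        tokens.append([v for row in fmap for v in row])
    return tokens


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product attention.

    Assumption: Q = K = V = tokens (identity projections), so there are
    no learned weights in this sketch.
    """
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(d)
                  for key in tokens]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * tokens[j][t] for j in range(len(tokens)))
                        for t in range(d)])
    return outputs, weights


rng = random.Random(0)
image = [[rng.random() for _ in range(8)] for _ in range(8)]  # toy 8x8 input
c = 4                                                          # number of filters
kernels = [[[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
           for _ in range(c)]

tokens = conv_patches(image, kernels)       # c tokens, one per feature map
attended, weights = self_attention(tokens)  # global mixing across feature maps
print(len(tokens), len(tokens[0]))          # 4 tokens, each of length 36
```

Each token here comes from a full convolutional feature map, so local pixel correlations are encoded before attention mixes information across the c tokens. In a trained model the filters and attention projections would be learned rather than fixed.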