A long-standing challenge in pneumonia diagnosis is recognizing pathological lung texture, especially the ground-glass appearance. One main difficulty lies in precisely extracting and recognizing the pathological features. Because patients, especially those with mild symptoms, show very little difference in lung texture, neither conventional computer vision methods nor convolutional neural networks perform well on pneumonia diagnosis from chest X-ray (CXR) images. Meanwhile, the Coronavirus Disease 2019 (COVID-19) pandemic continues to wreak havoc around the world, and quick, accurate diagnosis backed by CXR images is in high demand. The classification process therefore requires extracting feature maps from the original CXR image rather than simply recognizing patterns. We propose a Vision Transformer (ViT)-based model called PneuNet that diagnoses pneumonia from lung X-ray images using channel-based attention, in which multi-head attention is applied to channel patches rather than feature patches. The techniques presented in this paper are oriented toward the medical application of deep neural networks and ViT. Extensive experimental results show that our method reaches 94.96% accuracy on the test set in the three-class classification problem, outperforming previous deep learning models.
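To make the idea of attention over channel patches concrete, the following is a minimal NumPy sketch, not PneuNet's actual implementation. It assumes a feature map of shape (C, H, W) has already been extracted, and treats each channel as one token of length H*W, so self-attention mixes information across channels instead of across spatial patches. The function names, shapes, and random projection weights are all hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def channel_tokens(feature_map):
    """Flatten a (C, H, W) feature map into C tokens of length H*W.

    Assumption: one token per channel, in contrast to a standard ViT,
    which uses one token per spatial patch.
    """
    c, h, w = feature_map.shape
    return feature_map.reshape(c, h * w)


def multi_head_self_attention(tokens, w_q, w_k, w_v, w_o, num_heads):
    """Scaled dot-product multi-head self-attention over a token sequence.

    tokens: (n, d); w_q, w_k, w_v, w_o: (d, d); d must be divisible by num_heads.
    Returns the attended tokens (n, d) and attention weights (num_heads, n, n).
    """
    n, d = tokens.shape
    head_dim = d // num_heads

    def split_heads(x):
        # (n, d) -> (num_heads, n, head_dim)
        return x.reshape(n, num_heads, head_dim).transpose(1, 0, 2)

    q = split_heads(tokens @ w_q)
    k = split_heads(tokens @ w_k)
    v = split_heads(tokens @ w_v)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, n, n)
    weights = softmax(scores, axis=-1)
    merged = (weights @ v).transpose(1, 0, 2).reshape(n, d)
    return merged @ w_o, weights


# Hypothetical sizes: 8 channels of a 4x4 feature map, 4 attention heads.
rng = np.random.default_rng(0)
feature_map = rng.standard_normal((8, 4, 4))
tokens = channel_tokens(feature_map)  # shape (8, 16)
d = tokens.shape[1]
w_q, w_k, w_v, w_o = (rng.standard_normal((d, d)) * 0.1 for _ in range(4))
attended, attn = multi_head_self_attention(tokens, w_q, w_k, w_v, w_o, num_heads=4)
```

Here each row of the attention weights describes how strongly one channel attends to every other channel; a spatial-patch variant would differ only in how the tokens are formed before the attention call.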