The COVID-19 pandemic presents significant challenges due to its high transmissibility and mortality risk. Traditional diagnostic methods, such as RT-PCR, have limitations that hinder timely and accurate screening. In response, AI-powered computer-aided imaging analysis has emerged as a promising alternative for COVID-19 diagnosis. In this paper, we propose a novel approach that combines the strengths of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) to improve the performance of COVID-19 diagnosis models. CNNs excel at capturing spatial features in medical images, while ViTs leverage self-attention mechanisms reminiscent of how human radiologists read images. Our approach also draws inspiration from subclinical diagnosis, a collaborative process involving attending physicians and specialists that has proven effective in achieving accurate and comprehensive diagnoses. To this end, we employ an early fusion strategy that integrates CNN and ViT features and feeds the fused representation into a residual neural network. By fusing these complementary features, our approach achieves state-of-the-art performance in identifying COVID-19 cases on two benchmark datasets: Chest X-ray and Clean-CC-CCII. This research has the potential to enable timely and accurate screening, aiding the early detection and management of COVID-19 cases. Our findings contribute to the growing body of knowledge on AI-powered diagnostic techniques and demonstrate the potential of advanced imaging analysis methods to support medical professionals in combating the ongoing pandemic.
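To make the fusion idea concrete, the following is a minimal NumPy sketch and not the implementation used in this work. It assumes that each backbone yields one feature vector per image, that early fusion means concatenating those vectors, and that the residual stage can be approximated by a single two-layer block with an identity skip connection. The feature dimensions (512 and 768), the hidden width (256), the two-class output, and all function names are hypothetical, and random arrays stand in for real CNN and ViT outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def early_fusion(cnn_feats, vit_feats):
    """Concatenate per-image feature vectors along the feature axis."""
    return np.concatenate([cnn_feats, vit_feats], axis=-1)

def residual_block(x, w1, w2):
    """Return x + relu(x @ w1) @ w2, a two-layer block with an identity skip."""
    hidden = np.maximum(x @ w1, 0.0)
    return x + hidden @ w2

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical sizes; random values stand in for real backbone features.
batch, d_cnn, d_vit, d_hidden, n_classes = 4, 512, 768, 256, 2
cnn_feats = rng.standard_normal((batch, d_cnn))
vit_feats = rng.standard_normal((batch, d_vit))

fused = early_fusion(cnn_feats, vit_feats)   # shape (batch, d_cnn + d_vit)
d = fused.shape[-1]

# Untrained, randomly initialised weights (assumption: illustration only).
w1 = rng.standard_normal((d, d_hidden)) * 0.01
w2 = rng.standard_normal((d_hidden, d)) * 0.01
w_out = rng.standard_normal((d, n_classes)) * 0.01

probs = softmax(residual_block(fused, w1, w2) @ w_out)  # (batch, n_classes)
```

Concatenation keeps both feature sets intact, so the residual stage can learn how to weight them, while the skip connection passes the fused features through unchanged when the learned correction is small.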