COVID-Transformer: Interpretable COVID-19 Detection Using Vision Transformer for Healthcare

Shome, Debaditya; Kar, T.; Mohanty, Sachi Nandan; Tiwari, Prayag; Muhammad, Khan; AlTameem, Abdullah; Zhang, Yazhou; Saudagar, Abdul Khader Jilani

doi:10.3390/ijerph182111086

Cited by 98 publications

(46 citation statements)

References 37 publications

“…Some new COVID-19 detection algorithm based on the ViT architecture has been proposed in a few research projects. Shome et al [28] built a dataset of 30,000 images and trained the ViT model on it. The trained model performed better than CNN, such as EfficientNet-B0, Inception-V3, and ResNet-50 in a multi-classification challenge, with 92% accuracy and 98% AUC.…”

Section: Related Work (mentioning)

confidence: 99%

STCovidNet: Automatic Detection Model of Novel Coronavirus Pneumonia Based on Swin Transformer

Wang, Zhang, Tian (2022, preprint)

The novel coronavirus disease 2019 (COVID-19) has emerged as an enormous challenge facing China today. Preventive Medicine physicians and Artificial Intelligence (AI) researchers try to improve the ability to early automatic warning of coronavirus infections, promote epidemic prevention, and reduce medical costs using deep learning methods. In this work, we build an extensive database of chest computed tomography (CT) scans with image data from domestic and international open-source medical datasets. Swin Transformer is chosen as the backbone network to establish a model (STCovidNet) for the prediction of COVID-19. We then compare the performance of our technique against that of Vision Transformer (ViT) and Convolutional Neural Network (CNN). Next, to visualize our model's high-dimensional outputs in 2-dimensional space, we apply t-distributed stochastic neighbor embedding (t-SNE) as the dimension-reduction strategy. Finally, we employ gradient-weighted class activation mapping (Grad-CAM) to present a class activation map. The results indicate that STCovidNet’s performance surpasses ViT and CNN with a 0.9811 AUC and 0.9858 accuracy score. Our network outperforms previous techniques to reduce intra-class variability and generate well-separated feature embedding. The CAM figure illustrates that the decision region corresponds to radiologists' detecting spots. The suggested method can be an effective way of catching COVID-19 instances.

“…Krishnan et al [17] and Park et al [39] utilize ViT-based models to achieve higher COVID-19 classification accuracy through CXR images. COVID-Transformer [48] and xViTCOS [33] have been proposed to further improve classification accuracy and focus on diagnosis-related regions. However, there is still much room for improvement to train ViT models in a small dataset, such as medical imaging dataset.…”

Section: Vision Transformer (mentioning)

confidence: 99%

Eye-gaze-guided Vision Transformer for Rectifying Shortcut Learning

Ma, Zhang, Chen et al. (2022, preprint)

Learning harmful shortcuts such as spurious correlations and biases prevents deep neural networks from learning the meaningful and useful representations, thus jeopardizing the generalizability and interpretability of the learned representation. The situation becomes even more serious in medical imaging, where the clinical data (e.g., MR images with pathology) are limited and scarce while the reliability, generalizability and transparency of the learned model are highly required. To address this problem, we propose to infuse human experts' intelligence and domain knowledge into the training of deep neural networks. The core idea is that we infuse the visual attention information from expert radiologists to proactively guide the deep model to focus on regions with potential pathology and avoid being trapped in learning harmful shortcuts. To do so, we propose a novel eye-gaze-guided vision transformer (EG-ViT) for diagnosis with limited medical image data. We mask the input image patches that are out of the radiologists' interest and add an additional residual connection in the last encoder layer of EG-ViT to maintain the correlations of all patches. The experiments on two public datasets of INbreast and SIIM-ACR demonstrate our EG-ViT model can effectively learn/transfer experts' domain knowledge and achieve much better performance than baselines. Meanwhile, it successfully rectifies the harmful shortcut learning and significantly improves the EG-ViT model's interpretability. In general, EG-ViT takes the advantages of both human expert's prior knowledge and the power of deep neural networks. This work opens new avenues for advancing current artificial intelligence paradigms by infusing human intelligence.
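
The patch-masking step described in this abstract can be illustrated with a minimal sketch. The abstract does not say how gaze data are mapped to patches, so the rule used here (a patch is kept if at least one fixation point falls inside it) is an assumption, as are the function name `mask_patches`, the toy 4x4 image, and the patch size of 2. The extra residual connection in the last encoder layer is not modeled.

```python
def mask_patches(image, gaze_points, patch):
    """Zero out every patch x patch tile that contains no gaze point.

    image       -- 2-D list of pixel values (hypothetical toy input)
    gaze_points -- list of (row, col) fixation coordinates (assumed format)
    patch       -- side length of a square patch, in pixels
    """
    height, width = len(image), len(image[0])
    # Patch indices that received at least one fixation (assumed keep rule).
    keep = {(r // patch, c // patch) for r, c in gaze_points}
    return [
        [image[r][c] if (r // patch, c // patch) in keep else 0
         for c in range(width)]
        for r in range(height)
    ]


# Toy 4x4 image holding the values 1..16, split into four 2x2 patches.
image = [[r * 4 + c + 1 for c in range(4)] for r in range(4)]
gaze = [(0, 1), (3, 3)]  # fixations land in the top-left and bottom-right patches
masked = mask_patches(image, gaze, 2)
for row in masked:
    print(row)
```

Only the two patches that contain a fixation keep their pixel values; the other two are set to zero before the image would be passed on to a model.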

“…Recently, Vision Transformers (ViTs) (Zhai et al, 2021) with built-in self-attention mechanisms have demonstrated comparable performance to CNNs in natural and medical visual recognition tasks, while requiring fewer computational resources. Several studies (Liu and Yin, 2021; Shome et al, 2021; Park et al, 2022) used ViTs to improve pulmonary disease detection in frontal CXRs to detect manifestations consistent with COVID-19 disease. Another study (Duong et al, 2021) used a ViT model to detect TB-consistent findings in frontal CXRs and obtained an accuracy of 97.72%.…”

Section: Introduction (mentioning)

confidence: 99%

Detecting Tuberculosis-Consistent Findings in Lateral Chest X-Rays Using an Ensemble of CNNs and Vision Transformers

2022

Research on detecting Tuberculosis (TB) findings on chest radiographs (or Chest X-rays: CXR) using convolutional neural networks (CNNs) has demonstrated superior performance due to the emergence of publicly available, large-scale datasets with expert annotations and availability of scalable computational resources. However, these studies use only the frontal CXR projections, i.e., the posterior-anterior (PA), and the anterior-posterior (AP) views for analysis and decision-making. Lateral CXRs which are heretofore not studied help detect clinically suspected pulmonary TB, particularly in children. Further, Vision Transformers (ViTs) with built-in self-attention mechanisms have recently emerged as a viable alternative to the traditional CNNs. Although ViTs demonstrated notable performance in several medical image analysis tasks, potential limitations exist in terms of performance and computational efficiency, between the CNN and ViT models, necessitating a comprehensive analysis to select appropriate models for the problem under study. This study aims to detect TB-consistent findings in lateral CXRs by constructing an ensemble of the CNN and ViT models. Several models are trained on lateral CXR data extracted from two large public collections to transfer modality-specific knowledge and fine-tune them for detecting findings consistent with TB. We observed that the weighted averaging ensemble of the predictions of CNN and ViT models using the optimal weights computed with the Sequential Least-Squares Quadratic Programming method delivered significantly superior performance (MCC: 0.8136, 95% confidence intervals (CI): 0.7394, 0.8878, p < 0.05) compared to the individual models and other ensembles. We also interpreted the decisions of CNN and ViT models using class-selective relevance maps and attention maps, respectively, and combined them to highlight the discriminative image regions contributing to the final output. 
We observed that (i) the model accuracy is not related to disease region of interest (ROI) localization and (ii) the bitwise-AND of the heatmaps of the top-2-performing models delivered significantly superior ROI localization performance in terms of mean average precision [mAP@(0.1 0.6) = 0.1820, 95% CI: 0.0771,0.2869, p < 0.05], compared to other individual models and ensembles. The code is available at https://github.com/sivaramakrishnan-rajaraman/Ensemble-of-CNN-and-ViT-for-TB-detection-in-lateral-CXR.
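
The bitwise-AND of two models' heatmaps mentioned in this abstract can be sketched as follows. The abstract does not state how the heatmaps are binarized before they are combined, so the fixed threshold of 0.5, the helper names `binarize` and `and_masks`, and the toy 3x3 heatmap values are all assumptions made for illustration. The authors' actual code is in the repository linked in the abstract.

```python
def binarize(heatmap, threshold):
    """Turn a 2-D heatmap of floats into a 0/1 mask (threshold is assumed)."""
    return [[1 if value >= threshold else 0 for value in row] for row in heatmap]


def and_masks(mask_a, mask_b):
    """Element-wise bitwise AND of two equally sized 0/1 masks."""
    return [[a & b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(mask_a, mask_b)]


# Hypothetical heatmaps from two models for the same 3x3 image grid.
cnn_heatmap = [[0.9, 0.2, 0.1],
               [0.8, 0.7, 0.0],
               [0.1, 0.6, 0.3]]
vit_heatmap = [[0.7, 0.6, 0.0],
               [0.9, 0.1, 0.2],
               [0.0, 0.8, 0.1]]

combined = and_masks(binarize(cnn_heatmap, 0.5), binarize(vit_heatmap, 0.5))
for row in combined:
    print(row)
```

A cell survives in the combined mask only when both models mark it as relevant, which narrows the highlighted region to the area the two models agree on.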
