Background
Tuberculosis (TB) was the leading infectious cause of mortality globally prior to COVID-19 and chest radiography has an important role in the detection, and subsequent diagnosis, of patients with this disease. The conventional experts reading has substantial within- and between-observer variability, indicating poor reliability of human readers. Substantial efforts have been made in utilizing various artificial intelligence–based algorithms to address the limitations of human reading of chest radiographs for diagnosing TB.
Objective
This systematic literature review (SLR) aims to assess the performance of machine learning (ML) and deep learning (DL) in the detection of TB using chest radiography (chest x-ray [CXR]).
Methods
In conducting and reporting the SLR, we followed the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines. A total of 309 records were identified from Scopus, PubMed, and IEEE (Institute of Electrical and Electronics Engineers) databases. We independently screened, reviewed, and assessed all available records and included 47 studies that met the inclusion criteria in this SLR. We also performed the risk of bias assessment using Quality Assessment of Diagnostic Accuracy Studies version 2 (QUADAS-2) and meta-analysis of 10 included studies that provided confusion matrix results.
Results
Various CXR data sets have been used in the included studies, with 2 of the most popular ones being Montgomery County (n=29) and Shenzhen (n=36) data sets. DL (n=34) was more commonly used than ML (n=7) in the included studies. Most studies used human radiologist’s report as the reference standard. Support vector machine (n=5), k-nearest neighbors (n=3), and random forest (n=2) were the most popular ML approaches. Meanwhile, convolutional neural networks were the most commonly used DL techniques, with the 4 most popular applications being ResNet-50 (n=11), VGG-16 (n=8), VGG-19 (n=7), and AlexNet (n=6). Four performance metrics were popularly used, namely, accuracy (n=35), area under the curve (AUC; n=34), sensitivity (n=27), and specificity (n=23). In terms of the performance results, ML showed higher accuracy (mean ~93.71%) and sensitivity (mean ~92.55%), while on average DL models achieved better AUC (mean ~92.12%) and specificity (mean ~91.54%). Based on data from 10 studies that provided confusion matrix results, we estimated the pooled sensitivity and specificity of ML and DL methods to be 0.9857 (95% CI 0.9477-1.00) and 0.9805 (95% CI 0.9255-1.00), respectively. From the risk of bias assessment, 17 studies were regarded as having unclear risks for the reference standard aspect and 6 studies were regarded as having unclear risks for the flow and timing aspect. Only 2 included studies had built applications based on the proposed solutions.
Conclusions
Findings from this SLR confirm the high potential of both ML and DL for TB detection using CXR. Future studies need to pay a close attention on 2 aspects of risk of bias, namely, the reference standard and the flow and timing aspects.
Trial Registration
PROSPERO CRD42021277155; https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=277155