EGFR and KRAS are the most frequently mutated genes in lung cancer and are active research topics in targeted therapy. Biopsy is the traditional method for genetically characterising a tumour. However, it is a risky procedure that is painful for the patient, and the tumour is occasionally inaccessible. This work aims to study and debate the nature of the relationships between imaging phenotypes and lung cancer-related mutation status. Until now, the literature has failed to point to new research directions, consisting mainly of results-oriented works in a field where there is still not enough available data to train clinically viable models. We intend to open a discussion about critical points and to present new possibilities for future radiogenomics studies. We conducted high-dimensional data visualisation and developed classifiers, which allowed us to analyse the results for the EGFR and KRAS biological markers according to different combinations of input features. We show that EGFR mutation status might be correlated with CT imaging phenotypes; however, the same does not seem to hold true for KRAS mutation status. The experiments also suggest that the best way to approach this problem is to combine nodule-related features with features from other lung structures.
The training data is balanced individually for each fold using SMOTE-NC, avoiding data leakage. After parameter optimisation, the probabilistic outputs of each model with its optimal parameters were analysed using the area under the Receiver Operating Characteristic (ROC) curve (AUC). The ROC curve is a probability curve, and the AUC measures separability: it indicates how well the model distinguishes between classes. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR), usually with TPR on the y-axis and FPR on the x-axis.
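To make the ROC and AUC description concrete, the following is a minimal sketch in plain Python, not the implementation used in this work. It assumes binary labels (1 for the mutant class, 0 for wild type) and one probabilistic score per sample. The function names `roc_points` and `auc`, and the toy labels and scores, are illustrative assumptions.

```python
def roc_points(y_true, scores):
    """Sweep the decision threshold from high to low and return (fpr, tpr) lists."""
    pos = sum(1 for y in y_true if y == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present to build a ROC curve")

    # Visit samples in order of decreasing score.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])

    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    prev_score = None
    for i in order:
        # Emit a point only when the score changes, so tied scores share one threshold.
        if prev_score is not None and scores[i] != prev_score:
            fpr.append(fp / neg)
            tpr.append(tp / pos)
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        prev_score = scores[i]

    # Final point: every sample is predicted positive.
    fpr.append(fp / neg)
    tpr.append(tp / pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2.0
        for k in range(1, len(fpr))
    )


# Toy data (illustrative only, not from the study).
labels = [0, 0, 1, 1]
probabilities = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(labels, probabilities)
print(auc(fpr, tpr))  # 0.75
```

An AUC of 0.5 corresponds to scores that carry no information about the class, while 1.0 corresponds to perfect separation.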
Experimental Design

We designed four experiments to test and compare which types of input features achieve better performance in gene mutation status prediction. We first trained a model that took nodule-related radiomic features as input. Then, for direct comparison and to allow a modular evaluation, we split the semantic data into three parts: nodule, non-nodule and hybrid. The first contains only nodular information, the second contains only information external to the nodule, and the third is the combination of both. The split can be seen in detail in Table 2 of the supplementary material.
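The modular split of the semantic data can be sketched as follows. This is only an illustration under stated assumptions: the feature names are hypothetical placeholders (the actual assignment is the one given in Table 2 of the supplementary material), and the helpers `feature_names` and `select_features` are invented for this example.

```python
# Hypothetical semantic feature names, for illustration only.
# The real nodule / non-nodule assignment is listed in the supplementary material.
SEMANTIC_GROUPS = {
    "nodule": ["nodule_type", "nodule_margin", "nodule_attenuation"],
    "non_nodule": ["emphysema", "fibrosis", "pleural_effusion"],
}


def feature_names(subset):
    """Return the semantic feature names used by one experiment."""
    if subset == "hybrid":
        # The hybrid set is the combination of nodule and non-nodule features.
        return SEMANTIC_GROUPS["nodule"] + SEMANTIC_GROUPS["non_nodule"]
    if subset in SEMANTIC_GROUPS:
        return list(SEMANTIC_GROUPS[subset])
    raise ValueError(f"unknown subset: {subset!r}")


def select_features(record, subset):
    """Keep only the features of `record` that belong to `subset`."""
    return {name: record[name] for name in feature_names(subset)}


# One made-up patient record with categorical semantic annotations.
patient = {
    "nodule_type": "solid",
    "nodule_margin": "spiculated",
    "nodule_attenuation": "soft tissue",
    "emphysema": "present",
    "fibrosis": "absent",
    "pleural_effusion": "absent",
}

for subset in ("nodule", "non_nodule", "hybrid"):
    print(subset, sorted(select_features(patient, subset)))
```

Keeping the groups in a single mapping means each semantic experiment differs only in the `subset` argument, which keeps the comparison between input types direct.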