Clustering is the task of identifying groups of similar subjects according to certain criteria. The AJCC staging system can be thought of as a clustering mechanism that groups patients based on their disease stage. This grouping drives prognosis and influences treatment. The goal of this work is to evaluate the efficacy of machine learning algorithms in clustering patients into discriminative groups to improve prognosis for overall survival (OS) and relapse-free survival (RFS) outcomes. We apply clustering to retrospectively collected data from 644 head and neck cancer patients, including both clinical and radiomic features. To incorporate outcome information into the clustering process and deal with the large proportion of censored samples, the feature space was scaled using regression coefficients fitted to a proxy dependent variable, martingale residuals, instead of follow-up time. Two clusters were identified and evaluated using cross-validation. The Kaplan-Meier (KM) curves of the two clusters differ significantly for OS and RFS (p-value < 0.0001). Moreover, adding the cluster label to the clinical features yielded a relative predictive improvement over using clinical features alone: AUC increased by 5.7% and 13.0% for OS and RFS, respectively.
Every year, over 50,000 new cases of head and neck cancer are diagnosed in the United States. This number is projected to rise in the future, especially for oropharyngeal cancers, which have recently been associated with the incidence of HPV16 genotype infections [1]. The American Joint Committee on Cancer (AJCC) and the Union for International Cancer Control maintain an internationally used, standardized TNM Staging System. This system provides a systematic way to assess the severity of cancer in individual subjects [2].
The vast majority of risk stratification of head and neck cancer patients uses staging systems that subclassify patients into four or fewer groups, based primarily on committee-derived treatment standards and approaches using existing data sets. These consider physical examinations, imaging and laboratory tests, pathology and surgical reports, etc. Establishing the AJCC stage for a patient takes into account several important anatomic classifications and other risk factors that contribute to the overall assessment, namely the T, N, and M Categories. The T Category relates to the extent of the primary tumor, the N Category relates to the spread to lymph nodes, and the M Category indicates spread outside the T- and N-related areas. These classifications play a critical role in the ultimate diagnosis and prognosis. Assessing the underlying condition more accurately, so that predictions of various outcomes improve, is a long-standing clinical goal. In the era of personalized cancer medicine, innovative sources of meaningful data are critically needed. For head and neck cancer, radiomics is one such "big data" approach that applies advanced image refining/data characterization algorithms to generate imaging features t...
Survival outcomes, such as overall survival or recurrence-free survival, are called right-censored because for many patients the event has not yet occurred at the last follow-up time. With an increased number of available features, a relatively small number of patients, and an even smaller number of events, dimensionality reduction is needed to reduce the sparsity of the data and make standard approaches such as the Cox Proportional Hazards (Cox) model effective. Clustering is used to identify similar groups within the data and can be thought of as a dimensionality reduction technique when the cluster label is used in the analysis. Our goal is to identify similar groups of patients that exhibit the same response to treatment or expected outcomes in order to improve the prediction accuracy for new patients.
In this thesis, we explore different ways of leveraging clustering for improved prognosis for head and neck cancer patients. To circumvent the right-censoring of survival outcomes, we use the residuals from a Cox model as the dependent variable for guiding the clustering of the data. We propose two approaches. The first one, Supervised Scaled Clustering (SSC), uses the residuals to scale the features using a regression model before clustering the patients using K-medians and consensus clustering. The second one, Supervised Domain Clustering (SDC), considers groups of features and uses the residuals to learn the most suitable dissimilarity for clustering. Cluster labels are then used as covariates within a Cox model and/or other survival models. A rigorous experimental evaluation summarizes, compares, and contrasts different metrics for model comparison and performance evaluation.
Results show that our approaches find significantly discriminative groupings with respect to the outcomes, and can serve as a feature extraction method that can improve performance while considerably reducing the dimensionality of the original feature space.
PUBLIC ABSTRACT
Survival outcomes, such as overall survival or recurrence-free survival, are called right-censored because for many patients the event has not yet occurred at the last follow-up time. With an increasing number of potential risk factors available that can aid in improving prognosis, standard statistical modeling approaches such as Cox Proportional Hazards may not be as effective in incorporating them.
Clustering is a machine learning task with the ultimate goal of identifying similar groups within the data, effectively condensing multiple risk factors into a single cluster label. In this manner we are able to summarize the increasing number of risk factors and find labels that identify non-obvious, yet salient, similarities that result from considering these multiple risk factors simultaneously. Once one or more groupings have been identified, we evaluate how well these groupings discriminate the survival outcomes of interest. Finally, we incorporate clustering into standard approaches for risk modeling and evaluate and quantify the improvement in prognosis.
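The Supervised Scaled Clustering idea described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: the martingale residuals come from a covariate-free (null) Cox model, where they reduce to the event indicator minus the Nelson-Aalen cumulative hazard; the regression is a lightly ridge-regularized least-squares fit; each standardized feature is scaled by the absolute value of its coefficient; and the consensus clustering step is omitted. The function names `martingale_residuals`, `scale_features`, and `k_medians` are hypothetical.

```python
import numpy as np


def martingale_residuals(time, event):
    """Null-model martingale residuals: event_i - H(t_i), where H is the
    Nelson-Aalen cumulative hazard (assumption: no covariates in the model)."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=float)
    uniq = np.unique(time)  # sorted distinct follow-up times
    at_risk = np.array([(time >= t).sum() for t in uniq])
    deaths = np.array([event[time == t].sum() for t in uniq])
    cum_hazard = np.cumsum(deaths / at_risk)
    idx = np.searchsorted(uniq, time)  # index of each t_i within uniq
    return event - cum_hazard[idx]


def scale_features(X, residuals, ridge=1e-3):
    """Standardize X, regress the residuals on it, and scale each column by
    |coefficient| (assumed scaling rule; the source does not give details)."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns
    Z = (X - X.mean(axis=0)) / std
    r = np.asarray(residuals, dtype=float)
    r = r - r.mean()
    p = Z.shape[1]
    beta = np.linalg.solve(Z.T @ Z + ridge * np.eye(p), Z.T @ r)
    return Z * np.abs(beta), beta


def k_medians(X, k=2, n_iter=100, seed=0):
    """Plain K-medians: L1 assignment step, coordinate-wise median update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for it in range(n_iter):
        dist = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2)
        new_labels = dist.argmin(axis=1)
        if it > 0 and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = X[labels == c]
            if len(members):  # keep the old center if a cluster is empty
                centers[c] = np.median(members, axis=0)
    return labels, centers


if __name__ == "__main__":
    # Synthetic data for illustration only; not the study's cohort.
    rng = np.random.default_rng(42)
    X = rng.normal(size=(200, 5))
    time = rng.exponential(scale=np.exp(-X[:, 0]))
    event = (rng.random(200) < 0.5).astype(int)  # roughly 50% censoring
    res = martingale_residuals(time, event)
    X_scaled, beta = scale_features(X, res)
    labels, _ = k_medians(X_scaled, k=2)
    print("cluster sizes:", np.bincount(labels, minlength=2))
```

Scaling by the coefficients stretches the axes that are associated with the outcome and shrinks the rest, so the L1 distances used by K-medians are dominated by prognostically relevant features. The resulting label could then be added as a covariate in a survival model, as the abstract describes.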