Classification is an important tool with many useful applications. Fisher's linear discriminant analysis (LDA) is a traditional model-based classification method that makes use of Gaussian distributional information. However, in the high-dimensional, low-sample-size setting, LDA cannot be directly deployed because the sample covariance matrix is not invertible. While there are modern methods for high-dimensional data, they may not use the distributional information as fully as LDA does. Hence, in some situations, it is still desirable to use a model-based method for classification. This paper exploits the potential of LDA in a more complicated data setting. In many real applications, it is costly to manually label observations; consequently, often only a small portion of the data is labeled while a large number of observations are left without labels. It is a great challenge to obtain good classification performance from the labeled data alone, especially in the high-dimensional setting. To overcome this issue, we propose a semisupervised sparse LDA classifier that takes advantage of the seemingly useless unlabeled data, which helps to boost classification performance in some situations. A direct estimation method is used to reconstruct LDA and achieve sparsity; meanwhile, we employ the difference-convex algorithm to handle the nonconvex loss function associated with the unlabeled data. Theoretical properties of the proposed classifier are studied. Our simulated examples help to understand when and how the information extracted from the unlabeled data can be useful. A real data example further illustrates the usefulness of the proposed method.
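For reference, the textbook two-class LDA rule and the reason it breaks down in high dimensions can be written as below. The notation (class means \mu_k, common covariance \Sigma, class priors \pi_k, sample size n, dimension p) is assumed here for illustration and is not taken from the paper; this is the classical rule, not the proposed semisupervised sparse estimator.

```latex
\[
\beta = \Sigma^{-1}(\mu_1 - \mu_2), \qquad
\text{assign } x \text{ to class 1 if }
\Bigl(x - \tfrac{\mu_1 + \mu_2}{2}\Bigr)^{\top} \beta > \log\frac{\pi_2}{\pi_1}.
\]
\[
\hat{\Sigma} = \frac{1}{n-2}\sum_{k=1}^{2}\sum_{i:\,y_i = k}
(x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^{\top},
\qquad
\operatorname{rank}(\hat{\Sigma}) \le n - 2 .
\]
```

Because the pooled sample covariance has rank at most n - 2, it is singular whenever p > n - 2, so the plug-in estimate of \beta is undefined in the high-dimensional, low-sample-size setting.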
Anastomotic leakage is a life-threatening complication in patients with gastric adenocarcinoma who received total or proximal gastrectomy, and there is still no model that accurately predicts anastomotic leakage. In this study, we aim to develop a high-performance machine learning tool to predict anastomotic leakage in patients with gastric adenocarcinoma who received total or proximal gastrectomy. A total of 1660 patients with gastric adenocarcinoma who received total or proximal gastrectomy in a large academic hospital from 1 January 2010 to 31 December 2019 were investigated, and these patients were randomly divided into training and testing sets at a ratio of 8:2. Four machine learning models (logistic regression, random forest, support vector machine, and XGBoost) were employed, and 24 clinical preoperative and intraoperative variables were included to develop the predictive model. In terms of the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and accuracy, random forest performed favorably, with an AUC of 0.89, a sensitivity of 81.8%, and a specificity of 82.2% in the testing set. Moreover, we built a web app based on the random forest model to provide real-time predictions for guiding surgeons’ intraoperative decision making.
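The evaluation quantities named above can be made concrete with a minimal sketch. It shows an 8:2 random split of 1660 indices and computes AUC, sensitivity, specificity, PPV, NPV, and accuracy from predicted scores. All function names, the 0.5 threshold, the seed, and the toy labels and scores are assumptions for illustration; they are not the study's data, models, or code.

```python
import random


def split_indices(n, train_fraction=0.8, seed=0):
    """Randomly split range(n) into train/test index lists (8:2 by default)."""
    rng = random.Random(seed)  # the seed is an arbitrary assumption
    idx = list(range(n))
    rng.shuffle(idx)
    cut = int(round(n * train_fraction))
    return idx[:cut], idx[cut:]


def confusion_metrics(y_true, y_pred):
    """Threshold-based metrics; assumes both classes occur and both are predicted."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / len(pairs),
    }


def auc(y_true, scores):
    """AUC as the probability that a positive case outscores a negative one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# 8:2 split of 1660 patient indices, matching the sample size in the abstract.
train_idx, test_idx = split_indices(1660)

# Toy labels (1 = leakage) and model scores, invented for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 cutoff is an assumption

metrics = confusion_metrics(y_true, y_pred)
auc_value = auc(y_true, scores)
```

AUC does not depend on the cutoff, while the other five metrics do, which is why a model is usually reported with both.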
Classification and clustering are both important topics in statistical learning. A natural question is whether predefined classes are really different from one another, or whether clusters are really there. Specifically, we may be interested in knowing whether the two classes defined by some class labels (when they are provided), or the two clusters tagged by a clustering algorithm (when class labels are not provided), are from the same underlying distribution. Although both are challenging questions for high-dimensional, low-sample-size data, there has been some recent development on both. However, when it is costly to manually label observations, often only a small portion of the class labels is available. In this article, we propose a significance analysis approach for this type of data, namely partially labeled data. Our method makes use of the whole data set and tries to test the class difference as if all the labels were observed. Compared to a testing method that ignores the label information, our method provides greater power while maintaining the size, as illustrated by a comprehensive simulation study. Theoretical properties of the proposed method are studied with emphasis on the high-dimensional, low-sample-size setting. Our simulated examples help to understand when and how the information extracted from the labeled data can be effective. A real data example further illustrates the usefulness of the proposed method.