Breast cancer is among the most common and deadliest cancers worldwide. This paper builds several machine learning models (XGBoost, random forest, logistic regression, and K-nearest neighbor) to classify and predict breast cancer, with the aim of providing a reference for early diagnosis. In medical diagnosis, recall measures the proportion of malignant cases that are correctly detected, which is critical for breast cancer classification. Recall is therefore taken as the primary evaluation index, with precision, accuracy, and F1-score used as additional indicators to evaluate and compare the predictive performance of each model. To remove the effect of differing feature scales on the models, the data are standardized. To find an optimal feature subset and improve model accuracy, 15 features were selected as model inputs using Pearson correlation analysis. For the K-nearest neighbor model, the optimal value of k is chosen by cross-validation with recall as the evaluation index. To address the imbalance between positive and negative samples, stratified sampling is used to draw the training and test sets in proportion to each class. The experimental results show that the predictive performance of a given model changes with the dataset split (8:2 versus 7:3). Comparative analysis shows that the XGBoost model built in this paper, with an 8:2 split between training and test sets, performs best, with recall, precision, accuracy, and F1-score of 1.00, 0.960, 0.974, and 0.980, respectively.
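The stratified split and the four evaluation indicators described above can be illustrated with a minimal sketch. It uses only the Python standard library; the function names `stratified_split` and `binary_metrics`, the toy labels, and the choice of label 1 as the positive (malignant) class are assumptions made for illustration, not the authors' code or data.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, seed=0):
    """Split sample indices so each class keeps roughly its share in both parts.

    Hypothetical helper: returns (train_indices, test_indices).
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def binary_metrics(y_true, y_pred, positive=1):
    """Compute recall, precision, accuracy, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy = (tp + tn) / len(pairs)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"recall": recall, "precision": precision,
            "accuracy": accuracy, "f1": f1}


if __name__ == "__main__":
    # Toy, imbalanced labels (assumed): 40 positive and 60 negative samples.
    toy_labels = [1] * 40 + [0] * 60
    train, test = stratified_split(toy_labels, test_fraction=0.2)
    print(len(train), len(test))  # an 8:2 split of 100 samples

    # Toy predictions with one false positive and no false negatives.
    print(binary_metrics([1, 1, 1, 0, 0, 0, 0, 0],
                         [1, 1, 1, 1, 0, 0, 0, 0]))
```

A recall of 1.00 means no positive sample was missed (no false negatives), which is why a model can reach perfect recall while precision and accuracy stay below 1.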
A new density peak clustering (DPC) algorithm with adaptive clustering centers based on differential privacy is proposed to address three problems in clustering analysis: poor adaptability to high-dimensional data, the inability to determine clustering centers automatically, and privacy leakage. First, to improve adaptability to high-dimensional data, cosine distance is used to measure the similarity between high-dimensional data points. Then, to remove the subjectivity of clustering center selection, a weight of (i − 1)/i is introduced from the perspective of the ranking graph, and the slope trend of the ranking graph is redefined so that clustering centers are selected adaptively. Finally, to address the privacy problem, Laplacian noise with an appropriate privacy budget is added to the core statistic of the algorithm (the local density) to balance privacy protection against algorithm effectiveness. Experimental results on both synthetic and UCI datasets show that the algorithm not only selects clustering centers automatically but also addresses the privacy problem in clustering analysis and greatly improves the clustering evaluation indices, which demonstrates its effectiveness.
INDEX TERMS Cosine distance, differential privacy, DPC algorithm, Laplacian noise, trend of slope change.
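As a rough illustration of the first and last steps above, the sketch below computes a cutoff-kernel local density under cosine distance and then perturbs it with Laplace noise of scale sensitivity/epsilon. Everything beyond those two ideas is an assumption: the function names, the cutoff-kernel form of the density, and the sensitivity value of 1 are illustrative choices, not details taken from the abstract. The adaptive center selection with the (i − 1)/i weight is not sketched, since the abstract does not specify it.

```python
import math
import random


def cosine_distance(a, b):
    """Cosine distance 1 - cos(a, b) between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def local_density(points, cutoff):
    """Cutoff-kernel density: how many other points lie within `cutoff`."""
    n = len(points)
    rho = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if cosine_distance(points[i], points[j]) < cutoff:
                rho[i] += 1
                rho[j] += 1
    return rho


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_local_density(points, cutoff, epsilon, sensitivity=1.0, seed=0):
    """Add Laplace noise of scale sensitivity/epsilon to each local density.

    The sensitivity of 1.0 is an assumed value for illustration only.
    """
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    return [rho + laplace_noise(scale, rng)
            for rho in local_density(points, cutoff)]


if __name__ == "__main__":
    # Toy data (assumed): two pairs of nearly parallel vectors.
    toy_points = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
    print(local_density(toy_points, cutoff=0.1))
    print(private_local_density(toy_points, cutoff=0.1, epsilon=1.0))
```

A smaller privacy budget epsilon gives a larger noise scale and therefore stronger privacy but noisier densities, which is the trade-off between privacy protection and clustering quality that the abstract refers to.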