Privacy-preserving in data mining refers to the area of data mining that seeks to safeguard sensitive information from unsolicited or unsanctioned disclosure and hence protecting individual data records and their privacy. Data perturbation is a privacy preservation technique which does addition / multiplication of noise to the original data. It performs anonymization based on the data type of sensitive data. Generalization is a technique were quasi identifiers data are replaced by some other more general term. In this paper privacy protection is applied to high dimensional datasets like Adult and Census. For ranking the attributes, information gain feature subset selection method is used. The high ranking attributes with sensitive information are set as quasi identifiers of the datasets. A hybrid perturbation technique is used to perturb categorical and numeric attributes of both the datasets and the utility of the datasets is measured using accuracy on data mining functionalities. The data distortion is measured using maintenance of Rank of Features (CK) between the original and perturb datasets. Experimental results show that utility of the perturbed datasets comparable with the original dataset and the Census dataset has comparable CK value than adult dataset.
The Internet has become an important resource for mankind. Explicitly information security is an interminable domain to the present world. Hence a more potent Intrusion Detection System (IDS) should be built. Machine Learning techniques are used in developing proficient models for IDS. Imbalanced Learning is a crucial task for many classification processes. Resampling training data towards a more balanced distribution is an effective way to combat this issue. There are most prevalent techniques like under sampling and oversampling.In this paper, the issues of imbalanced data distribution and high dimensionality are addressed using a novel oversampling technique and an innovative feature selection method respectively. Our work suggests a novel hybrid algorithm, HOK-SMOTE which considers an ordered weighted averaging (OWA) approach for choosing the best features from the KDD cup 99 data set and K-Means SMOTE for imbalanced learning. Here an ensemble model is compared against the hybrid algorithm. This ensemble integrates Support Vector Machine (SVM), K Nearest Neighbor (KNN), Gaussian Naïve Bayes (GNB) and Decision Tree (DT). Then weighted average voting is applied for prediction of outputs. In this work, much Experimentationwas conducted on various oversampling techniques and traditional classifiers. The results indicate that the proposed work is the most accurate one among other ML techniques. The precision, recall, F-measure, and ROC curve show notable outcomes. Hence K-Means SMOTE in parallel with ensemble learning has given satisfactory results and a precise solution to the imbalanced learning in IDS. It is ascertained whether ensemble modeling or oversampling techniques are dominating for Intrusion data set.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.