Data protection authorities formulate policies and rules that service providers must comply with to ensure security and privacy when they perform Big Data analytics on users' Personally Identifiable Information (PII). The knowledge contained in data regulations and organizational privacy policies is typically maintained as short unstructured text in HTML or PDF format. Hence, it is an open challenge to determine which specific regulation rules a provider's privacy policies address. We have developed a semantically rich framework, using techniques from the Semantic Web and Natural Language Processing, to extract and compare the context of short text in real time. This framework supports automated, incremental text comparison and context identification for short-text policy documents by determining a semantic similarity score and extracting semantically similar key terms. We also created a knowledge graph to store the semantic similarity comparison results while evaluating our framework on the EU GDPR and the privacy policies of 20 organizations that comply with this regulation, spanning various categories that apply to Big Data stored in the cloud. Big Data practitioners can use our approach to update their referential documents regularly based on the authority documents.
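As a rough illustration of scoring two short texts and listing the terms they share, the sketch below uses plain lexical cosine similarity over word counts. This is a stand-in, not the semantic framework described in the abstract: it knows nothing about synonyms or ontologies. The function names and the sample sentences are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a, counts_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


def shared_terms(text_a, text_b):
    """Terms that occur in both texts (exact lexical overlap only)."""
    return sorted(set(tokenize(text_a)) & set(tokenize(text_b)))


# Hypothetical regulation rule and policy sentence, for illustration only.
rule = "personal data shall be processed lawfully"
policy = "we process personal data lawfully"
score = cosine_similarity(rule, policy)
terms = shared_terms(rule, policy)
```

Note that "processed" and "process" do not match here, which is exactly the kind of gap a semantic approach is meant to close.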
Named Entity Recognition (NER) is important in the cybersecurity domain. It helps researchers extract cyber threat information from unstructured text sources. The extracted cyber entities or key expressions can be used to model a cyber-attack described in an open-source text. A large number of general-purpose NER algorithms have been published that work well in text analysis. However, these algorithms do not perform well when applied to the cybersecurity domain, where the available open-source text varies greatly in complexity and in the underlying structure of its sentences. General-purpose NER algorithms can misrepresent domain-specific words, such as "malicious" and "javascript". In this paper, we compare recent deep learning-based NER algorithms on a cybersecurity dataset that we collected from various sources, including "Microsoft Security Bulletin" and "Adobe Security Updates". Some of these approaches, proposed in the literature, have not previously been applied to cybersecurity; others are innovations that we propose. This comparative study helps us identify NER algorithms that are robust and work well on sentences taken from a large number of cybersecurity sources. We tabulate their performance on the test set and identify the best NER algorithm for a cybersecurity corpus. We also discuss the different embedding strategies that aid NER for the chosen deep learning algorithms.
Machine Learning has increased our ability to model large quantities of data efficiently in a short time. Machine learning approaches in many application domains require collecting large volumes of data from distributed sources and combining them. However, sharing data from multiple sources raises privacy concerns. Privacy regulations such as the European Union's General Data Protection Regulation (GDPR) have specific requirements on when and how such data can be shared. Even when there are no specific regulations, organizations may have concerns about revealing their data. For example, in cybersecurity, organizations are reluctant to share the network-related data needed to build machine learning-based intrusion detectors. This has, in particular, hampered academic research. We need an approach that makes confidential data widely available for accurate data analysis without violating the privacy of the data subjects. Privacy in shared data has been discussed in prior work focusing on anonymization and encryption of data. An alternative way to make data available for analysis without sharing sensitive information is to replace the sensitive information with synthetic data that behaves like the original data for all analytical purposes. Generative Adversarial Networks (GANs) are among the well-known models for generating synthetic samples that can have the same distributional characteristics as the original data. However, modeling tabular data using a GAN is a non-trivial task. Tabular data contain a mix of categorical and continuous variables and require specialized constraints, as described in the CTGAN model. In this paper, we propose a framework to generate privacy-preserving synthetic data suitable for release for analytical purposes. The data is generated using the CTGAN approach, and so is analytically similar to the original dataset. To ensure that the generated data meets the privacy requirements, we use the principle of t-closeness. We ensure that the distribution of attributes in the released dataset is within a certain threshold distance from the real dataset. We also encrypt sensitive values in the final released
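The threshold check on attribute distributions can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it handles a single categorical attribute, uses the variational distance (half the L1 distance) as the distance measure, and compares the released column against the real one as a whole. Formal t-closeness is usually stated per equivalence class and often uses Earth Mover's Distance. The function names, the threshold `t`, and the sample values are hypothetical.

```python
from collections import Counter


def distribution(values):
    """Relative frequency of each distinct value in a non-empty column."""
    if not values:
        raise ValueError("cannot build a distribution from an empty column")
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def variational_distance(p, q):
    """Half the L1 distance between two categorical distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def within_threshold(real_values, released_values, t):
    """True if the released distribution is within distance t of the real one."""
    distance = variational_distance(
        distribution(real_values), distribution(released_values)
    )
    return distance <= t


# Hypothetical protocol column from a real and a synthetic network dataset.
real = ["tcp", "tcp", "udp", "icmp"]
released = ["tcp", "udp", "udp", "icmp"]
distance = variational_distance(distribution(real), distribution(released))
```

With these sample columns the distance is 0.25, so the released column would pass a threshold of `t = 0.3` and fail a stricter `t = 0.2`.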