The Rules of Redaction: Identify, Protect, Review (and Repeat)

Engineering Applications of Artificial Intelligence

2017

Privacy has become a serious concern for modern Information Societies. The sensitive nature of much of the data that are daily exchanged or released to untrusted parties requires that responsible organizations undertake appropriate privacy protection measures. Nowadays, much of these data are texts (e.g., emails, messages posted in social media, healthcare outcomes, etc.) that, because of their unstructured and semantic nature, constitute a challenge for automatic data protection methods. In fact, textual documents are usually protected manually, in a process known as document redaction or sanitization. To do so, human experts identify sensitive terms (i.e., terms that may reveal identities and/or confidential information) and protect them accordingly (e.g., via removal or, preferably, generalization). To relieve experts from this burdensome task, in a previous work we introduced the theoretical basis of C-sanitization, an inherently semantic privacy model that provides the basis to the development of automatic document redaction/sanitization algorithms and offers clear and a priori privacy guarantees on data protection; even though its potential benefits C-sanitization still presents some limitations when applied to practice (mainly regarding flexibility, efficiency and accuracy). In this paper, we propose a new more flexible model, named (C, g(C))-sanitization, which enables an intuitive configuration of the trade-off between the desired level of protection (i.e., controlled information disclosure) and the preservation of the utility of the protected data (i.e., amount of semantics to be preserved). Moreover, we also present a set of technical solutions and algorithms that provide an efficient and scalable implementation of the model and improve its practical accuracy, as we also illustrate through empirical experiments

Section: Background On Plain Textual Data Protectionmentioning

confidence: 99%

“…In particular, all references to Sexually Transmitted Diseases (STDs) or HIV status should be redacted or sanitized. To do so, terms explicitly referring to these diseases and those semantically related ones such as drugs, treatments or symptoms should be identified and protected [3].…”

Section: Empirical Analysismentioning

confidence: 99%

Toward sensitive document release with privacy guarantees

Snchez

Engineering Applications of Artificial Intelligence

2017

“…Regarding the former, U.S. federal laws on medical data privacy [35,36] mandate hospitals and healthcare organizations to protect any references made to STDs and HIV status in patient medical records before releasing them to, for example, insurance companies in response to Worker's Compensation or Motor Vehicle Accident claims. To do so, those terms explicitly referring to these diseases and those semantically related ones, such as treatments or symptoms, should be protected [19]. Likewise, the EU Data Protection Directive [38] states that the information related to the religion and sexual orientation of EU citizens should be protected in order to avoid possible discrimination.…”

Section: Evaluation Data and Case Studiesmentioning

confidence: 99%

Privacy-preserving data outsourcing in the cloud via semantic data splitting

Sánchez

Computer Communications

2017

Even though cloud computing provides many intrinsic benefits (e.g., cost savings, availability, scalability, etc.), privacy concerns related to the lack of control over the storage and management of the outsourced (confidential) data still prevent many customers from migrating to the cloud. In this respect, several privacy-protection mechanisms based on a prior encryption of the data to be outsourced have been proposed. Data encryption offers robust security, but at the cost of hampering the efficiency of the service and limiting the functionalities that can be applied over the (encrypted) data stored on cloud premises. Because both efficiency and functionality are crucial advantages of cloud computing, especially in SaaS, in this paper we aim at retaining them by proposing a privacy-protection mechanism that relies on splitting (clear) data, and on the distributed storage offered by the increasingly popular notion of multiclouds. Specifically, we propose a semantically-grounded data splitting mechanism that is able to automatically detect pieces of data that may cause privacy risks and split them on local premises, so that each chunk does not incur in those risks; then, chunks of clear data are independently stored into the separate locations of a multi-cloud, so that external entities (cloud service providers and attackers) cannot have access to the whole confidential data. Because partial data are stored in clear on cloud premises, outsourced functionalities are seamlessly and efficiently supported by just broadcasting queries to the different cloud locations. To enforce a robust privacy notion, our proposal relies on a privacy model that offers a priori privacy guarantees; to ensure its feasibility, we have designed heuristic algorithms that minimize the number of cloud storage locations we need; to show its potential and generality, we have applied it to the least structured and most challenging data type: plain textual documents.

“…Its goal is to mimic and, hence, automatize the reasoning of human sanitizers with regard to semantic inferences, disclosure analysis, and protection of textual documents. To achieve that, our proposal relies on an assessment and quantification of the data semantics that human experts usually consider in document sanitization (Bier et al., ; Gordon, ). Our proposal provides the following contributions over the state of the art: In comparison with available models (Anandan et al., ; Cumby & Ghani, ), which assume that all risky terms (sensitive entities or related terms) have been identified a priori, our proposal automatizes both the detection of terms that can disclose sensitive data via semantic inferences and their protection.…”

Section: Introductionmentioning

confidence: 99%

“…Our proposal provides the following contributions over the state of the art: In comparison with available models (Anandan et al., ; Cumby & Ghani, ), which assume that all risky terms (sensitive entities or related terms) have been identified a priori, our proposal automatizes both the detection of terms that can disclose sensitive data via semantic inferences and their protection. This relieves human sanitizers from manually identifying related terms, which has been identified as one of the most difficult and time‐consuming challenges (Bier et al., ; Gordon, ). To do so, our model considers, as human sanitizers do, the semantic relationships by which terms or combinations of terms appearing in a document would disclose sensitive information via semantic inferences.…”

Section: Introductionmentioning

confidence: 99%

C‐sanitized: A privacy model for document redaction and sanitization

Sánchez

Asso for Info Science & Tech

2015

Within the current context of Information Societies, large amounts of information are daily exchanged and/or released. The sensitive nature of much of this information causes a serious privacy threat when documents are uncontrollably made available to untrusted third parties. In such cases, appropriate data protection measures should be undertaken by the responsible organization, especially under the umbrella of current legislations on data privacy. To do so, human experts are usually requested to redact or sanitize document contents. To relieve this burdensome task, this paper presents a privacy model for document redaction/sanitization, which offers several advantages over other models available in the literature. Based on the wellestablished foundations of data semantics and the information theory, our model provides a framework to develop and implement automated and inherently semantic redaction/sanitization tools. Moreover, contrary to ad-hoc redaction methods, our proposal provides a priori privacy guarantees which can be intuitively defined according to current legislations on data privacy. Empirical tests performed within the context of several use cases illustrate the applicability of our model and its ability to mimic the reasoning of human sanitizers.