2011
DOI: 10.1609/aaai.v25i2.18851

A Machine Learning Based System for Semi-Automatically Redacting Documents

Abstract: Redacting text documents has traditionally been a mostly manual activity, making it expensive and prone to disclosure risks. This paper describes a semi-automated system to ensure a specified level of privacy in text data sets. Recent work has attempted to quantify the likelihood of privacy breaches for text data. We build on these notions to provide a means of obstructing such breaches by framing it as a multi-class classification problem. Our system gives users fine-grained control ove…

Cited by 32 publications (7 citation statements)
References 13 publications
“…Cumby and Ghani propose a sensitive data recognition technique based on machine learning that utilizes contextual semantic information to identify and detect sensitive content [7]. Chen et al. propose a non-parametric Bayesian hidden Markov model based on a Dirichlet process for medical record de-identification. Without manual task-specific feature engineering, the model can perform as accurately as conditional random field (CRF) models in several categories.…”
Section: Data Identification and Desensitization (mentioning)
confidence: 99%
“…Cumby and Ghani propose a sensitive data recognition technique based on machine learning that utilizes contextual semantic information to identify and detect sensitive content [7]. Chen et al. propose a non-parametric Bayesian hidden Markov model based on a Dirichlet process for medical record de-identification.…”
Section: Related Work (mentioning)
confidence: 99%
“…The second type of text anonymization methods relies on privacy-preserving data publishing (PPDP). In contrast to NLP approaches, PPDP methods (Chakaravarthy et al. 2008; Cumby and Ghani 2011; Anandan et al. 2012; Batet 2016, 2017) operate with an explicit account of disclosure risk and anonymize documents by enforcing a privacy model. As a result, PPDP approaches are able to consider any term that may re-identify a certain entity to protect (a human subject or an organization), either individually for direct identifiers (such as the person's name or a passport) or in aggregate for quasi-identifiers (such as the combination of age, profession and postal code).…”
Section: Text Anonymization Techniques (mentioning)
confidence: 99%
“…Other works consider the problem of document sanitization and security [26, 38, 39, 40]. Researchers have developed methods for encoding cryptographic signature schemes into PDF content and analyzing text to find semantically similar content to content marked for redaction.…”
Section: Related Work (mentioning)
confidence: 99%