2022
DOI: 10.3390/info13010027

Automatic Curation of Court Documents: Anonymizing Personal Data

Abstract: In order to provide open access to data of public interest, it is often necessary to perform several data curation processes. In some cases, such as biological databases, curation involves quality control to ensure reliable experimental support for biological sequence data. In others, such as medical records or judicial files, publication must not interfere with the right to privacy of the persons involved. There are also interventions in the published data with the aim of generating metadata that enable a bet…

Cited by 10 publications (7 citation statements)
References 40 publications
“…Other domains such as legal texts or public administrations can benefit from the insights of this work, even if the documents are different in nature and some points are specific. 25,26 The code we provide with this article uses data in OMOP format, which focuses on medical data only. 13 This makes it easier to adapt to other data warehouses using the same data model.…”
Section: Domain Specificities (mentioning)
confidence: 99%
“…Given the nature of the identifiers that must be obscured, it is quite natural that various anonymization approaches rely significantly on NER techniques. In particular, machine-learning-based NER approaches have been used, employing, for example, Support Vector Machines (SVM), Conditional Random Fields (CRF), and recurrent neural networks (Garat and Wonsever 2022).…”
Section: Methods Based on Machine Learning Techniques (unclassified)
“…Sensitive data, such as medical histories, newspapers, conversations, reports, agreements, etc., are mostly enclosed in document form. In recent years, anonymization of document data has become a very hot research topic [37], [38]. Various techniques, from natural language processing (named entity recognition) combined with clustering concepts (e.g., k-means) are employed to anonymize textual data of documents.…”
Section: E Cams For Document Data (mentioning)
confidence: 99%
“…
Ref | Study Nature | Strengths | Weaknesses
Can et al [78] | Practical | Ensures protection based on distinct privacy-related preferences provided by users to control anonymity | Can lead to higher information loss if data is imbalanced, and values of most QIs are close
Meisam et al [79] | Practical | Preserving both privacy and utility by creating k views of the trace data | Can lead to higher computing cost when the dataset is large, utility can be poor when data is skewed
Fan et al [80] | Practical | Effectively preserve the privacy of network flows data by creating synthetic data using GANs | Can lead to higher utility loss when offset between original and synthetic data is high
Meisam et al [81] | Practical | Preserves privacy of important fields in trace data using pseudonyms and Multiview approach | Yields higher computing complexity by creating multiple views of data, and prone to linking attack
Aleroud et al [82] | Practical | A DP-based prototype to address the privacy-utility trade-off in network trace data | Subject to personal information disclosure in the presence of auxiliary information
Ahmed et al [83] | Practical | Strong privacy protection of critical fields in network logs data using condensation-based approach | Prone to low utility results on special purpose metrics (i.e., accuracy, F1, etc.) of data mining
Velarde et al [84] | Practical | Practical solution for anonymization of traffic trace data with better privacy using entropy approaches | Yields poor utility when most of the data belongs to distinct regions and # of fields are large
Shaham et al [85] | Practical | Strong privacy protection in location data sharing, and applicable to medical records and web analysis | Less resilience against knowledge graph-powered attacks as well linking attack using auxiliary data
Mahajan et al [86] | Conceptual | Enables keyword searches on encrypted data with better privacy using k-means clustering approach | Does not provide provable utility in terms of information loss, accuracy, and F1 score on diverse data
Li et al [87] | Practical | A low-cost solution for extracting and anonymizing sensitive data items from documents | Lack of validation and testing on real-world (i.e., PHI data) and large scale medical documents
Li et al [88] | Practical | Developed a practical solution for identifying, summarizing, and report generation from health data | Prone to repeated query attacks using same noise for some queries, and can reveal true values
Kong et al [89] | Practical | Privacy preservation of documents and multiple data items including features, metadata, and text | Evaluation was conducted on static data, there exist a possibility of true value disclosure
Garat et al [90] | Practical | A corpus-based method for privacy preservation of court documents and sensitive data items in them | Requires a very large # of documents (e.g., up to 80K) for good performance, and complexity is high
Li et al [91] | Practical | Robust privacy protection of medical data by concealing potentially identifying health data items | Poor utility when original data to be anonymized is in scattered form, and values are highly dissimilar
Lima et al…”
Section: Ref Study Nature Strengths Weaknesses (mentioning)
confidence: 99%