De-identification of Emergency Medical Records in French: Survey and Comparison of State-of-the-Art Automated Systems

Bourdois, Loïck; Avalos, Marta; Chenais, Gabrielle; Thiessard, Frantz; Revel, Philippe; Gil-Jardine, Cédric; Lagarde, Emmanuel

doi:10.32473/flairs.v34i1.128480

Cited by 5 publications

(13 citation statements)

References 13 publications


“…In this work, we address this challenge by processing more than 58 document types. Furthermore, while previous work on French clinical deidentification annotates their corpus manually ([4], [9]), our approach uses distant supervision, which reduces both the cost and time required for annotation.…”

Section: Discussion (mentioning)

confidence: 99%

“…To reuse such records and conduct health data-related studies, the task of deidentification has become essential ([4], [5], [6]). This is necessary to protect the confidentiality of personal data in EHRs and comply with government regulations set in our case by the French Data Protection Authority, Commission Nationale de l'Informatique et des Libertés (CNIL), and the General Data Protection Regulation (GDPR).…”

Section: Introduction (mentioning)

confidence: 99%


Automatic Deidentification of French Electronic Health Records: A Cost-Effective Approach Exploiting Distant Supervision and Deep Learning Models

Azzouzi, Coatrieux, Bellafqira et al. 2023

Preprint


Background: Electronic health records (EHRs) contain valuable information for clinical research; however, the sensitive nature of healthcare data presents security and confidentiality challenges. Deidentification is therefore essential to protect personal data in EHRs and comply with government regulations. Named entity recognition (NER) methods have been proposed to remove personal identifiers, with deep learning-based models achieving better performance. However, manual annotation of training data is time-consuming and expensive. The aim of this study was to develop an automatic deidentification pipeline for all kinds of clinical documents based on a distant supervised method to significantly reduce the cost of manual annotations and to facilitate the transfer of the deidentification pipeline to other clinical centers. Methods: We proposed an automated annotation process for French clinical deidentification, exploiting data from the eHOP clinical data warehouse (CDW) of the CHU de Rennes and national knowledge bases, as well as other features. In addition, this paper proposes an assisted data annotation solution using the Prodigy annotation tool. This approach aims to reduce the cost required to create a reference corpus for the evaluation of state-of-the-art NER models. Finally, we evaluated and compared the effectiveness of different NER methods. Results: A French deidentification dataset was developed in this work, based on EHRs provided by the eHOP CDW at Rennes University Hospital, France. The dataset was rich in terms of personal information, and the distribution of entities was quite similar in the training and test datasets. We evaluated a Bi-LSTM + CRF sequence labeling architecture, combined with Flair + FastText word embeddings, on a test set of manually annotated clinical reports. The model outperformed the other tested models with a significant F1 score of 96.96%, demonstrating the effectiveness of our automatic approach for deidentifying sensitive information. Conclusions: This study provides an automatic deidentification pipeline for clinical notes, which can facilitate the reuse of EHRs for secondary purposes such as clinical research. Our study highlights the importance of using advanced NLP techniques for effective de-identification, as well as the need for innovative solutions such as distant supervision to overcome the challenge of limited annotated data in the medical domain.
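For readers unfamiliar with how an F1 score like the one reported in this abstract is typically obtained, the following is a minimal sketch of entity-level precision, recall, and F1 under exact span matching. The span representation, the labels, and the toy data are assumptions made for illustration; the abstract does not specify the matching criterion the authors used.

```python
from typing import Set, Tuple

# Hypothetical representation: (start offset, end offset, entity label).
Span = Tuple[int, int, str]


def precision_recall_f1(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match entity-level scores: a prediction counts only if its
    offsets and label are identical to a gold annotation."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data (invented): two of three gold entities are found, one prediction is spurious.
gold = {(0, 6, "NAME"), (20, 30, "DATE"), (45, 59, "PHONE")}
pred = {(0, 6, "NAME"), (20, 30, "DATE"), (70, 75, "CITY")}

precision, recall, f1 = precision_recall_f1(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

For de-identification, recall is usually the more critical of the two components, since a missed identifier leaks personal data while a spurious one only removes some non-sensitive text.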


“…Although our own dataset contains more than 3,600 annotated documents, our experiments with varying the size of the training set led us to the conclusion that excellent performance is achieved as early as 500 annotated documents, and that performance stops increasing significantly beyond 1,000 documents (►Fig. 6).…”

Section: Size of the Training Dataset (mentioning)

confidence: 99%

“…A lot of work has been done on this topic, in several languages, [1][2][3] including French. [4][5][6][7] Different scenarios have been proposed to improve the processing of this task. 4,8 Yet, there is no consensus method or protocol in the community, and more importantly it is very difficult for new actors to benefit from the experience and tools implemented by others, for several reasons.…”

Section: Introduction (mentioning)

confidence: 99%

Development and Validation of a Natural Language Processing Algorithm to Pseudonymize Documents in the Context of a Clinical Data Warehouse

Tannier, Wajsbürt, Calliger et al. 2024

Methods Inf Med


Objective The objective of this study is to address the critical issue of deidentification of clinical reports to allow access to data for research purposes, while ensuring patient privacy. The study highlights the difficulties faced in sharing tools and resources in this domain and presents the experience of the Greater Paris University Hospitals (AP-HP for Assistance Publique-Hôpitaux de Paris) in implementing a systematic pseudonymization of text documents from its Clinical Data Warehouse. Methods We annotated a corpus of clinical documents according to 12 types of identifying entities and built a hybrid system, merging the results of a deep learning model as well as manual rules. Results and Discussion Our results show an overall performance of 0.99 of F1-score. We discuss implementation choices and present experiments to better understand the effort involved in such a task, including dataset size, document types, language models, or rule addition. We share guidelines and code under a 3-Clause BSD license.
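The hybrid approach mentioned in this abstract (a deep learning model combined with manual rules) can be illustrated with a small sketch. Everything below is an assumption made for illustration: the `Span` type, the phone-number regex, the hard-coded "model" output, and the overlap policy (keep the longest span) are hypothetical and are not taken from the AP-HP system.

```python
import re
from typing import List, NamedTuple


class Span(NamedTuple):
    start: int
    end: int
    label: str
    source: str  # "model" or "rule"


# Hypothetical rule: French-style 10-digit phone numbers such as "06 12 34 56 78".
PHONE_RE = re.compile(r"\b0\d(?:[ .]?\d{2}){4}\b")


def rule_spans(text: str) -> List[Span]:
    """Spans found by the hand-written rule."""
    return [Span(m.start(), m.end(), "PHONE", "rule") for m in PHONE_RE.finditer(text)]


def merge_spans(model: List[Span], rules: List[Span]) -> List[Span]:
    """Union of both sources; on overlap keep the longest span (assumed policy)."""
    candidates = sorted(model + rules, key=lambda s: (-(s.end - s.start), s.start))
    kept: List[Span] = []
    for span in candidates:
        if all(span.end <= k.start or span.start >= k.end for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s.start)


def pseudonymize(text: str, spans: List[Span]) -> str:
    """Replace each span (sorted, non-overlapping) with a label placeholder."""
    parts, cursor = [], 0
    for span in spans:
        parts.append(text[cursor:span.start])
        parts.append(f"<{span.label}>")
        cursor = span.end
    parts.append(text[cursor:])
    return "".join(parts)


text = "Patient Jean Dupont, joignable au 06 12 34 56 78."
# Stand-in for a neural NER prediction; a real system would run a trained model here.
name_start = text.index("Jean Dupont")
model_spans = [Span(name_start, name_start + len("Jean Dupont"), "NAME", "model")]

merged = merge_spans(model_spans, rule_spans(text))
print(pseudonymize(text, merged))
```

Keeping the `source` field on each span makes it easy to measure afterwards how much each component contributes, which relates to the "rule addition" experiments the abstract mentions.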


“…To reuse such records and conduct health data-related studies, the task of de-identification has become essential [4][5][6]. This is necessary to protect the confidentiality of personal data in EHRs and comply with government regulations set in our case by the French Data Protection Authority, Commission Nationale de l'Informatique et des Libertés (CNIL), and the General Data Protection Regulation (GDPR).…”

Section: Introduction (mentioning)

confidence: 99%

Automatic de-identification of French electronic health records: a cost-effective approach exploiting distant supervision and deep learning models

Azzouzi, Coatrieux, Bellafqira et al. 2024

BMC Med Inform Decis Mak


Background Electronic health records (EHRs) contain valuable information for clinical research; however, the sensitive nature of healthcare data presents security and confidentiality challenges. De-identification is therefore essential to protect personal data in EHRs and comply with government regulations. Named entity recognition (NER) methods have been proposed to remove personal identifiers, with deep learning-based models achieving better performance. However, manual annotation of training data is time-consuming and expensive. The aim of this study was to develop an automatic de-identification pipeline for all kinds of clinical documents based on a distant supervised method to significantly reduce the cost of manual annotations and to facilitate the transfer of the de-identification pipeline to other clinical centers. Methods We proposed an automated annotation process for French clinical de-identification, exploiting data from the eHOP clinical data warehouse (CDW) of the CHU de Rennes and national knowledge bases, as well as other features. In addition, this paper proposes an assisted data annotation solution using the Prodigy annotation tool. This approach aims to reduce the cost required to create a reference corpus for the evaluation of state-of-the-art NER models. Finally, we evaluated and compared the effectiveness of different NER methods. Results A French de-identification dataset was developed in this work, based on EHRs provided by the eHOP CDW at Rennes University Hospital, France. The dataset was rich in terms of personal information, and the distribution of entities was quite similar in the training and test datasets. We evaluated a Bi-LSTM + CRF sequence labeling architecture, combined with Flair + FastText word embeddings, on a test set of manually annotated clinical reports. The model outperformed the other tested models with a significant F1 score of 96.96%, demonstrating the effectiveness of our automatic approach for de-identifying sensitive information. Conclusions This study provides an automatic de-identification pipeline for clinical notes, which can facilitate the reuse of EHRs for secondary purposes such as clinical research. Our study highlights the importance of using advanced NLP techniques for effective de-identification, as well as the need for innovative solutions such as distant supervision to overcome the challenge of limited annotated data in the medical domain.
