2015
DOI: 10.1016/j.jbi.2015.06.009

Automatic de-identification of electronic medical records using token-level and character-level conditional random fields

Abstract: De-identification, identifying and removing all protected health information (PHI) present in clinical data including electronic medical records (EMRs), is a critical step in making clinical data publicly available. The 2014 i2b2 (Center of Informatics for Integrating Biology and Bedside) clinical natural language processing (NLP) challenge sets up a track for de-identification (track 1). In this study, we propose a hybrid system based on both machine learning and rule approaches for the de-identification trac…

Cited by 68 publications (70 citation statements)
References 21 publications
“…The representative works are three natural language processing (NLP) challenges, two organized by the Center of Informatics for Integrating Biology and Bedside (i2b2) in 2006 [2] and 2014 [3, 4, 5], and one organized by the Centers of Excellence in Genomic Science (CEGS) Neuropsychiatric Genome-scale and RDOC Individualized Domains (N-GRID) in 2016 [6]. The organizers of the three challenges provide manually annotated corpora for participants to develop various kinds of systems for de-identification [7, 8, 9, 10, 11, 12, 13, 14, 15]. …”
Section: Introduction (mentioning)
confidence: 99%
“…In our system, an ensemble classifier is deployed to combine the outputs of three individual machine learning-based subsystems, and a rule-based subsystem is used to identify some formulaic PHI instances. The three machine learning-based subsystems are a CRF-based system with a large number of hand-crafted features [12], a bidirectional LSTM-based system without any hand-crafted features [16, 17], and a variant of bidirectional LSTM-based system with a small quantity of hand-crafted features [18, 19]. Moreover, we also evaluate our system on the 2014 i2b2 challenge corpus and compare it with other state-of-the-art systems.…”
Section: Introduction (mentioning)
confidence: 99%
“…These rules increase the system coverage by permitting the uses of more relaxed patterns and ambiguous terms. Researchers can combine the pattern and dictionary methods in the machine-learning model by using matching results as features (11, 19). This method is simple and likely optimal within the scope of the challenge.…”
Section: Discussion (mentioning)
confidence: 99%
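
The idea in the statement above, feeding pattern and dictionary matches to a machine-learning model as features, can be sketched roughly as follows. This is a minimal illustration and not the cited authors' implementation: the regular expressions, the `CITY_DICTIONARY` set, and the function names are all hypothetical, chosen only to show the shape of a per-token feature dictionary that a sequence labeler such as a CRF could consume.

```python
import re

# Hypothetical patterns for formulaic PHI; real systems use far richer sets.
PATTERNS = {
    "date": re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$"),
    "phone": re.compile(r"^\d{3}-\d{3}-\d{4}$"),
}

# Hypothetical dictionary of location names (assumption, for illustration).
CITY_DICTIONARY = {"boston", "cambridge"}


def token_features(tokens, i):
    """Build a feature dict for tokens[i], including match results as features."""
    token = tokens[i]
    features = {
        "lower": token.lower(),
        "is_title": token.istitle(),
        # Dictionary lookup result exposed as a boolean feature.
        "in_city_dict": token.lower() in CITY_DICTIONARY,
        # Previous token as simple context; "<BOS>" marks the sentence start.
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
    }
    # Each pattern match result becomes its own boolean feature.
    for name, pattern in PATTERNS.items():
        features["matches_" + name] = bool(pattern.match(token))
    return features


def sentence_features(tokens):
    """Return one feature dict per token, in order."""
    return [token_features(tokens, i) for i in range(len(tokens))]
```

Because the matches enter the model as features rather than as final decisions, the learner can weigh an ambiguous dictionary hit against its context instead of accepting it outright.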
“…Successful systems used the machine-learning algorithm Conditional Random Field (CRF) for labeling a sequence of tokens (11, 18, 19). The popular rule-based method was pattern-matching using a formal language such as regular expressions.…”
Section: Introduction (mentioning)
confidence: 99%