2010 Ninth International Conference on Machine Learning and Applications 2010
DOI: 10.1109/icmla.2010.78

A System for De-identifying Medical Message Board Text

Abstract: There are millions of public posts to medical message boards by users seeking support and information on a wide range of medical conditions. It has been shown that these posts can be used to gain a greater understanding of patients' experiences and concerns. As investigators continue to explore large corpora of medical discussion board data for research purposes, protecting the privacy of the members of these online communities becomes an important challenge that needs to be met. Extant entity recognition meth…

citations
Cited by 4 publications
(2 citation statements)
references
References 15 publications
“…The text is then tokenized around these regular expression matches and a conditional random field (CRF) trained using the CRF++ toolkit over a 1000 message sample of breast cancer posts is used to identify which tokens are likely to be either proper or usernames and should be removed. Over a sample of 500 messages from the training corpus, the de-identification module correctly removed 98.1% of all proper and usernames in the sample, and over a 500 message sample from an arthritis MMB corpus it correctly removed 93.8% of all proper and usernames in that sample (Benton et al, 2011a). In comparison, MIST, the highest scoring system in the 2006 i2b2 de-identification challenge (Uzuner et al 2007), produced a much lower recall (73.0 and 54.6%, respectively) Although the precision for the system was relatively low (67.4% for the breast cancer corpus), the majority of falsely de-identified tokens was not medically important (e.g.…”
Section: De-identification (mentioning)
confidence: 99%
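
The statement above describes two mechanical steps: tokenizing text around regular expression matches, and scoring a de-identifier by recall and precision. A minimal Python sketch of both follows. The patterns (`email`, `url`), the function names, and the label `O` are assumptions made for illustration; the quoted passage does not list the actual regular expressions, and the CRF tagging step is not reproduced here.

```python
import re

# Hypothetical patterns: the source does not say which regular
# expressions the system actually uses.
IDENTIFIER_PATTERNS = re.compile(
    r"(?P<email>[\w.+-]+@[\w-]+\.[\w.-]+)|(?P<url>https?://\S+)"
)


def tokenize_around_matches(text):
    """Return (token, label) pairs.

    Regex matches are kept as single tokens labelled with the pattern
    name; the text between matches is split on whitespace and labelled
    "O" (left for a downstream tagger such as a CRF to classify).
    """
    tokens = []
    pos = 0
    for m in IDENTIFIER_PATTERNS.finditer(text):
        tokens.extend((t, "O") for t in text[pos:m.start()].split())
        tokens.append((m.group(), m.lastgroup.upper()))
        pos = m.end()
    tokens.extend((t, "O") for t in text[pos:].split())
    return tokens


def recall_precision(gold, predicted):
    """Score a de-identifier from sets of token indices.

    gold: indices that truly are identifiers.
    predicted: indices the system removed.
    """
    true_positives = len(gold & predicted)
    recall = true_positives / len(gold) if gold else 1.0
    precision = true_positives / len(predicted) if predicted else 1.0
    return recall, precision
```

In this framing, high recall with lower precision, as reported in the statement, means that nearly every true identifier is removed at the cost of also removing some tokens that were not identifiers.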
“…A clear example of this phenomenon is the substantial quantity of natural language text which is created in many different arenas, ranging from field reports of intelligence agencies [11] to clinical notes in medical records [12] to microblogging over social media platforms [13]. To protect such data, there has been a significant amount of research into natural language processing (NLP) techniques to detect (and subsequently redact or substitute) identifiers [14], [15]. The most scalable versions of such techniques are rooted in machine learning methods [16], in which the publisher of the data annotates instances of identifiers in the text, and the machine attempts to learn a classifier (e.g., a grammar) to predict where such identifiers reside in a much larger corpus.…”
Section: Introduction (mentioning)
confidence: 99%