Learning to identify Protected Health Information by integrating knowledge- and data-driven algorithms: A case study on psychiatric evaluation notes

Dehghan, Azad; Kovačević, Aleksandar; Karystianis, George; Keane, John; Nenadić, Goran

doi:10.1016/j.jbi.2017.06.005

Cited by 9 publications

(6 citation statements)

References 19 publications

“…While automated methods for stripping text of identifiers (of both patients and third parties) exist, they are not perfect, performing at 81%-99% sensitivity (recall) and 43%-99% precision, [23,24] and consequently many data custodians refuse to share text outside of the clinical environment. In contrast, the few UK research groups that are situated within healthcare trusts and can access medical text which remains within the clinical environment, have established good track records in terms of technology development, [25] protecting patient privacy [26] and generating clinical insights. [27][28][29]…”

Section: Why Is Medical Free Text Important For Research? (mentioning)

confidence: 99%

Should free-text data in electronic medical records be shared for research? A citizens’ jury study in the UK

Ford, Oswald, Hassan, et al. 2020. J Med Ethics

Self Cite

Background: Use of routinely collected patient data for research and service planning is an explicit policy of the UK National Health Service and UK government. Much clinical information is recorded in free-text letters, reports and notes. These text data are generally lost to research, due to the increased privacy risk compared with structured data. We conducted a citizens’ jury which asked members of the public whether their medical free-text data should be shared for research for public benefit, to inform an ethical policy. Methods: Eighteen citizens took part over 3 days. Jurors heard a range of expert presentations as well as arguments for and against sharing free text, and then questioned presenters and deliberated together. They answered a questionnaire on whether and how free text should be shared for research, gave reasons for and against sharing and suggestions for alleviating their concerns. Results: Jurors were in favour of sharing medical data and agreed this would benefit health research, but were more cautious about sharing free-text than structured data. They preferred processing of free text where a computer extracted information at scale. Their concerns were lack of transparency in uses of data, and privacy risks. They suggested keeping patients informed about uses of their data, and giving clear pathways to opt out of data sharing. Conclusions: Informed citizens suggested a transparent culture of research for the public benefit, and continuous improvement of technology to protect patient privacy, to mitigate their concerns regarding privacy risks of using patient text data.

“…Psychiatric notes were used mainly in an NLP community challenge to extract protected health information and symptom severity [23,27,42,53,58,65,78,83,92]. These narratives are key enablers of mental health informatics as the fine-grained context of actionable information does not readily lend itself to predefined coding schemes.…”

Section: Types Of Narratives (mentioning)

confidence: 99%

“…Similarly, as a subtask of IE, NER can be used to support structuring text into predefined templates, whose slots need to be filled with named entities of relevant types. The majority of NER studies were related to NLP community challenges such as those described in studies by Uzuner et al [123], Suominen et al [126], and Stubbs et al [131] [20,49,67,96,104]; disorders [54,57,88,98,114]; and protected health information [27,58,65]. Unlike NER, the more complex task of IE found a wider variety of clinical applications, the most prominent of which include prognosis and care improvement.…”

Section: Clinical Applications (mentioning)

confidence: 99%

Clinical Text Data in Machine Learning: Systematic Review

Spasić, Nenadić. 2020. JMIR Med Inform

Self Cite

Background: Clinical narratives represent the main form of communication within health care, providing a personalized account of patient history and assessments, and offering rich information for clinical decision making. Natural language processing (NLP) has repeatedly demonstrated its feasibility to unlock evidence buried in clinical narratives. Machine learning can facilitate rapid development of NLP tools by leveraging large amounts of text data. Objective: The main aim of this study was to provide systematic evidence on the properties of text data used to train machine learning approaches to clinical NLP. We also investigated the types of NLP tasks that have been supported by machine learning and how they can be applied in clinical practice. Methods: Our methodology was based on the guidelines for performing systematic reviews. In August 2018, we used PubMed, a multifaceted interface, to perform a literature search against MEDLINE. We identified 110 relevant studies and extracted information about text data used to support machine learning, NLP tasks supported, and their clinical applications. The data properties considered included their size, provenance, collection methods, annotation, and any relevant statistics. Results: The majority of datasets used to train machine learning models included only hundreds or thousands of documents. Only 10 studies used tens of thousands of documents, with a handful of studies utilizing more. Relatively small datasets were utilized for training even when much larger datasets were available. The main reason for such poor data utilization is the annotation bottleneck faced by supervised machine learning algorithms. Active learning was explored to iteratively sample a subset of data for manual annotation as a strategy for minimizing the annotation effort while maximizing the predictive performance of the model. Supervised learning was successfully used where clinical codes integrated with free-text notes into electronic health records were utilized as class labels. Similarly, distant supervision was used to utilize an existing knowledge base to automatically annotate raw text. Where manual annotation was unavoidable, crowdsourcing was explored, but it remains unsuitable because of the sensitive nature of data considered. Besides the small volume, training data were typically sourced from a small number of institutions, thus offering no hard evidence about the transferability of machine learning models. The majority of studies focused on text classification. Most commonly, the classification results were used to support phenotyping, prognosis, care improvement, resource management, and surveillance. Conclusions: We identified the data annotation bottleneck as one of the key obstacles to machine learning approaches in clinical NLP. Active learning and distant supervision were explored as a way of saving the annotation efforts. Future research in this field would benefit from alternatives such as data augmentation and transfer learning, or unsupervised learning, which do not require data annotation.

“…In the 2016 i2b2 shared task, ensemble with rule-based models became more popular. Lee et al [12], Dehghan et al [13], Bui et al [14], and Liu et al [15] all employed rule-based models as a component of their hybrid systems. However, despite the wide use of rules, all the works did not investigate the effect of rule-based models in hybrid architecture.…”

Section: Prior Work (mentioning)

confidence: 99%

Re-examination of Rule-Based Methods in Deidentification of Electronic Health Records: Algorithm Development and Validation

et al. 2020

Background: Deidentification of clinical records is a critical step before their publication. This is usually treated as a type of sequence labeling task, and ensemble learning is one of the best performing solutions. Under the framework of multi-learner ensemble, the significance of a candidate rule-based learner remains an open issue. Objective: The aim of this study is to investigate whether a rule-based learner is useful in a hybrid deidentification system and offer suggestions on how to build and integrate a rule-based learner. Methods: We chose a data-driven rule-learner named transformation-based error-driven learning (TBED) and integrated it into the best performing hybrid system in this task. Results: On the popular Informatics for Integrating Biology and the Bedside (i2b2) deidentification data set, experiments showed that TBED can offer high performance with its generated rules, and integrating the rule-based model into an ensemble framework, which reached an F1 score of 96.76%, achieved the best performance reported in the community. Conclusions: We proved the rule-based method offers an effective contribution to the current ensemble learning approach for the deidentification of clinical records. Such a rule system could be automatically learned by TBED, avoiding the high cost and low reliability of manual rule composition. In particular, we boosted the ensemble model with rules to create the best performance of the deidentification of clinical records.
