2022
DOI: 10.1038/s41746-022-00590-0

Cohort design and natural language processing to reduce bias in electronic health records research

Abstract: Electronic health record (EHR) datasets are statistically powerful but are subject to ascertainment bias and missingness. Using the Mass General Brigham multi-institutional EHR, we approximated a community-based cohort by sampling patients receiving longitudinal primary care between 2001 and 2018 (Community Care Cohort Project [C3PO], n = 520,868). We utilized natural language processing (NLP) to recover vital signs from unstructured notes. We assessed the validity of C3PO by deploying established risk models for …

Cited by 46 publications (37 citation statements)
References 46 publications
“…To our knowledge, this is the first example of using a transformer-based model (without pretraining from scratch) fine-tuned on clinician labels to extract numerical measurements from diagnostic text. We previously demonstrated the value of extracting 4 vital sign measurements from clinical text based on a large number of weak labels that were generated using a rule-based approach [16]. Our previous approach was based on the assumption that it would be impractical to accrue a sufficient quantity of gold-standard annotations in order to fine-tune a transformer-based approach.…”
Section: Discussion (mentioning)
confidence: 99%
“…EWOC comprises 99,252 adults aged 18 years or older with ≥2 cardiology clinic visits within 1 to 3 years between 2000 and 2019. A broad range of EHR data are available for each individual in the cohort, including demographics, anthropometrics, vital signs, narrative notes, laboratory results, medication lists, radiology and cardiology diagnostic test results, pathology reports, and procedural and diagnostic administrative billing codes [16]. These data were processed using the JEDI Extractive Data Infrastructure [17].…”
Section: Methods (mentioning)
confidence: 99%
“…Electronic medical records (EMRs) are widely used in hospitals, and a large amount of clinical data are generated by EMRs in daily medical activities. (3) In recent years, the Surveillance, Epidemiology, and End Results (SEER) and Medical Information Mart for Intensive Care (MIMIC) databases have been known worldwide (4, 5), which provided clinicians with numerous clinical data to deal with. Clinicians and researchers have realized that the clinical data generated by the actual clinical practices in their own centers may help to provide more appropriate solutions for the medical problems in their daily work.…”
Section: Introduction (mentioning)
confidence: 99%