2020
DOI: 10.1093/jamia/ocaa139

Democratizing EHR analyses with FIDDLE: a flexible data-driven preprocessing pipeline for structured clinical data

Abstract: Objective: In applying machine learning (ML) to electronic health record (EHR) data, many decisions must be made before any ML is applied; such preprocessing requires substantial effort and can be labor-intensive. As the role of ML in health care grows, there is an increasing need for systematic and reproducible preprocessing techniques for EHR data. Thus, we developed FIDDLE (Flexible Data-Driven Pipeline), an open-source framework that streamlines the preprocessing of data extracted from the…

Citations: cited by 56 publications (51 citation statements)
References: 44 publications
“…Noticeably, only one of the tools reviewed directly supported getting all features for a cohort as FIBER does. [21] Providing a Python interface and working on an i2b2 star schema data format, FIBER stands out in facilitating information exchange and cohort comparability between different health organizations following this schema (eg, the JSON cohort definitions can easily be shared across institutions). Generalizability of data extraction pipelines for these institutions has always been challenging, and we anticipate FIBER to alleviate this issue.…”
Section: Discussion (mentioning)
confidence: 99%
“…Future work will explore the impact that Phenoflow has on the portability of additional types of phenotype definitions, including probabilistic definitions, the development of which is likely to leverage data processing tools such as the Flexible Data-Driven Pipeline (FIDDLE) framework [20]. In addition, future work will investigate how the multidimension annotations of the structured definition model can be leveraged in order to introduce new search and discovery capabilities into phenotype repositories.…”
Section: Discussion (mentioning)
confidence: 99%
“…It incorporates good practices in ML training, testing, and model evaluation (Teschendorff, 2019; Topçuoğlu et al, 2020). Furthermore, it provides data preprocessing steps based on the FIDDLE (FlexIble Data-Driven pipeLinE) framework outlined in Tang et al (Tang et al, 2020) and post-training permutation importance steps to estimate the importance of each feature in the models trained (Breiman, 2001; Fisher et al, 2018).…”
Section: Statement Of Need (mentioning)
confidence: 99%
“…preprocess_data() takes continuous and categorical data, re-factors categorical data into binary features, and provides options to normalize continuous data, remove features with near-zero variance, and keep only one instance of perfectly correlated features. We set the default options based on those implemented in FIDDLE (Tang et al, 2020). More details on how to use preprocess_data() can be found in the accompanying vignette.…”
Section: Preprocessing Data (mentioning)
confidence: 99%
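
The preprocessing steps named in the last statement (binary encoding of categorical data, optional normalization, removal of near-zero-variance features, and keeping one of each set of perfectly correlated features) can be illustrated with a minimal Python sketch. This is not the preprocess_data() function of the citing package, nor FIDDLE's own code: the names `preprocess_sketch` and `pearson` are hypothetical, the variance threshold `var_tol` stands in for whatever near-zero-variance rule the real tools apply, and treating an absolute correlation of 1 as "perfectly correlated" is an assumption of this sketch.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant columns."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def preprocess_sketch(continuous, categorical, normalize=True, var_tol=1e-12):
    """Hypothetical illustration; inputs are dicts mapping column name -> list."""
    features = {}

    # 1. Optionally z-score each continuous column (constant columns are left as-is).
    for name, col in continuous.items():
        col = [float(v) for v in col]
        sd = pstdev(col)
        if normalize and sd > 0:
            mu = mean(col)
            col = [(v - mu) / sd for v in col]
        features[name] = col

    # 2. Re-code each categorical column as one binary indicator per level.
    for name, col in categorical.items():
        for level in sorted(set(col)):
            features[f"{name}_{level}"] = [1.0 if v == level else 0.0 for v in col]

    # 3. Drop features whose variance is at or below var_tol
    #    (a simplification of a near-zero-variance filter).
    features = {k: v for k, v in features.items() if pstdev(v) ** 2 > var_tol}

    # 4. Keep only the first feature of any perfectly correlated group (|r| = 1).
    kept = {}
    for name, col in features.items():
        if all(abs(pearson(col, other)) < 1 - 1e-9 for other in kept.values()):
            kept[name] = col
    return kept


result = preprocess_sketch(
    continuous={
        "age": [30, 40, 50, 60],
        "age_months": [360, 480, 600, 720],  # perfectly correlated with age
        "site": [1, 1, 1, 1],                # zero variance
    },
    categorical={"sex": ["F", "M", "F", "M"]},
)
print(list(result))  # ['age', 'sex_F']
```

In this toy input, `site` is dropped for having no variance, `age_months` for duplicating `age`, and `sex_M` because it is the exact complement of `sex_F`.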