2018
DOI: 10.1371/journal.pmed.1002701
Development and validation of machine learning models to identify high-risk surgical patients using automatically curated electronic health record data (Pythia): A retrospective, single-site study

Abstract: Background: Pythia is an automated, clinically curated surgical data pipeline and repository housing all surgical patient electronic health record (EHR) data from a large, quaternary, multisite health institute for data science initiatives. In an effort to better identify high-risk surgical patients from complex data, a machine learning project trained on Pythia was built to predict postoperative complication risk. Methods and findings: A curated data repository of surgical outcomes was created using automated SQL …

Cited by 172 publications (166 citation statements)
References 32 publications
“…The training set was allocated the first 70% of the individuals (sorted by date of first T2D diagnosis) of the full dataset and used for three-fold cross-validated 56 hyperparameter search. Similarly to work by Corey et al 28 , the remaining (most recently diagnosed) 30% of the full dataset were randomly split into two balanced (the proportion of cases was maintained in each split) subsets: a test set which was allocated 20% of the full dataset to be used for model selection for each outcome and the validation set which was allocated the remaining 10% of the full dataset and exclusively used to present final results. For each model type, the best parameter set was chosen by first repeatedly training the model on two-thirds of the training data and evaluating it on the remaining third, and then averaging accuracy measures across the three runs.…”
Section: Dataset Split and Model Training/Evaluation Procedures
confidence: 99%
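The chronological 70/20/10 split described in the excerpt above can be sketched as follows. This is a minimal illustration, not the cited authors' code; the function and variable names are hypothetical, and scikit-learn's `train_test_split` is assumed for the stratified portion:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def chronological_split(dates, labels, seed=0):
    """Hypothetical sketch of the split described in the excerpt:
    the earliest 70% of individuals (sorted by date of first diagnosis)
    form the training set; the most recent 30% are split, stratified by
    outcome, into a test set (20% of the full data) and a validation
    set (10%). `dates` and `labels` are NumPy arrays of equal length."""
    order = np.argsort(dates)            # sort individuals chronologically
    n_train = int(0.7 * len(order))
    train_idx = order[:n_train]          # earliest 70% -> training set
    rest = order[n_train:]               # most recent 30%
    # Stratified split of the remainder: two-thirds become the test set
    # (20% of the full data), one-third the validation set (10%),
    # preserving the proportion of cases in each subset.
    test_idx, val_idx = train_test_split(
        rest, test_size=1 / 3, stratify=labels[rest], random_state=seed
    )
    return train_idx, test_idx, val_idx
```

Sorting before splitting keeps the test and validation sets strictly later in time than the training set, which mimics prospective deployment, while stratification keeps the case proportion stable across the two held-out subsets.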
“…Machine learning (ML) techniques are increasingly being used to analyze electronic health record data to predict future disease onset or its future course [9][10][11][12][13] . These efforts include prediction of onset and complications of cardiovascular disease [14][15][16][17][18][19][20][21] , onset of T2D [22][23][24][25][26] , onset of kidney disease 27 , as well as prediction of postoperative outcomes [28][29][30][31][32] , birth related outcomes 33,34 , mortality 15,35,36 and hospital readmissions [37][38][39][40][41][42][43] . However, current approaches typically suffer from a number of limitations.…”
Section: Introduction
confidence: 99%
“…First and foremost, the technology infrastructure required to run models in real-time had to be built. Fortunately, at our institution, such an infrastructure to automatically curate EHR data was already in place, utilizing native functionality in addition to custom developed technologies 24 . Second, there were many differences in data element names between the retrospective training data and prospective, live EHR data.…”
Section: Challenges and Limitations
confidence: 99%
“…Administrative data are increasingly used to this purpose, because of their variety, availability, low cost and accuracy 7–9. Such tools have been applied to screen inpatients,10 11 outpatients12 and free-living subjects5 12 13 for mortality, hospitalisation or disease-specific outcomes, aiming for personalised cures 14–16…”
Section: Introduction
confidence: 99%