In 2010, the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD) started the Nulliparous Pregnancy Outcomes Study: Monitoring Mothers-to-be (nuMoM2b), a prospective cohort study of a racially/ethnically/geographically diverse population of nulliparous women with singleton gestation. The nuMoM2b is a very large dataset, consisting of data for 10,038 patients with over 4,600 features per patient, spread out over 80 files. In this report, we share our experience preparing and working with this dataset. We present our data preprocessing of the nuMoM2b dataset to get a deeper understanding of the data, extract the most relevant features, make the fewest assumptions when filling in unknown values, and reducing the dimensionality of the data. We hope this report is useful to researchers interested in building machine learning and statistical models from the nuMoM2b dataset.
Objective Preeclampsia is one of the leading causes of maternal morbidity, with consequences during and after pregnancy. Because of its diverse clinical presentation, preeclampsia is an adverse pregnancy outcome that is uniquely challenging to predict and manage. In this paper, we developed machine learning models that predict the onset of preeclampsia with severe features or eclampsia at discrete time points in a nulliparous pregnant study cohort. Materials and Methods The prospective study cohort to which we applied machine learning is the Nulliparous Pregnancy Outcomes Study: Monitoring Mothers-to-be (nuMoM2b) study, which contains information from eight clinical sites across the US. Maternal serum samples were collected for 1,857 individuals between the first and second trimesters. These patients with serum samples collected are selected as the final cohort. Results Our prediction models achieved an AUROC of 0.72 (95% CI, 0.69–0.76), 0.75 (95% CI, 0.71–0.79), and 0.77 (95% CI, 0.74–0.80), respectively, for the three visits. Our initial models were biased toward non-Hispanic black participants with a high predictive equality ratio of 1.31. We corrected this bias and reduced this ratio to 1.14. The top features stress the importance of using several tests, particularly for biomarkers and ultrasound measurements. Placental analytes were strong predictors for screening for the early onset of preeclampsia with severe features in the first two trimesters. Conclusion Experiments suggest that it is possible to create racial bias-free early screening models to predict the patients at risk of developing preeclampsia with severe features or eclampsia nulliparous pregnant study cohort.
Preeclampsia is a type of hypertension that develops during pregnancy. It is one of the leading causes for maternal morbidity with consequences during and after pregnancy. Because of its diverse clinical presentation, preeclampsia is a uniquely challenging adverse pregnancy outcome to predict and manage. In this paper, we explore preeclampsia in a nulliparous study cohort with machine learning techniques to build a model that distinguishes between participants most at risk for morbidity, those with preeclampsia with severe features or eclampsia, and the class of no pregnancy-related hypertension. We curated the dataset for this secondary analysis using only training examples that have all known biomarkers, factors, and placental analytes. We built classification models at discrete time points in pregnancy that combine risk factors for preeclampsia with severe features or eclampsia to help screen cases early in pregnancy. The time points are at 60 − 136 (V1), 160 − 216 (V2), 220 − 296 (V3) weeks gestation and at delivery (V4). We then analyzed the model prediction results and provided an interpretable report of cut-off points of the top contributing risk factors and their impact on prediction. Finally, we identified race-based biases in our models and describe how we mitigate those biases. We evaluated the results of four machine learning algorithms and found that ensemble methods outperformed non-ensemble methods. Random Forest models achieved an area under receiver operating characteristic curve at V1 of 0.68 ± 0.05, V2 of 0.73 ± 0.05, V3 of 0.76 ± 0.04 and V4 of 0.83 ± 0.03. Analyzing the Random Forest models, the features found to be most informative across all visits fall into several broad categories: weight, blood pressure measurements, uterine artery doppler measurements, diet intake and serum biomarkers. We found that our models are biased toward non-Hispanic black participants with a high predictive equality ratio of 1.31. We corrected this bias and reduced this ratio to 1.14. We also evaluated results for predictions of early cases versus late preeclampsia with severe features or eclampsia and found that placental analytes as the top contributors in model feature importance. Random Forest for this analysis achieved an area under receiver operating characteristic curve at V1 of 0.63 ± 0.11, V2 of 0.79 ± 0.11, V3 of 0.83 ± 0.08 and V4 of 0.84 ± 0.09. Our experiments suggest that it is important and possible to create screening models to predict the participants at risk of developing preeclampsia with severe features and eclampsia. The top features stress the importance of using several tests, in particular tests for biomarkers and ultrasound measurements. The models could be used as a screening tool as early as 6-13 weeks gestation to help clinicians identify participants who may subsequently develop preeclampsia, confirming the cases they suspect or identifying unsuspected cases. The proposed approach is easily adaptable to address any adverse pregnancy outcome with fairness.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2025 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.