Data Scientist, Texas Children's Hospital, Houston, Texas, United States
Background: Necrotizing enterocolitis (NEC) is a severe gastrointestinal emergency in preterm infants, associated with high mortality and morbidity. Although NEC prevention is challenging, reductions in NEC rates are possible. Early prediction and intervention hold promise for decreasing NEC incidence and improving outcomes. Our study explores machine learning techniques, including random forest and Extreme Gradient Boosting (XGBoost), for accurate and early NEC prediction.
Objective: To address common challenges such as missing data and class imbalance using imputation and sampling techniques, and then to compare logistic regression, random forest, and XGBoost to identify the algorithm(s) that most accurately predict NEC.
Design/Methods: To address missing data and class imbalance, we applied imputation and sampling techniques to the Vermont Oxford dataset, encompassing 3,463 preterm infants at Texas Children's Hospital from 2008 to 2022. This study was approved by the Baylor College of Medicine Institutional Review Board. Multiple Imputation by Chained Equations (MICE) handled missing data, while over-sampling (Synthetic Minority Over-sampling Technique, SMOTE, and Adaptive Synthetic sampling, ADASYN) and under-sampling (SMOTE-TOMEK and SMOTE-ENN) addressed class imbalance. We assessed the performance of logistic regression, random forest, and XGBoost using AUROC, F1 score, precision, and recall (Table 1, Figure 2).
Results: Models trained on sampled data generally achieved higher recall than their unsampled counterparts (Table 1). Using random forest for feature selection (Figure 1), combined with over-sampling the minority class using SMOTE and ADASYN, as well as under-sampling with SMOTE-ENN, resulted in higher recall scores: 0.82 for logistic regression with SMOTE, 1.00 for logistic regression with ADASYN, and 1.00 for logistic regression with SMOTE-ENN.
Across all combinations of sampling techniques and machine learning algorithms, logistic regression with ADASYN and SMOTE-ENN demonstrated improved recall scores of 0.82 and 1.00, along with AUROC scores of 0.86.
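The over-sampling idea behind SMOTE can be illustrated with a minimal sketch. This is not the study's implementation, which the abstract does not specify. It is a simplified, standard-library-only version of SMOTE-style interpolation, and the function name `smote_oversample`, its parameters, and the toy data are all hypothetical. Each synthetic minority row is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbours.

```python
import math
import random


def smote_oversample(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority-class rows by SMOTE-style interpolation.

    minority: list of equal-length numeric feature lists (minority class only).
    Simplified sketch; hypothetical helper, not the study's code.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours of the base point.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        neighbour = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy, made-up data (not from the study): 3 minority rows, 2 features.
toy_minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2]]
new_rows = smote_oversample(toy_minority, n_synthetic=4, k=2)
```

Because every synthetic row is an interpolation between two real minority rows, it always falls within the range of the observed minority values. ADASYN follows the same interpolation idea but generates more synthetic rows around minority samples that are harder to learn.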
Conclusion(s): We demonstrate that the sampling techniques ADASYN, SMOTE-TOMEK, and SMOTE-ENN generally improved recall scores. This suggests that further model optimization could be achieved through feature selection, particularly when using XGBoost with SMOTE-ENN. Addressing missing data and class imbalance can enhance predictive models for NEC in preterm neonates. Machine learning algorithms, incorporating relevant clinical variables early on, offer the potential for early NEC prediction and improved patient outcomes.
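To make concrete why recall is the metric emphasized for a rare outcome such as NEC, the following sketch computes recall, precision, and F1 from binary labels. The helper `recall_precision_f1` and the label vectors are hypothetical and made up for illustration; they are not the study's data or code.

```python
def recall_precision_f1(y_true, y_pred):
    """Return (recall, precision, F1) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, precision, f1


# Made-up example: 2 positive cases among 10 infants.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A model that never predicts the positive class is 80% accurate here,
# yet it misses every case: recall 0.0.
always_negative = [0] * 10
print(recall_precision_f1(y_true, always_negative))

# A more sensitive model flags both true cases plus two false alarms:
# recall 1.0, precision 0.5.
sensitive = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
print(recall_precision_f1(y_true, sensitive))
```

The example shows the trade-off: higher recall can come with lower precision, which is why recall is best read alongside precision, F1, and AUROC.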