Portfolio Project
COVID-19 Outbreak Drivers
Python, XGBoost, SHAP
Context
I built an early-warning model to flag states at risk of crossing 90% ICU utilization in the next 7 days.
Approach
- Cleaned and enriched 50k+ rows from the HHS hospital-capacity time series; added rolling stats, trends, and 1/3/7/14-day lag features.
- Trained an XGBoost classifier with class-imbalance weighting and a strict time-based train/test split.
Impact
- Used SHAP to highlight the top drivers and embedded an interactive plot in the report.
- Top driver was the share of ICU beds occupied by COVID patients.
- In the final snapshot (2023-06-11), Utah had the highest predicted 7-day breach probability (6.1%).
Problem Framing
Goal: estimate the probability a state will breach 90% ICU utilization within the next 7 days.
- Target label: 1 if the maximum adult ICU bed utilization over the next 7 days is ≥ 0.90, otherwise 0.
- Breaches are rare, so I focused on ranking and precision/recall tradeoffs.
- Output is a risk score meant to support decisions, not a perfect forecast.
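A minimal sketch of how a forward-looking label like this could be built with pandas. It is an illustration, not the project's actual code: the column names (`state`, `date`, `icu_utilization`) and the helper `add_breach_label` are hypothetical, and it assumes one row per state per day with no missing dates, so that "next 7 rows" means "next 7 days".

```python
import numpy as np
import pandas as pd


def add_breach_label(df, util_col="icu_utilization", horizon=7, threshold=0.90):
    """Add a 0/1 label: does utilization reach `threshold` in the next `horizon` days?

    Rows without a full future window get NaN so they can be dropped before training.
    """
    df = df.sort_values(["state", "date"]).copy()

    def future_max(s):
        # Reverse the series, take a trailing rolling max, and reverse back.
        # That yields max(s[t .. t+horizon-1]); shift(-1) moves the window
        # to s[t+1 .. t+horizon], so the current day is excluded.
        fwd = s[::-1].rolling(horizon, min_periods=horizon).max()[::-1]
        return fwd.shift(-1)

    df["future_max_util"] = df.groupby("state")[util_col].transform(future_max)
    df["breach_7d"] = np.where(
        df["future_max_util"].isna(),
        np.nan,
        (df["future_max_util"] >= threshold).astype(float),
    )
    return df
```

Computing the window per state keeps one state's future values from leaking into another state's label.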
Data and Feature Engineering
- Started with the HHS hospital-capacity time series (state × day) and handled missing data with forward-fills and pruning.
- Added rolling-window features and lag features (1/3/7/14 days) to capture trend and momentum.
- Built ratio features like ICU beds with COVID (%) to normalize across states.
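The steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the project's pipeline: the column names (`icu_utilization`, `icu_covid_patients`, `icu_beds_total`), the 7-day rolling window, and the choice of total ICU beds as the ratio's denominator are all hypothetical.

```python
import pandas as pd

LAGS = (1, 3, 7, 14)  # lag horizons named in the write-up


def add_features(df, value_cols, col="icu_utilization", window=7):
    """Forward-fill gaps per state, then add lag, rolling, trend, and ratio features."""
    df = df.sort_values(["state", "date"]).copy()

    # Forward-fill within each state so values never carry across states.
    df[value_cols] = df.groupby("state")[value_cols].ffill()

    g = df.groupby("state")[col]
    for lag in LAGS:
        df[f"{col}_lag{lag}"] = g.shift(lag)

    # Trailing mean that includes the current day (known at prediction time).
    df[f"{col}_roll{window}_mean"] = g.transform(
        lambda s: s.rolling(window, min_periods=1).mean()
    )
    # Simple momentum: change versus `window` days ago.
    df[f"{col}_trend{window}"] = df[col] - g.shift(window)

    # Ratio feature (assumed definition): COVID ICU patients / total ICU beds.
    denom = df["icu_beds_total"].replace(0, float("nan"))
    df["covid_icu_share"] = df["icu_covid_patients"] / denom
    return df
```

All lags and rolling windows look backward only, which matters for keeping the later time-based evaluation honest.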
Modeling and Evaluation
- Trained an XGBoost classifier with a time-based train/test split and class-imbalance handling.
- Measured ranking quality with AUROC (0.606) and PR-AUC (0.060).
- Used the model as a risk scorer (probability output) rather than a hard yes/no classifier.
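A rough sketch of a time-based split with class-imbalance weighting. The helper names, the `breach_7d` target column, and the hyperparameters are illustrative assumptions, not the values used in the project. The optional `gap_days` argument is also an addition for illustration: with a 7-day label horizon, training rows just before the cutoff have labels that depend on test-period data, and leaving a gap avoids that overlap.

```python
import pandas as pd


def time_split(df, cutoff, date_col="date", gap_days=0):
    """Train on rows before (cutoff - gap_days); test on rows at or after cutoff."""
    cutoff = pd.Timestamp(cutoff)
    train = df[df[date_col] < cutoff - pd.Timedelta(days=gap_days)]
    test = df[df[date_col] >= cutoff]
    return train, test


def pos_weight(y):
    """Negative-to-positive ratio, a common setting for XGBoost's scale_pos_weight."""
    pos = int((y == 1).sum())
    neg = int((y == 0).sum())
    return neg / max(pos, 1)


def fit_risk_model(train, features, target="breach_7d"):
    """Fit a weighted XGBoost classifier; hyperparameters are placeholders."""
    from xgboost import XGBClassifier  # imported lazily; requires xgboost

    model = XGBClassifier(
        n_estimators=300,
        max_depth=4,
        learning_rate=0.05,
        scale_pos_weight=pos_weight(train[target]),
        eval_metric="aucpr",
    )
    model.fit(train[features], train[target])
    return model
```

With a model fitted this way, `model.predict_proba(test[features])[:, 1]` gives the probability-style risk score described above rather than a hard class label.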
Explainability (SHAP)
- Used SHAP to identify which daily metrics raise or lower the chance of an ICU crisis in the next week.
- Key takeaway: when a high share of ICU beds is already filled by COVID patients, risk rises quickly.
- Renamed feature labels in the SHAP plot to keep it stakeholder-friendly.
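One way the driver ranking and label renaming could look, as a sketch. It assumes a fitted tree model and a feature DataFrame `X`. The helper names and the example label mapping are hypothetical. Ranking by mean absolute SHAP value is a common convention and is assumed here, since the write-up does not say how drivers were ordered.

```python
import numpy as np
import pandas as pd


def shap_values_for(model, X):
    """Compute SHAP values for a tree model (requires the shap package)."""
    import shap  # imported lazily

    return shap.TreeExplainer(model).shap_values(X)


def rank_drivers(shap_values, feature_names, labels=None):
    """Rank features by mean |SHAP|, optionally swapping in readable labels."""
    labels = labels or {}
    importance = np.abs(np.asarray(shap_values)).mean(axis=0)
    names = [labels.get(f, f) for f in feature_names]
    return pd.Series(importance, index=names).sort_values(ascending=False)


# Hypothetical mapping from raw column names to stakeholder-friendly labels.
FRIENDLY = {"covid_icu_share": "ICU beds with COVID (%)"}
```

For the plot itself, renaming the columns first, for example `X.rename(columns=FRIENDLY)`, and passing that frame to `shap.summary_plot` makes the readable labels appear on the axis.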
Operational Output
- Exported a per-state, per-day CSV of 7-day breach probabilities for monitoring and reporting.
- Most likely breach location in the final snapshot: UT on 2023-06-11 with a 7-day breach probability of 6.1%.
- Designed the workflow to be rerun as new days arrive.
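A small sketch of how the latest-snapshot ranking could be pulled from a scored table. The column names (`state`, `date`, `breach_prob_7d`) and the helper `latest_snapshot` are assumptions for illustration.

```python
import pandas as pd


def latest_snapshot(scores, date_col="date", risk_col="breach_prob_7d"):
    """Return the most recent day's rows, highest predicted risk first."""
    latest = scores[date_col].max()
    snap = scores[scores[date_col] == latest]
    return snap.sort_values(risk_col, ascending=False).reset_index(drop=True)
```

Writing the full table with `scores.to_csv("breach_risk.csv", index=False)` (a hypothetical filename) and calling `latest_snapshot(scores).head()` each time new days are appended would reproduce the monitoring view described above.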
What I'd Improve
- Calibrate probabilities and tune thresholds for a clear precision/recall target.
- Add outside signals (vaccination, policy, mobility, variants) to improve early warning.
- Monitor drift and retrain when feature patterns shift.
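For the threshold-tuning idea, a minimal NumPy sketch: given held-out labels and scores, pick the lowest score cutoff that still meets a target precision, which maximizes recall among qualifying cutoffs. The function name and return shape are hypothetical, and calibration itself is not covered here.

```python
import numpy as np


def threshold_for_precision(y_true, scores, target_precision):
    """Return (threshold, precision, recall) for the highest-recall cutoff
    whose precision is at least `target_precision`, or None if none qualifies.

    Rows with score >= threshold are flagged.
    """
    y = np.asarray(y_true, dtype=float)
    s = np.asarray(scores, dtype=float)

    order = np.argsort(-s, kind="stable")  # highest score first
    y_sorted, s_sorted = y[order], s[order]

    tp = np.cumsum(y_sorted)               # true positives among the top k
    k = np.arange(1, len(y) + 1)
    precision = tp / k
    recall = tp / max(y.sum(), 1.0)

    # Only cut where the next score differs, so tied scores stay together.
    is_cut = np.r_[s_sorted[1:] != s_sorted[:-1], True]
    ok = np.where((precision >= target_precision) & is_cut)[0]
    if len(ok) == 0:
        return None
    i = ok[-1]  # deepest qualifying cut gives the highest recall
    return float(s_sorted[i]), float(precision[i]), float(recall[i])
```

Because breaches are rare, the threshold should be chosen on a validation period that is separate from the final test period, so the reported precision and recall are not optimistic.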