Portfolio Project

COVID-19 Outbreak Drivers

Python XGBoost & SHAP

Analytics Python AWS

Context

I built an early-warning model to flag states at risk of crossing 90% ICU utilization in the next 7 days.

Approach

  • Cleaned and enriched 50k+ rows from the HHS hospital-capacity time series; added rolling stats, trends, and 1/3/7/14-day lag features.
  • Trained an XGBoost classifier with class-imbalance weighting and a strict time-based train/test split.

Impact

  • Used SHAP to highlight the top drivers and embedded an interactive plot in the report.
  • Top driver was the share of ICU beds occupied by COVID patients.
  • In the final snapshot (2023-06-11), Utah had the highest predicted 7-day breach probability (6.1%).

Problem Framing

Goal: estimate the probability a state will breach 90% ICU utilization within the next 7 days.

  • Target label: max(adult ICU bed utilization) over the next 7 days ≥ 0.90.
  • Breaches are rare, so accuracy alone would be misleading; I focused on ranking quality and precision/recall tradeoffs.
  • Output is a risk score meant to support decisions, not a perfect forecast.
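As a rough illustration of the target definition above, here is a minimal pandas sketch of a forward-looking label. The column names (`state`, `date`, `icu_utilization`, `breach_label`) and the helper `add_breach_label` are hypothetical, and the sketch assumes one row per state per day with no gaps in the dates; it is not necessarily how the project built its label.

```python
import pandas as pd


def add_breach_label(df, util_col="icu_utilization", horizon=7, threshold=0.90):
    """Flag state-days whose utilization peaks at or above `threshold`
    at any point in the following `horizon` days (today excluded)."""
    df = df.sort_values(["state", "date"]).copy()

    def future_max(s):
        # A rolling max over the reversed series looks forward in time;
        # shift(-1) drops today so the window covers days t+1..t+horizon.
        fwd = s[::-1].rolling(horizon, min_periods=horizon).max()[::-1]
        return fwd.shift(-1)

    df["future_max_util"] = df.groupby("state")[util_col].transform(future_max)
    df["breach_label"] = (df["future_max_util"] >= threshold).astype(int)
    # Rows without a full look-ahead window cannot be labeled, so drop them.
    return df.dropna(subset=["future_max_util"])
```

Dropping the trailing rows of each state keeps incomplete look-ahead windows from being silently labeled as "no breach".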

Data and Feature Engineering

  • Started with the HHS hospital-capacity time series (state × day) and handled missing data with forward-fills and pruning.
  • Added rolling-window features and lag features (1/3/7/14 days) to capture trend and momentum.
  • Built ratio features like ICU beds with COVID (%) to normalize across states.
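The feature steps above could look roughly like the following sketch. The raw column names (`icu_beds_covid`, `icu_beds_total`), the derived column names, and the helper `add_features` are assumptions made for illustration; the 7-day window is also an assumed default, since the text does not state the window sizes used.

```python
import pandas as pd


def add_features(df, lags=(1, 3, 7, 14), window=7):
    """Add a COVID ICU share ratio plus per-state lag, rolling, and trend features."""
    df = df.sort_values(["state", "date"]).copy()

    # Ratio feature: share of ICU beds occupied by COVID patients.
    # Rows with zero reported beds become NaN instead of inf.
    total = df["icu_beds_total"]
    df["icu_covid_share"] = (df["icu_beds_covid"] / total).where(total > 0)

    # Group by state so lags and windows never cross state boundaries.
    by_state = df.groupby("state")["icu_covid_share"]
    for lag in lags:
        df[f"icu_covid_share_lag{lag}"] = by_state.shift(lag)

    # Rolling mean smooths daily noise; the trend is the change over the window.
    df[f"icu_covid_share_roll{window}_mean"] = by_state.transform(
        lambda s: s.rolling(window, min_periods=1).mean()
    )
    df[f"icu_covid_share_trend{window}"] = df["icu_covid_share"] - by_state.shift(window)
    return df
```

All lag and rolling features use only current and past values, so they stay compatible with a time-based split.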

Modeling and Evaluation

  • Trained an XGBoost classifier with a time-based train/test split and class-imbalance handling.
  • Measured ranking quality with AUROC (0.606; 0.5 is chance level) and PR-AUC (0.060; best read against the low breach base rate).
  • Used the model as a risk scorer (probability output) rather than a hard yes/no classifier.
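A time-based split and a simple imbalance weight can be sketched as below. The helpers `time_split` and `imbalance_weight`, the `gap_days` parameter, and the column names are hypothetical. The negatives-to-positives ratio is a commonly suggested starting value for XGBoost's `scale_pos_weight` parameter; the text does not say which weighting the project actually used.

```python
import pandas as pd


def time_split(df, cutoff, date_col="date", gap_days=0):
    """Train on rows strictly before the cutoff, test on rows at or after it.

    `gap_days` (an assumed option) leaves a buffer before the cutoff so that
    forward-looking labels in the training set do not peek into the test period.
    """
    cutoff = pd.Timestamp(cutoff)
    train_end = cutoff - pd.Timedelta(days=gap_days)
    train = df[df[date_col] < train_end]
    test = df[df[date_col] >= cutoff]
    return train, test


def imbalance_weight(y):
    """Ratio of negatives to positives in the training labels."""
    pos = int((y == 1).sum())
    neg = int((y == 0).sum())
    if pos == 0:
        raise ValueError("no positive examples in the training labels")
    return neg / pos
```

With a 7-day label horizon, setting `gap_days=7` would keep every training label fully inside the training period.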

Explainability (SHAP)

  • Used SHAP to identify which daily metrics raise or lower the chance of an ICU crisis in the next week.
  • Key takeaway: when a high share of ICU beds is already filled by COVID patients, predicted risk rises quickly.
  • Renamed feature labels in the SHAP plot to keep it stakeholder-friendly.
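One way to rank drivers and apply readable labels is to average absolute SHAP values per feature. The sketch below assumes a SHAP value matrix (rows are state-days, columns are features) has already been computed, for example with the `shap` library's `TreeExplainer`; the helper `rank_drivers` and the feature names in the usage note are hypothetical.

```python
import numpy as np


def rank_drivers(shap_values, feature_names, friendly_labels=None):
    """Return (label, mean |SHAP|) pairs sorted from most to least influential."""
    values = np.asarray(shap_values, dtype=float)
    # Mean absolute SHAP value per feature is a common global importance score.
    importance = np.abs(values).mean(axis=0)
    order = np.argsort(importance)[::-1]
    labels = friendly_labels or {}
    # Fall back to the raw column name when no friendly label is provided.
    return [
        (labels.get(feature_names[i], feature_names[i]), float(importance[i]))
        for i in order
    ]
```

For example, passing `friendly_labels={"icu_covid_share": "ICU beds with COVID (%)"}` would swap the raw column name for the label shown to stakeholders.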

Operational Output

  • Exported a per-state, per-day CSV of 7-day breach probabilities for monitoring and reporting.
  • Most likely breach location in the final snapshot: UT on 2023-06-11 with a 7-day breach probability of 6.1%.
  • Designed the workflow to be rerun as new days arrive.
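The export and the "highest risk in the latest snapshot" lookup might be sketched as follows. The column name `breach_prob_7d` and the helpers `export_scores` and `top_risk_latest` are assumptions for illustration, not the project's actual code.

```python
import pandas as pd


def export_scores(scores, path=None):
    """Write per-state, per-day probabilities as CSV.

    Returns the CSV text when `path` is None; otherwise writes to `path`.
    """
    out = scores.sort_values(["date", "state"])[["state", "date", "breach_prob_7d"]]
    return out.to_csv(path, index=False)


def top_risk_latest(scores):
    """Return (state, date, probability) for the riskiest state on the latest date."""
    latest = scores[scores["date"] == scores["date"].max()]
    row = latest.loc[latest["breach_prob_7d"].idxmax()]
    return row["state"], row["date"], float(row["breach_prob_7d"])
```

Because both helpers work from whatever dates are present in the scored table, rerunning them after new days arrive refreshes the snapshot without code changes.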

What I'd Improve

  • Calibrate probabilities and tune thresholds for a clear precision/recall target.
  • Add outside signals (vaccination, policy, mobility, variants) to improve early warning.
  • Monitor drift and retrain when feature patterns shift.
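As one possible starting point for the threshold tuning mentioned above, the sketch below picks the highest score cutoff that still reaches a target recall and reports the precision at that cutoff. The helper `threshold_for_recall` and the default target are hypothetical, and the sketch assumes held-out labels and scores are available.

```python
import numpy as np


def threshold_for_recall(y_true, scores, target_recall=0.8):
    """Return (threshold, precision, recall) for the highest threshold
    whose recall is at least `target_recall`."""
    y = np.asarray(y_true).astype(bool)
    s = np.asarray(scores, dtype=float)
    n_pos = int(y.sum())
    if n_pos == 0:
        raise ValueError("no positive examples to compute recall")

    # Try candidate thresholds from strictest to loosest.
    for t in np.unique(s)[::-1]:
        flagged = s >= t
        tp = int((flagged & y).sum())
        recall = tp / n_pos
        if recall >= target_recall:
            precision = tp / int(flagged.sum())
            return float(t), float(precision), float(recall)
    raise ValueError("target_recall must be at most 1.0")
```

Choosing the strictest cutoff that meets the recall target keeps the number of false alarms as low as the target allows.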
