Author: Andrew Lee | UTHealth Houston SBMI
Stack: SQL (PostgreSQL) → Python → R → Tableau
Dataset: HCUP State Inpatient Databases (SID) + CMS-HCC mapping | 2016–2022
This project develops and validates a 30-day all-cause hospital readmission prediction model that quantifies the incremental value of Social Determinants of Health (SDOH) features over clinical-only baselines using nationally representative inpatient discharge data.
Primary dataset: HCUP SID (Healthcare Cost and Utilization Project — State Inpatient Databases) available from AHRQ. No institutional credentials required; publicly available under a Data Use Agreement (hcup-us.ahrq.gov). Supplemented with CMS-HCC risk score mapping from publicly available ICD-10-CM mapping files.
Key finding (target): AUROC ≥0.74 (SDOH+Clinical) vs ~0.70 (Clinical-only); NRI +2–8%
| Source | Variables Used | Access |
|---|---|---|
| HCUP SID Core | AGE, FEMALE, RACE, PAY1, ZIPINC_QRTL, LOS, DISPUNIFORM, ATYPE, HCUP_ED, AWEEKEND, VisitLink, DaysToEvent | AHRQ DUA — free |
| HCUP CCSR | ICD-10 diagnosis category flags (~600 CCSR categories) | AHRQ — free tool |
| CMS-HCC Mapping | ICD-10 → HCC code → HCC weight (2022 CMS-HCC v28) | CMS.gov — public |
| Neighborhood Atlas ADI | County-level Area Deprivation Index (ADI national rank) | UW-Madison — free |
SQL (PostgreSQL) Python R Tableau
───────────────── ────────────────────── ──────────────── ──────────────
01_cohort.sql → sdoh_engineering.py → statistical_ → readmission_
02_sdoh_features.sql merge_pipeline.py tests.R dashboard
03_clinical_features.sql train_baseline.py - DeLong AUROC
04_outcome_labels.sql → train_sdoh.py - NRI/IDI
shap_analysis.py ──────→ Tableau-ready CSV
| Criterion | Detail |
|---|---|
| Population | Adult inpatients age ≥18, HCUP SID 2016–2022 |
| Index admission | First inpatient admission per patient (VisitLink linkage) |
| Exclusions | In-hospital deaths; hospice discharge; newborn admissions; LOS <1 day |
| Outcome | 30-day all-cause readmission (SID DaysToEvent linkage) |
| Expected N | ~2.5–4M index admissions (varies by state selection) |
Temporal split: 2016–2019 train | 2020 val | 2021–2022 test
| Feature | HCUP Variable | Notes |
|---|---|---|
| Primary payer | PAY1 | Medicare/Medicaid/Private/Self-pay/Other |
| Race/ethnicity | RACE | HCUP uniform 6-category coding |
| ZIP income quartile | ZIPINC_QRTL | 1=lowest, 4=highest; census-linked |
| Area Deprivation Index | PSTCO → ADI crosswalk | County-level via Neighborhood Atlas |
| ED admission route | HCUP_ED | Emergency vs. scheduled |
| Weekend admission | AWEEKEND | Proxy for care access quality |
| Feature Group | Source | Detail |
|---|---|---|
| Elixhauser comorbidities (20) | HCUP CCSR | Van Walraven weighted score |
| CMS-HCC risk score | CMS mapping files | Estimated RAF from ICD-10 codes |
| Procedure categories | HCUP PRCCSR | Cardiac, orthopedic, GI, neuro |
| Discharge disposition | DISPUNIFORM | Routine/SNF/transfer/home health |
| LOS, age, prior admissions | SID core | Administrative |
| Model | Features | Hypertuning | Target AUROC |
|---|---|---|---|
| Baseline | Clinical only (no SDOH) | Optuna 60 trials, 5-fold CV | ~0.70 |
| Full | Clinical + SDOH | Optuna 60 trials, 5-fold CV | ≥0.74 |
Both models: XGBoost with Youden-J optimal threshold. Class imbalance: SMOTE (target 40% positive rate on training set only).
sdoh-readmission/
├── sql/
│ ├── 01_cohort_definition.sql # Cohort + readmission flag (HCUP SID)
│ ├── 02_sdoh_features.sql # SDOH encoding (PAY1, RACE, ZIPINC, ADI)
│ ├── 03_clinical_features.sql # CCSR comorbidities + CMS-HCC risk
│ └── 04_outcome_labels.sql # Master feature table + CSV export
├── src/
│ ├── features/
│ │ ├── sdoh_engineering.py # MICE imputation, composite SDOH score
│ │ └── merge_pipeline.py # Train/val/test splits, SMOTE, scaling
│ └── models/
│ ├── train_baseline.py # XGBoost clinical-only
│ ├── train_sdoh.py # XGBoost clinical+SDOH
│ └── shap_analysis.py # SHAP beeswarm, bar, dependence plots
├── r_analysis/
│ └── statistical_tests.R # DeLong AUROC, NRI/IDI, calibration
└── requirements.txt
# 1. Load HCUP SID data into PostgreSQL (follow AHRQ load programs)
# 2. Load CCSR mapping: hcup_sid.ccsr_dx_flags
# 3. Load CMS-HCC mapping: sdoh_readmission.cms_hcc_mapping
# 4. Load ADI crosswalk: sdoh_readmission.adi_county_crosswalk
# Run SQL pipeline
psql -f sql/01_cohort_definition.sql
psql -f sql/02_sdoh_features.sql
psql -f sql/03_clinical_features.sql
psql -f sql/04_outcome_labels.sql
# Export: \COPY sdoh_readmission.master_features TO 'data/master_features.csv' CSV HEADER
# Python pipeline
python src/features/sdoh_engineering.py --input data/master_features.csv
python src/features/merge_pipeline.py
python src/models/train_baseline.py
python src/models/train_sdoh.py
python src/models/shap_analysis.py
# R validation
Rscript r_analysis/statistical_tests.R