Skip to content

SaeMind/sdoh-readmission

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SDOH Readmission Risk Stratification

Author: Andrew Lee | UTHealth Houston SBMI
Stack: SQL (PostgreSQL) → Python → R → Tableau
Dataset: HCUP State Inpatient Databases (SID) + CMS-HCC mapping | 2016–2022


Overview

This project develops and validates a 30-day all-cause hospital readmission prediction model that quantifies the incremental value of Social Determinants of Health (SDOH) features over clinical-only baselines using nationally representative inpatient discharge data.

Primary dataset: HCUP SID (Healthcare Cost and Utilization Project — State Inpatient Databases) available from AHRQ. No institutional credentials required; publicly available under a Data Use Agreement (hcup-us.ahrq.gov). Supplemented with CMS-HCC risk score mapping from publicly available ICD-10-CM mapping files.

Key finding (target): AUROC ≥0.74 (SDOH+Clinical) vs ~0.70 (Clinical-only); NRI +2–8%


Data Sources

Source Variables Used Access
HCUP SID Core AGE, FEMALE, RACE, PAY1, ZIPINC_QRTL, LOS, DISPUNIFORM, ATYPE, HCUP_ED, AWEEKEND, VisitLink, DaysToEvent AHRQ DUA — free
HCUP CCSR ICD-10 diagnosis category flags (~600 CCSR categories) AHRQ — free tool
CMS-HCC Mapping ICD-10 → HCC code → HCC weight (2022 CMS-HCC v28) CMS.gov — public
Neighborhood Atlas ADI County-level Area Deprivation Index (ADI national rank) UW-Madison — free

Pipeline Architecture

SQL (PostgreSQL)          Python                   R                  Tableau
─────────────────         ──────────────────────   ────────────────   ──────────────
01_cohort.sql          → sdoh_engineering.py     → statistical_     → readmission_
02_sdoh_features.sql      merge_pipeline.py          tests.R             dashboard
03_clinical_features.sql  train_baseline.py           - DeLong AUROC
04_outcome_labels.sql  →  train_sdoh.py               - NRI/IDI
                          shap_analysis.py  ──────→  Tableau-ready CSV

Cohort

Criterion Detail
Population Adult inpatients age ≥18, HCUP SID 2016–2022
Index admission First inpatient admission per patient (VisitLink linkage)
Exclusions In-hospital deaths; hospice discharge; newborn admissions; LOS <1 day
Outcome 30-day all-cause readmission (SID DaysToEvent linkage)
Expected N ~2.5–4M index admissions (varies by state selection)

Temporal split: 2016–2019 train | 2020 val | 2021–2022 test


Feature Groups

SDOH Features (HCUP-native + ADI)

Feature HCUP Variable Notes
Primary payer PAY1 Medicare/Medicaid/Private/Self-pay/Other
Race/ethnicity RACE HCUP uniform 6-category coding
ZIP income quartile ZIPINC_QRTL 1=lowest, 4=highest; census-linked
Area Deprivation Index PSTCO → ADI crosswalk County-level via Neighborhood Atlas
ED admission route HCUP_ED Emergency vs. scheduled
Weekend admission AWEEKEND Proxy for care access quality

Clinical Features (CCSR + CMS-HCC)

Feature Group Source Detail
Elixhauser comorbidities (20) HCUP CCSR Van Walraven weighted score
CMS-HCC risk score CMS mapping files Estimated RAF from ICD-10 codes
Procedure categories HCUP PRCCSR Cardiac, orthopedic, GI, neuro
Discharge disposition DISPUNIFORM Routine/SNF/transfer/home health
LOS, age, prior admissions SID core Administrative

Models

Model Features Hypertuning Target AUROC
Baseline Clinical only (no SDOH) Optuna 60 trials, 5-fold CV ~0.70
Full Clinical + SDOH Optuna 60 trials, 5-fold CV ≥0.74

Both models: XGBoost with Youden-J optimal threshold. Class imbalance: SMOTE (target 40% positive rate on training set only).


Directory Structure

sdoh-readmission/
├── sql/
│   ├── 01_cohort_definition.sql       # Cohort + readmission flag (HCUP SID)
│   ├── 02_sdoh_features.sql           # SDOH encoding (PAY1, RACE, ZIPINC, ADI)
│   ├── 03_clinical_features.sql       # CCSR comorbidities + CMS-HCC risk
│   └── 04_outcome_labels.sql          # Master feature table + CSV export
├── src/
│   ├── features/
│   │   ├── sdoh_engineering.py        # MICE imputation, composite SDOH score
│   │   └── merge_pipeline.py          # Train/val/test splits, SMOTE, scaling
│   └── models/
│       ├── train_baseline.py          # XGBoost clinical-only
│       ├── train_sdoh.py              # XGBoost clinical+SDOH
│       └── shap_analysis.py           # SHAP beeswarm, bar, dependence plots
├── r_analysis/
│   └── statistical_tests.R            # DeLong AUROC, NRI/IDI, calibration
└── requirements.txt

Reproducing Results

# 1. Load HCUP SID data into PostgreSQL (follow AHRQ load programs)
# 2. Load CCSR mapping: hcup_sid.ccsr_dx_flags
# 3. Load CMS-HCC mapping: sdoh_readmission.cms_hcc_mapping
# 4. Load ADI crosswalk: sdoh_readmission.adi_county_crosswalk

# Run SQL pipeline
psql -f sql/01_cohort_definition.sql
psql -f sql/02_sdoh_features.sql
psql -f sql/03_clinical_features.sql
psql -f sql/04_outcome_labels.sql
# Export: \COPY sdoh_readmission.master_features TO 'data/master_features.csv' CSV HEADER

# Python pipeline
python src/features/sdoh_engineering.py --input data/master_features.csv
python src/features/merge_pipeline.py
python src/models/train_baseline.py
python src/models/train_sdoh.py
python src/models/shap_analysis.py

# R validation
Rscript r_analysis/statistical_tests.R

Releases

No releases published

Packages

 
 
 

Contributors