Add cleaning script for the generated health module (SF-12)#84

Open
hmgaudecker wants to merge 6 commits into main from feat/health-module

@hmgaudecker

Collaborator

Summary

Adds clean_modules/health.py for the generated health dataset (biennial since 2002), following the existing clean(raw_data) conventions:

  • sf12_pcs, sf12_mcs — SF-12v2 physical / mental component summary scales (norm-based, mean 50 / SD 10 in the SOEP 2004 population)
  • the eight norm-based subscales (sf12_physical_functioning_nbs through sf12_mental_health_nbs)
  • sf12_valid — completeness flag for the twelve underlying items
  • bmi_health — BMI as provided in the generated file
  • identifiers p_id, hh_id_original, survey_year (the health file carries pid/cid/syear but no hid)
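
As a rough illustration of the mapping listed above, here is a minimal sketch of a `clean(raw_data)` function. It is not the module added in this PR. It covers only the raw names that the description reports as verified, and the pairings `cid` to `hh_id_original`, `pf_nbs` to `sf12_physical_functioning_nbs`, and `mh_nbs` to `sf12_mental_health_nbs` are assumptions inferred from the text. Dtype handling and the `sf12_valid` flag are left out.

```python
import pandas as pd

# Raw name -> cleaned name. Only names mentioned in this PR are used;
# the individual pairings are assumptions, not taken from health.py.
RENAMES = {
    "pid": "p_id",
    "cid": "hh_id_original",
    "syear": "survey_year",
    "pcs": "sf12_pcs",
    "mcs": "sf12_mcs",
    "pf_nbs": "sf12_physical_functioning_nbs",
    "mh_nbs": "sf12_mental_health_nbs",
    "bmi": "bmi_health",
}


def clean(raw_data: pd.DataFrame) -> pd.DataFrame:
    """Copy the selected raw columns into a new frame under their cleaned names."""
    out = pd.DataFrame(index=raw_data.index)
    for raw_name, clean_name in RENAMES.items():
        out[clean_name] = raw_data[raw_name]
    return out
```

Building a fresh frame instead of renaming in place means raw columns that are not listed never leak into the cleaned output.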

Verification

Variable names were verified against paneldata.org (pcs, mcs, pf_nbs, mh_nbs, valid, bmi, pid, cid, syear); the remaining six subscales follow the documented *_nbs naming scheme. I could not run the pipeline against raw data, so two things deserve a spot-check on the first run:

  • the six unchecked *_nbs subscale names
  • the exact label string of valid ("[1] Yes", mirroring pequiv's English-labeled items such as e11102)
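
The two spot-checks above could be scripted roughly as follows. This is a sketch under assumptions: `spot_check` is a hypothetical helper, not part of the repository, and the caller supplies the expected column names so that no unverified subscale name is hard-coded here.

```python
import pandas as pd


def spot_check(
    raw_data: pd.DataFrame,
    expected_columns: list[str],
    valid_label: str = "[1] Yes",
) -> dict:
    """Report missing raw columns and whether the expected `valid` label occurs."""
    missing = [name for name in expected_columns if name not in raw_data.columns]
    if "valid" in raw_data.columns:
        # Compare on string form so categorical and object columns behave alike.
        observed = set(raw_data["valid"].dropna().astype(str).unique())
    else:
        observed = set()
    return {
        "missing_columns": missing,
        "valid_label_found": valid_label in observed,
        "observed_valid_labels": sorted(observed),
    }
```

On the first run against real data, a non-empty `missing_columns` would point at a wrong `*_nbs` name, and `observed_valid_labels` would show the actual label strings if `"[1] Yes"` is not among them.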

Lint hooks (prek run --files src/soep_preparation/clean_modules/health.py) pass.

🤖 Generated with Claude Code

The health dataset (biennial since 2002) carries the SOEP version of
the SF-12v2: the physical and mental component summary scales plus the
eight norm-based subscales (standardized to mean 50 / SD 10 in the
SOEP 2004 population), a completeness flag for the twelve underlying
items, and BMI.
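
To make "mean 50 / SD 10" concrete, here is a minimal sketch of the final standardization step of norm-based scoring. It is not the SF-12 scoring algorithm: the item weighting that produces the raw scores is not covered, and using the population standard deviation of the reference sample is an assumption. `norm_based_score` is a hypothetical name.

```python
from statistics import mean, pstdev


def norm_based_score(raw_scores: list[float], reference_scores: list[float]) -> list[float]:
    """Rescale raw scores so the reference population has mean 50 and SD 10."""
    ref_mean = mean(reference_scores)
    ref_sd = pstdev(reference_scores)  # population SD of the reference sample
    return [50 + 10 * (score - ref_mean) / ref_sd for score in raw_scores]
```

A score of 60 on such a scale therefore sits one reference standard deviation above the reference mean, which is what makes the scales comparable across waves.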

Variable names verified against paneldata.org (pcs, mcs, pf_nbs,
mh_nbs, valid, bmi, pid, cid, syear); the remaining six subscales
follow the documented *_nbs naming scheme. The exact label string of
`valid` ("[1] Yes", mirroring pequiv's English labels) should be
spot-checked on the first run against real data.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@codecov

codecov Bot commented Jun 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

hmgaudecker and others added 5 commits June 17, 2026 06:06
The health module's SF-12 component/subscale scores and BMI arrive from the raw
data as float64, but `object_to_float` requires object dtype and raised
`TypeError: Expected dtype object, got float64` in the pipeline.

Add a `float_to_float` utility (mirroring `float_to_int`) that codes SOEP
negative missing codes as NA and applies the smallest float dtype, and use it
for the already-numeric health columns.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
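
A minimal sketch of what a utility like the one described in this commit might look like. It is not the code from the PR. The body, the dtype check, and the use of NumPy-backed floats with NaN as the NA marker are assumptions; only the function name and the `code_negative_values_as_na` keyword appear in the diff.

```python
import pandas as pd


def float_to_float(
    series: pd.Series,
    code_negative_values_as_na: bool = True,
) -> pd.Series:
    """Set negative missing codes to NA and downcast to a smaller float dtype."""
    if not pd.api.types.is_float_dtype(series.dtype):
        # Assumed to mirror the strictness of object_to_float.
        raise TypeError(f"Expected float dtype, got {series.dtype}")
    if code_negative_values_as_na:
        # SOEP missing codes are negative; valid scores and BMI are not.
        series = series.mask(series < 0)
    # Let pandas pick a smaller float dtype where the values allow it.
    return pd.to_numeric(series, downcast="float")
```

For example, `float_to_float(pd.Series([48.5, -2.0, 51.25]))` keeps the two scores and turns the `-2.0` into NA.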
@hmgaudecker hmgaudecker requested a review from MImmesberger June 18, 2026 09:04
return apply_smallest_int_dtype(series=series)


def float_to_float(

Member

This duplicates _remove_missing_data_values in combination with apply_smallest_float_dtype. I think it makes sense to keep these two because, depending on context, we sometimes want to replace -2 with NA and sometimes with 0.0.

(The name _remove_missing_data_values is a bit confusing: what it actually does is convert negative values to NA. A rename would be good.)
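
The context dependence mentioned here could look roughly like the sketch below. `recode_missing_codes` is a hypothetical stand-in, not the repository's `_remove_missing_data_values`, whose signature is not shown in this thread. It assumes that -2 is the code that sometimes needs a substantive value, while all other negative codes always become NA.

```python
import numpy as np
import pandas as pd


def recode_missing_codes(
    series: pd.Series,
    value_for_minus_two: float = np.nan,
) -> pd.Series:
    """Replace -2 with a context-dependent value and other negatives with NA."""
    out = series.mask(series == -2, value_for_minus_two)
    # Any remaining negative value is treated as a missing code.
    return out.mask(out < 0)


# A score where -2 carries no information: everything negative becomes NA.
score = recode_missing_codes(pd.Series([48.5, -2.0, -1.0]))

# An amount where -2 is read as zero: pass 0.0 explicitly.
amount = recode_missing_codes(pd.Series([120.0, -2.0, -1.0]), value_for_minus_two=0.0)
```

Keeping the recoding separate from the dtype step lets each call site choose the replacement without duplicating the downcasting logic.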


# Component summary scales. The SF-12 scores and BMI arrive as floats with
# SOEP missing codes encoded as negative values.
out["sf12_pcs"] = float_to_float(raw_data["pcs"], code_negative_values_as_na=True)

Member

No idea what sf12, mcs, nbs, pcs actually mean, but keep it if it's jargon.
