Add cleaning script for the generated health module (SF-12) #84
Open
hmgaudecker wants to merge 6 commits into
The health dataset (biennial since 2002) carries the SOEP version of
the SF-12v2: the physical and mental component summary scales plus the
eight norm-based subscales (standardized to mean 50 / SD 10 in the
SOEP 2004 population), a completeness flag for the twelve underlying
items, and BMI.
Variable names verified against paneldata.org (pcs, mcs, pf_nbs,
mh_nbs, valid, bmi, pid, cid, syear); the remaining six subscales
follow the documented *_nbs naming scheme. The exact label string of
`valid` ("[1] Yes", mirroring pequiv's English labels) should be
spot-checked on the first run against real data.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Codecov Report: ✅ All modified and coverable lines are covered by tests.
The health module's SF-12 component/subscale scores and BMI arrive from the raw data as float64, but `object_to_float` requires object dtype and raised `TypeError: Expected dtype object, got float64` in the pipeline. Add a `float_to_float` utility (mirroring `float_to_int`) that codes SOEP negative missing codes as NA and applies the smallest float dtype, and use it for the already-numeric health columns.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
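A minimal sketch of what such a utility might look like, based only on the commit message above. The name `float_to_float_sketch` is hypothetical, and the use of `Series.where` and `pd.to_numeric(..., downcast="float")` is an assumption standing in for the repository's own helpers (for example `apply_smallest_float_dtype`), whose implementations are not shown in this conversation.

```python
import pandas as pd


def float_to_float_sketch(
    series: pd.Series, code_negative_values_as_na: bool = True
) -> pd.Series:
    """Hypothetical stand-in for the PR's float_to_float, not the real code."""
    # Mirror the dtype guard described for object_to_float, but for floats.
    if not pd.api.types.is_float_dtype(series):
        raise TypeError(f"Expected float dtype, got {series.dtype}")
    if code_negative_values_as_na:
        # SOEP missing codes are negative; keep non-negative values only.
        # Values failing the condition (and existing NaN) become NaN.
        series = series.where(series >= 0)
    # Assumption: downcast via pandas, which goes no smaller than float32.
    return pd.to_numeric(series, downcast="float")


# Toy values for illustration, not real SOEP data.
raw = pd.Series([52.5, -2.0, 47.0, -1.0], name="pcs")
cleaned = float_to_float_sketch(raw)
```

Whether the real utility returns NumPy floats with NaN or a nullable pandas float dtype is not stated in the commit message, so the sketch picks the simpler NaN variant.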
…ation into feat/health-module
MImmesberger
approved these changes
Jun 18, 2026
    return apply_smallest_int_dtype(series=series)


def float_to_float(
Member
This duplicates _remove_missing_data_values in combination with apply_smallest_float_dtype. I think it makes sense to keep these two because, depending on context, we want to replace -2 with NA and sometimes with 0.0.
(The name `_remove_missing_data_values` is a bit confusing: what it actually does is convert negative values to NA. A rename would be good.)
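To illustrate the distinction raised in this comment, here is a small sketch of the two treatments of the code -2. The function names are hypothetical and do not correspond to the repository's `_remove_missing_data_values` or `float_to_float`; which columns need which treatment is a judgment the sketch does not try to make.

```python
import pandas as pd


def negatives_to_na(series: pd.Series) -> pd.Series:
    """Replace every negative code with a missing value."""
    return series.where(series >= 0)


def minus_two_to_zero(series: pd.Series) -> pd.Series:
    """Treat -2 as a true zero, then turn the other negative codes into NA."""
    return negatives_to_na(series.replace(-2.0, 0.0))


# Toy values for illustration, not real SOEP data.
raw = pd.Series([1200.0, -2.0, -1.0])
as_na = negatives_to_na(raw)      # -2 and -1 both become missing
as_zero = minus_two_to_zero(raw)  # -2 becomes 0.0, -1 becomes missing
```

Keeping the two behaviours in separate, explicitly named functions makes the choice visible at each call site, which is in line with the rename suggested above.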
# Component summary scales. The SF-12 scores and BMI arrive as floats with
# SOEP missing codes encoded as negative values.
out["sf12_pcs"] = float_to_float(raw_data["pcs"], code_negative_values_as_na=True)
Member
No idea what sf12, mcs, nbs, pcs actually mean, but keep it if it's jargon.
Summary
Adds `clean_modules/health.py` for the generated `health` dataset (biennial since 2002), following the existing `clean(raw_data)` conventions:
- `sf12_pcs`, `sf12_mcs`: SF-12v2 physical / mental component summary scales (norm-based, mean 50 / SD 10 in the SOEP 2004 population)
- the eight norm-based subscales (`sf12_physical_functioning_nbs` … `sf12_mental_health_nbs`)
- `sf12_valid`: completeness flag for the twelve underlying items
- `bmi_health`: BMI as provided in the generated file
- `p_id`, `hh_id_original`, `survey_year` (the health file carries `pid` / `cid` / `syear` but no `hid`)

Verification
Variable names were verified against paneldata.org (`pcs`, `mcs`, `pf_nbs`, `mh_nbs`, `valid`, `bmi`, `pid`, `cid`, `syear`); the remaining six subscales follow the documented `*_nbs` naming scheme. I could not run the pipeline against raw data, so two things deserve a spot-check on the first run:
- the `*_nbs` subscale names
- the label string of `valid` (`"[1] Yes"`, mirroring pequiv's English-labeled items such as `e11102`)

Lint hooks (`prek run --files src/soep_preparation/clean_modules/health.py`) pass.

🤖 Generated with Claude Code
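The spot-check of the `valid` label could be done with a small helper along these lines. This is a sketch under assumptions: `check_valid_labels` is a hypothetical name, the toy series stands in for `raw_data["valid"]`, and the `"[2] No"` label is invented for the demonstration. Only `"[1] Yes"` comes from the PR text, and it is still unverified.

```python
import pandas as pd

# Label assumed in the PR description; to be confirmed against real data.
EXPECTED_VALID_LABEL = "[1] Yes"


def check_valid_labels(valid: pd.Series) -> list[str]:
    """Return the observed labels and fail loudly if the expected one is absent."""
    observed = sorted(valid.dropna().astype(str).unique())
    if EXPECTED_VALID_LABEL not in observed:
        raise ValueError(
            f"Label {EXPECTED_VALID_LABEL!r} not found; observed labels: {observed}"
        )
    return observed


# Toy stand-in for raw_data["valid"]; "[2] No" is made up for this demo.
toy_valid = pd.Series(["[1] Yes", "[2] No", "[1] Yes", None], dtype="object")
labels = check_valid_labels(toy_valid)
```

Running such a check once on the first real pipeline run would surface a label mismatch as an explicit error instead of a silently all-false `sf12_valid` column.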