ttsim-dev · hmgaudecker · May 17, 2026 · May 20, 2026 · May 22, 2026 · May 23, 2026
diff --git a/.ai-instructions b/.ai-instructions
diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
@@ -27,9 +27,9 @@ jobs:
           - '3.14'
     steps:
       - uses: actions/checkout@v6
-      - uses: prefix-dev/setup-pixi@v0.9.3
+      - uses: prefix-dev/setup-pixi@v0.9.5
         with:
-          pixi-version: v0.62.2
+          pixi-version: v0.69.0
           cache: true
           cache-write: ${{ github.event_name == 'push' && github.ref_name == 'main' }}
           frozen: true

diff --git a/.gitmodules b/.gitmodules
@@ -1,4 +1,4 @@
 [submodule ".ai-instructions"]
 	path = .ai-instructions
-	url = git@github.com:OpenSourceEconomics/ai-instructions.git
-	branch = make-submodule
+	url = https://github.com/OpenSourceEconomics/ai-instructions.git
+	branch = main
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -6,7 +6,7 @@ repos:
       - id: check-useless-excludes
       # - id: identity  # Prints all files passed to pre-commits. Debugging.
   - repo: https://github.com/tox-dev/pyproject-fmt
-    rev: v2.21.1
+    rev: v2.21.2
     hooks:
       - id: pyproject-fmt
   - repo: https://github.com/lyz-code/yamlfix
@@ -48,8 +48,12 @@ repos:
     hooks:
       - id: yamllint
         exclude: variable_to_metadata_mapping\.yaml
+  - repo: https://github.com/python-jsonschema/check-jsonschema
+    rev: 0.37.2
+    hooks:
+      - id: check-github-workflows
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.15.11
+    rev: v0.15.14
     hooks:
       - id: ruff-check
         types_or:

diff --git a/AGENTS.md b/AGENTS.md
@@ -0,0 +1,111 @@
+@.ai-instructions/profiles/tier-b-research.md
+
+# soep-preparation
+
+## Overview
+
+**soep-preparation** is a data pipeline for preparing German Socio-Economic Panel (SOEP)
+survey data. It converts raw Stata `.dta` files into typed, cleaned pandas DataFrames
+with standardized variable names, then creates a metadata catalog and a helper for
+building final merged datasets.
+
+Part of the [gettsim ecosystem](https://github.com/ttsim-dev/ttsim) — the output of this
+pipeline feeds into **gettsim** for microsimulation of the German tax and transfer
+system. Built on **pytask** for task orchestration and **pixi** for environment
+management.
+
+## Commands
+
+```bash
+# Environment setup
+pixi install
+
+# Run the full data pipeline
+pixi run pytask
+
+# Run tests
+pixi run -e py314 tests
+pixi run -e py314 tests-with-cov   # with coverage
+pixi run -e py314 tests -n 7       # parallel with xdist
+
+# Run a single test file
+pixi run -e py314 tests tests/clean_existing_variables/test_create_dummy.py
+
+# Run a single test by name
+pixi run -e py314 tests -k "test_name"
+
+# Type checking
+pixi run -e py314 ty
+
+# Quality checks (linting, formatting, codespell, etc.)
+prek run --all-files
+
+# Available environments: py314
+```
+
+## Verification Before Finishing Any Code Change
+
+Before finishing any task that modifies code, always run these three verification steps
+in order:
+
+1. `pixi run -e py314 ty` (type checker)
+1. `prek run --all-files` (quality checks: linting, formatting, yaml, etc.)
+1. `pixi run -e py314 tests -n 7` (full test suite)
+
+## Architecture
+
+### Pipeline Stages (pytask DAG)
+
+1. **convert_stata_to_pandas/** — Reads raw `.dta` files into pandas DataFrames, stored
+   in the `RAW_DATA_FILES` DataCatalog.
+1. **clean_modules/** — Per-module cleaning scripts (e.g. `pbrutto.py`, `pequiv.py`).
+   Each exposes a `clean(raw_data: pd.DataFrame) -> pd.DataFrame` function. Results
+   are stored in the `MODULES` DataCatalog.
+1. **combine_modules/** — Derives new variables from multiple cleaned modules (e.g.
+   `pequiv_pl.py` creates BMI from health variables). Each exposes a `combine(...)`
+   function.
+1. **create_metadata/** — Generates `variable_to_metadata_mapping.yaml` mapping each
+   variable to its module, dtype, and available survey years.
+
+### Key Modules
+
+- **config.py** — Global constants (`SOEP_VERSION`, `SURVEY_YEARS`, path constants),
+  DataCatalogs (`RAW_DATA_FILES`, `MODULES`), `METADATA` dict, pandas options.
+- **final_dataset.py** — `create_final_dataset(modules, variables, survey_years)` merges
+  selected variables from multiple modules via outer join on index variables.
+- **utilities/data_manipulator.py** — Core transformation functions: `object_to_int`,
+  `object_to_float`, `object_to_str_categorical`, `object_to_bool_categorical`,
+  `apply_smallest_int_dtype`, `create_dummy`, `combine_first_and_make_categorical`, etc.
+- **utilities/error_handling.py** — Validation with `fail_if_*` pattern.
+- **utilities/general.py** — File discovery helpers (`get_raw_data_file_names`,
+  `get_combine_module_names`, `load_script`).
+
+### Data Flow Pattern
+
+Cleaning scripts follow a consistent pattern:
+
+```python
+def clean(raw_data: pd.DataFrame) -> pd.DataFrame:
+    out = pd.DataFrame()
+    out["new_var"] = cleaning_function(raw_data["raw_var"])
+    return out
+```
+
+Combine scripts take the cleaned modules they depend on as arguments:
+
+```python
+def combine(module_a: pd.DataFrame, module_b: pd.DataFrame) -> pd.DataFrame: ...
+```
+
+## Code Conventions
+
+- **PyArrow-backed dtypes** throughout (`int8[pyarrow]`, `string[pyarrow]`,
+  `bool[pyarrow]`, etc.) for memory efficiency.
+- **SOEP missing data**: negative single-digit values (-1 to -8) and strings like
+  `"[-1] Missing"` are converted to `pd.NA`.
+- **Index variables**: `p_id`, `hh_id`, `hh_id_original`, `survey_year`.
+- **Ruff with `ALL` rules** enabled (see `pyproject.toml` for specific ignores).
+  Google-style docstrings.
+- **Type checker**: `ty` (not mypy).
+- No direct commits to `main` (enforced by pre-commit hook).
+- Markdown files are wrapped at 88 characters (`mdformat`).
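Note on the "SOEP missing data" convention in the AGENTS.md hunk above: it may help to see the rule as code. The following is a minimal, dependency-free sketch, not the project's implementation. The names `to_missing`, `clean_column`, and `MISSING_CODES` are hypothetical, the regular expression is an assumption generalized from the single `"[-1] Missing"` example, and `None` stands in for `pd.NA` so the snippet runs without pandas.

```python
import re

# Assumption: labelled codes start with the bracketed code, as in "[-1] Missing".
_LABELLED_CODE = re.compile(r"^\[(-\d)\]")

# Negative single-digit codes -8 .. -1, as stated in the conventions list.
MISSING_CODES = frozenset(range(-8, 0))


def to_missing(value):
    """Return None for a SOEP missing marker, otherwise the value unchanged.

    None is a stand-in for pd.NA in this sketch.
    """
    if isinstance(value, int) and value in MISSING_CODES:
        return None
    if isinstance(value, str):
        match = _LABELLED_CODE.match(value.strip())
        if match and int(match.group(1)) in MISSING_CODES:
            return None
    return value


def clean_column(raw_values):
    """Apply the missing-value rule to every entry of one raw column."""
    return [to_missing(v) for v in raw_values]


if __name__ == "__main__":
    raw = [3, -1, "[-1] Missing", "[1] Yes", -8, -9, 0]
    print(clean_column(raw))
```

Values outside the stated range, such as `-9`, pass through unchanged in this sketch. Whether the real helpers in `utilities/data_manipulator.py` treat them the same way is not stated in the diff.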
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -1,114 +1 @@
-@.ai-instructions/profiles/tier-b-research.md
-
-# CLAUDE.md
-
-This file provides guidance to Claude Code (claude.ai/code) when working with code in
-this repository.
-
-## Project Overview
-
-**soep-preparation** is a data pipeline for preparing German Socio-Economic Panel (SOEP)
-survey data. It converts raw Stata `.dta` files into typed, cleaned pandas DataFrames
-with standardized variable names, then creates a metadata catalog and a helper for
-building final merged datasets.
-
-Part of the [gettsim ecosystem](https://github.com/ttsim-dev/ttsim) — the output of this
-pipeline feeds into **gettsim** for microsimulation of the German tax and transfer
-system. Built on **pytask** for task orchestration and **pixi** for environment
-management.
-
-## Commands
-
-```bash
-# Environment setup
-pixi install
-
-# Run the full data pipeline
-pixi run pytask
-
-# Run tests
-pixi run -e py314 tests
-pixi run -e py314 tests-with-cov   # with coverage
-pixi run -e py314 tests -n 7       # parallel with xdist
-
-# Run a single test file
-pixi run -e py314 tests tests/clean_existing_variables/test_create_dummy.py
-
-# Run a single test by name
-pixi run -e py314 tests -k "test_name"
-
-# Type checking
-pixi run -e py314 ty
-
-# Quality checks (linting, formatting, codespell, etc.)
-prek run --all-files
-
-# Available environments: py314
-```
-
-## Verification Before Finishing Any Code Change
-
-Before finishing any task that modifies code, always run these three verification steps
-in order:
-
-1. `pixi run -e py314 ty` (type checker)
-1. `prek run --all-files` (quality checks: linting, formatting, yaml, etc.)
-1. `pixi run -e py314 tests -n 7` (full test suite)
-
-## Architecture
-
-### Pipeline Stages (pytask DAG)
-
-1. **convert_stata_to_pandas/** — Reads raw `.dta` files into pandas DataFrames, stored
-   in `RAW_DATA_FILES` DataCatalog.
-1. **clean_modules/** — Per-module cleaning scripts (e.g. `pbrutto.py`, `pequiv.py`).
-   Each exposes a `clean(raw_data: pd.DataFrame) -> pd.DataFrame` function. Results
-   stored in `MODULES` DataCatalog.
-1. **combine_modules/** — Derives new variables from multiple cleaned modules (e.g.
-   `pequiv_pl.py` creates BMI from health variables). Each exposes a `combine(...)`
-   function.
-1. **create_metadata/** — Generates `variable_to_metadata_mapping.yaml` mapping each
-   variable to its module, dtype, and available survey years.
-
-### Key Modules
-
-- **config.py** — Global constants (`SOEP_VERSION`, `SURVEY_YEARS`, path constants),
-  DataCatalogs (`RAW_DATA_FILES`, `MODULES`), `METADATA` dict, pandas options.
-- **final_dataset.py** — `create_final_dataset(modules, variables, survey_years)` merges
-  selected variables from multiple modules via outer join on index variables.
-- **utilities/data_manipulator.py** — Core transformation functions: `object_to_int`,
-  `object_to_float`, `object_to_str_categorical`, `object_to_bool_categorical`,
-  `apply_smallest_int_dtype`, `create_dummy`, `combine_first_and_make_categorical`, etc.
-- **utilities/error_handling.py** — Validation with `fail_if_*` pattern.
-- **utilities/general.py** — File discovery helpers (`get_raw_data_file_names`,
-  `get_combine_module_names`, `load_script`).
-
-### Data Flow Pattern
-
-Cleaning scripts follow a consistent pattern:
-
-```python
-def clean(raw_data: pd.DataFrame) -> pd.DataFrame:
-    out = pd.DataFrame()
-    out["new_var"] = cleaning_function(raw_data["raw_var"])
-    return out
-```
-
-Combined modules follow:
-
-```python
-def combine(module_a: pd.DataFrame, module_b: pd.DataFrame) -> pd.DataFrame: ...
-```
-
-## Code Conventions
-
-- **PyArrow-backed dtypes** throughout (`int8[pyarrow]`, `string[pyarrow]`,
-  `bool[pyarrow]`, etc.) for memory efficiency.
-- **SOEP missing data**: negative single-digit values (-1 to -8) and strings like
-  `"[-1] Missing"` are converted to `pd.NA`.
-- **Index variables**: `p_id`, `hh_id`, `hh_id_original`, `survey_year`.
-- **Ruff with `ALL` rules** enabled (see `pyproject.toml` for specific ignores).
-  Google-style docstrings.
-- **Type checker**: `ty` (not mypy).
-- No direct commits to `main` (enforced by pre-commit hook).
-- Markdown files are wrapped at 88 characters (`mdformat`).
+@AGENTS.md
diff --git a/GEMINI.md b/GEMINI.md
@@ -0,0 +1 @@
+@AGENTS.md
+2 −2		.pre-commit-config.yaml
+239 −1		AGENTS.md
+1 −3		CLAUDE.md
+17 −8		README.md
+57 −35		boilerplate/README.md
+127 −17		commands/boilerplate-update.md
+4 −0		commands/verify-standards.md
+380 −0		modules/beartype.md
+117 −0		modules/dags.md
+13 −0		modules/jax.md
+2 −0		modules/plotting.md
+22 −0		modules/project-structure.md
+1 −1		profiles/tier-a.md
+0 −43		repos.md