Summary
uv run torchrun ... resolves to ~/.local/bin/torchrun when a prior pip install --user torch exists, and the spawned worker processes inherit the system Python from that binary's shebang rather than the project's .venv interpreter. The workers cannot import kempnerforge, so every rank crashes with ModuleNotFoundError before the training loop starts.
Repro
Environment:
- Fresh clone with uv sync completed; uv run python -c "import kempnerforge, torch" succeeds.
- ~/.local/bin/torchrun present from an earlier pip install --user (common on shared HPC accounts).
- which torchrun → ~/.local/bin/torchrun.
Command:
uv run torchrun --standalone --nproc_per_node=4 scripts/train.py \
configs/train/hf_wikitext.toml [overrides...]
Result on every rank:
File "/n/home10/<user>/.local/bin/torchrun", line 8, in <module>
...
ModuleNotFoundError: No module named 'kempnerforge'
exitcode: 1 (pid: ...) of binary: /n/sw/Miniforge3-25.3.1-0/bin/python3.12
The launched binary is the Miniforge base Python, not .venv/bin/python3, even though uv run was used.
Root cause
~/.local/bin/torchrun is resolved ahead of .venv/bin/torchrun under uv run, and its shebang targets the system Python. Even when the launcher itself runs, the workers it spawns inherit the launcher's Python interpreter, bypassing the venv. The venv is healthy — only the launcher resolution is wrong.
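One way to confirm this diagnosis on a given account is to look at which torchrun PATH resolves to and what its shebang names. The sketch below is illustrative only: read_shebang and describe_launcher are hypothetical helper names, not part of kempnerforge, and the sys.prefix comparison is the usual standard-library heuristic for "am I inside a venv", assumed here to hold for uv-created environments.

```python
import shutil
import sys


def read_shebang(script_path):
    """Return the command on the script's shebang line, or None if there is none."""
    with open(script_path, "rb") as fh:
        first_line = fh.readline().decode("utf-8", errors="replace").strip()
    if not first_line.startswith("#!"):
        return None
    return first_line[2:].strip()


def describe_launcher(name="torchrun"):
    """Report which launcher PATH resolves to and which interpreter it names."""
    resolved = shutil.which(name)  # first match on PATH, like `which`
    shebang = read_shebang(resolved) if resolved else None
    return {
        "launcher": resolved,
        "launcher_shebang": shebang,
        "current_interpreter": sys.executable,
        # True when this interpreter belongs to a virtual environment.
        "in_venv": sys.prefix != sys.base_prefix,
    }


if __name__ == "__main__":
    for key, value in describe_launcher().items():
        print(f"{key}: {value}")
```

Saved under a hypothetical name such as check_launcher.py and run with uv run python check_launcher.py, a launcher_shebang that differs from current_interpreter would be consistent with the failure described here.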
Workaround
Replace uv run torchrun with uv run python -m torch.distributed.run. The semantics are equivalent, but this form runs both the launcher and the workers under the venv's Python:
uv run python -m torch.distributed.run --standalone --nproc_per_node=4 \
scripts/train.py configs/train/hf_wikitext.toml [overrides...]
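The same pinning can be expressed from a Python wrapper by building the command around sys.executable instead of a torchrun binary found on PATH. This is a minimal sketch under stated assumptions: build_launch_command is a hypothetical helper, not part of kempnerforge, and it relies on torch.distributed.run starting workers with the interpreter that launched it, which matches the behavior described above but is worth verifying against the installed PyTorch version.

```python
import shlex
import sys


def build_launch_command(script, script_args, nproc_per_node=4):
    """Build a torch.distributed.run command pinned to the current interpreter.

    Hypothetical helper for illustration; not part of kempnerforge.
    """
    return [
        sys.executable,  # the interpreter running this wrapper
        "-m",
        "torch.distributed.run",
        "--standalone",
        f"--nproc_per_node={nproc_per_node}",
        script,
        *script_args,
    ]


if __name__ == "__main__":
    cmd = build_launch_command(
        "scripts/train.py", ["configs/train/hf_wikitext.toml"]
    )
    # Print rather than execute, so the sketch is safe to run anywhere.
    print(shlex.join(cmd))
```

When the wrapper itself is started with uv run python, sys.executable is the venv's interpreter, so no PATH lookup for torchrun takes place.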
Affected files
- scripts/slurm/singlenode.sh:67 uses uv run torchrun and will hit this failure for any user whose ~/.local/bin/torchrun is resolved ahead of the venv.
- docs/getting-started/quickstart.md: the multi-GPU step in the quickstart shows uv run torchrun.
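To check whether anything beyond the two files listed above uses the same invocation, a small scan of the repository may help. The sketch below is an assumption-laden illustration: find_torchrun_calls is a hypothetical helper, and the set of file suffixes and skipped directories is a guess at what is worth checking in this repository.

```python
import re
from pathlib import Path

# The launcher invocation to look for, tolerant of extra whitespace.
PATTERN = re.compile(r"\buv\s+run\s+torchrun\b")
SUFFIXES = {".sh", ".md"}       # assumed: shell scripts and docs
SKIP_DIRS = {".git", ".venv"}   # assumed: directories not worth scanning


def find_torchrun_calls(root):
    """Return (relative_path, line_number, line) for each match under root."""
    root = Path(root)
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in SUFFIXES:
            continue
        relative = path.relative_to(root)
        if SKIP_DIRS.intersection(relative.parts):
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if PATTERN.search(line):
                hits.append((str(relative), number, line.strip()))
    return hits
```

Calling find_torchrun_calls(".") from the repository root returns one tuple per matching line, which can then be compared against the list above.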