
uv run torchrun resolves to user-site torchrun and bypasses the venv #68

@Naeemkh


Summary

uv run torchrun ... resolves to ~/.local/bin/torchrun when a prior pip install --user torch exists, and the spawned worker processes inherit the system Python from that binary's shebang rather than the project's .venv interpreter. The workers cannot import kempnerforge, so every rank crashes with ModuleNotFoundError before the training loop starts.

Repro

Environment:

  • Fresh clone with uv sync completed; uv run python -c "import kempnerforge, torch" succeeds.
  • ~/.local/bin/torchrun present from an earlier pip install --user (common on shared HPC accounts).
  • which torchrun prints ~/.local/bin/torchrun.

Command:

uv run torchrun --standalone --nproc_per_node=4 scripts/train.py \
  configs/train/hf_wikitext.toml [overrides...]

Result on every rank:

File "/n/home10/<user>/.local/bin/torchrun", line 8, in <module>
...
ModuleNotFoundError: No module named 'kempnerforge'
exitcode: 1 (pid: ...) of binary: /n/sw/Miniforge3-25.3.1-0/bin/python3.12

The launched binary is the Miniforge base Python, not .venv/bin/python3, even though uv run was used.

Root cause

~/.local/bin/torchrun is resolved ahead of .venv/bin/torchrun under uv run, and its shebang targets the system Python. The launcher itself starts, but the workers it spawns use the launcher's own Python interpreter, so they bypass the venv. The venv is healthy; only the launcher resolution is wrong.
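To check this on a given account, a small diagnostic along the following lines may help. It is a minimal sketch, not part of the repository: the helper names read_shebang and diagnose_launcher are hypothetical, and it assumes the launcher is a plain script whose first line is a shebang. Some installers write a wrapper with a different first line, in which case the reported interpreter would not be the real one.

```python
import shutil
from typing import Optional, Tuple


def read_shebang(path: str) -> Optional[str]:
    """Return the first line of a script without the leading '#!', or None."""
    with open(path, "rb") as fh:
        first = fh.readline()
    if not first.startswith(b"#!"):
        return None
    return first[2:].decode("utf-8", errors="replace").strip()


def diagnose_launcher(
    name: str = "torchrun", search_path: Optional[str] = None
) -> Tuple[Optional[str], Optional[str]]:
    """Resolve `name` on the search path and report the interpreter it names.

    With search_path=None, shutil.which uses the current PATH, so running
    this under `uv run python` shows what that environment would pick up.
    """
    resolved = shutil.which(name, path=search_path)
    if resolved is None:
        return None, None
    return resolved, read_shebang(resolved)


if __name__ == "__main__":
    launcher, interpreter = diagnose_launcher()
    print(f"torchrun resolves to: {launcher}")
    print(f"shebang interpreter:  {interpreter}")
```

Saved under a hypothetical name such as diagnose_torchrun.py and started with uv run python diagnose_torchrun.py, it prints the launcher path and the interpreter named in its shebang. If the path is under ~/.local/bin, the account is likely affected.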

Workaround

Replace uv run torchrun with uv run python -m torch.distributed.run. The semantics are equivalent, but this form forces the venv's Python end-to-end (both launcher and workers):

uv run python -m torch.distributed.run --standalone --nproc_per_node=4 \
  scripts/train.py configs/train/hf_wikitext.toml [overrides...]
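As an extra safeguard, the training entry point could fail fast when it is started by an interpreter outside the project venv, instead of crashing later on the import. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, the venv is assumed to live at .venv under the project root (as uv sync creates it by default), and nothing like this exists in scripts/train.py as far as this issue shows.

```python
import sys
from pathlib import Path


def running_inside_venv(venv_dir: Path, prefix: str = sys.prefix) -> bool:
    """True if the interpreter prefix is exactly the given venv directory.

    Inside a virtual environment, sys.prefix points at the venv directory,
    so a mismatch means some other Python started this process.
    """
    return Path(prefix).resolve() == Path(venv_dir).resolve()


def assert_project_venv(project_root: Path) -> None:
    """Raise a readable error if this process is not using <root>/.venv."""
    venv_dir = project_root / ".venv"  # assumed default location
    if not running_inside_venv(venv_dir):
        raise RuntimeError(
            f"Running under {sys.executable}, not the project venv at "
            f"{venv_dir}. Launch with: "
            "uv run python -m torch.distributed.run ..."
        )
```

Called near the top of the entry point with the repository root, this would turn the per-rank ModuleNotFoundError into a single message that names the wrong interpreter and the suggested launch command.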

Affected files

  • scripts/slurm/singlenode.sh:67 — uses uv run torchrun; will hit this failure for any user with ~/.local/bin/torchrun ahead of the venv.
  • docs/getting-started/quickstart.md — multi-GPU step in the quickstart shows uv run torchrun.
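Other occurrences of the launcher call can be located with a short scan. This is a sketch only: the function names are hypothetical, it matches the literal string uv run torchrun, and it would miss invocations split across lines or built from variables.

```python
from typing import List, Tuple

OLD = "uv run torchrun"
NEW = "uv run python -m torch.distributed.run"


def find_launcher_calls(text: str) -> List[Tuple[int, str]]:
    """Return (1-based line number, line) for each line containing OLD."""
    return [
        (lineno, line)
        for lineno, line in enumerate(text.splitlines(), start=1)
        if OLD in line
    ]


def rewrite_launcher_calls(text: str) -> str:
    """Replace every literal OLD invocation with the module form."""
    return text.replace(OLD, NEW)
```

Applying find_launcher_calls to the contents of each script and doc page gives the lines to review; rewrite_launcher_calls shows what the workaround form would look like for each of them.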
