Summary
When using ./primus-cli slurm, manually setting MASTER_ADDR can cause the job to hang if the address does not match the first node in SLURM_NODELIST. This occurs because the rank mapping relies on the Slurm node allocation.
Details
The current script respects an existing MASTER_ADDR environment variable:
    if [[ -z "${MASTER_ADDR:-}" ]]; then
If a user (or a previous script) sets MASTER_ADDR incorrectly (e.g., to a node not in the current allocation or not the head node), the training hangs indefinitely.
Suggested Fix
- Remap NODE_RANK so that a user-supplied MASTER_ADDR is handled correctly.
- Fail with a fatal error if MASTER_ADDR is set and does not match SLURM_NODELIST[0], to improve UX/DX.
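Both suggestions could be sketched roughly as follows. This is only an illustration, not the actual Primus code: the function names check_master_addr and remap_node_rank are hypothetical, and the comparison is a plain string match, so it assumes MASTER_ADDR uses the same hostname form as the Slurm node list (not an IP address or FQDN). Because SLURM_NODELIST is a compressed host list rather than a Bash array, "SLURM_NODELIST[0]" is taken here to mean the first hostname after expansion.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of primus-cli).
# Succeeds if MASTER_ADDR is unset/empty or equals the first allocated node.
# $1: first node of the allocation.
check_master_addr() {
    local first_node="$1"
    local master_addr="${MASTER_ADDR:-}"

    # Nothing set by the user: the caller is free to pick the default.
    if [[ -z "$master_addr" ]]; then
        return 0
    fi

    if [[ "$master_addr" != "$first_node" ]]; then
        echo "FATAL: MASTER_ADDR=$master_addr does not match the first node ($first_node)" >&2
        return 1
    fi
    return 0
}

# Hypothetical helper (not part of primus-cli).
# Prints a node rank such that the master node gets rank 0 and all other
# nodes keep their relative order. Fails if the master or the current node
# is not in the allocation.
# $1: master node, $2: current node, remaining args: all allocated nodes.
remap_node_rank() {
    local master="$1" this_node="$2"
    shift 2
    local node found=0

    # Refuse to remap onto a master that is not in the allocation.
    for node in "$@"; do
        if [[ "$node" == "$master" ]]; then
            found=1
        fi
    done
    if [[ "$found" -eq 0 ]]; then
        echo "FATAL: MASTER_ADDR=$master is not in the allocation" >&2
        return 1
    fi

    if [[ "$this_node" == "$master" ]]; then
        echo 0
        return 0
    fi

    # Number the remaining nodes 1, 2, ... in allocation order.
    local rank=1
    for node in "$@"; do
        if [[ "$node" == "$master" ]]; then
            continue
        fi
        if [[ "$node" == "$this_node" ]]; then
            echo "$rank"
            return 0
        fi
        rank=$((rank + 1))
    done

    echo "FATAL: $this_node is not in the allocation" >&2
    return 1
}
```

Inside a job, the first node could be obtained with scontrol show hostnames "$SLURM_NODELIST" | head -n1 and passed to check_master_addr. The fatal-error variant is the simpler of the two, since it leaves the existing rank mapping untouched.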
Code reference: Primus/runner/primus-cli-slurm-entry.sh, line 123 in 02ca70d