Skip to content

fix: propagate error from ncclCudaContextTrack#2249

Open
EylonKrause wants to merge 1 commit into
NVIDIA:masterfrom
EylonKrause:fix/strongstream-track-error
Open

fix: propagate error from ncclCudaContextTrack#2249
EylonKrause wants to merge 1 commit into
NVIDIA:masterfrom
EylonKrause:fix/strongstream-track-error

Conversation

@EylonKrause

Copy link
Copy Markdown

Description

ncclCudaContextTrack (src/misc/strongstream.cc) captures the result of ncclStrongStreamConstruct via NCCLCHECKGOTO(ncclStrongStreamConstruct(&p->launchOrder), result, leave), but the leave: label hardcodes return ncclSuccess. So when stream/event creation fails (e.g. CUDA OOM or an invalid context), the error is swallowed: the caller receives ncclSuccess together with a half-initialized ncclCudaContext (refCount=1, launchOrder never constructed) that has already been linked into the global context list, and which is later passed to ncclStrongStreamDestruct/...Acquire operating on a stream that was never created.

result is initialized to ncclSuccess and is written only by that single NCCLCHECKGOTO, so returning it equals ncclSuccess on every path except the genuine failure path — where it now correctly propagates the error.

Related Issues

None.

Changes & Impact

  • src/misc/strongstream.cc: ncclCudaContextTrack returns result instead of a hardcoded ncclSuccess. No change on success paths; on construct failure the error now propagates so the caller fails init instead of proceeding with a corrupt context. Non-breaking.

Performance Impact

None — error-path correctness only.

Testing

  • Builds clean with make src.build (no new warnings).
  • Verified by inspection: result is ncclSuccess on every success path (initialized so, and only ever written by the one NCCLCHECKGOTO), so the observable behavior changes only on the failure path, where the previously-swallowed error is now returned.

Possible follow-up

This makes the failure observable. A fuller change could also unlink and free the half-built node from the global list on failure, but that involves partial-construct resource cleanup inside ncclStrongStreamConstruct, so it's left separate to keep this fix minimal and low-risk.

ncclCudaContextTrack captured ncclStrongStreamConstruct's result via NCCLCHECKGOTO but the leave: label hardcoded return ncclSuccess, so a failed stream/event creation was reported as success with a half-initialized context already linked into the global list. Return result so the error propagates.

Signed-off-by: EylonKrause <eylon1909@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant