Skip to content

register: restore CUDA device on deregister early exits#2215

Open
Jonathan-0256 wants to merge 1 commit into
NVIDIA:masterfrom
Jonathan-0256:fix-deregister-device-restore
Open

register: restore CUDA device on deregister early exits#2215
Jonathan-0256 wants to merge 1 commit into
NVIDIA:masterfrom
Jonathan-0256:fix-deregister-device-restore

Conversation

@Jonathan-0256

Copy link
Copy Markdown

Summary

commDeregister() saves the caller's current CUDA device and switches to
comm->cudaDev before looking up and releasing the registration handle.

Two early return paths skipped restoring the saved device:

  • invalid deregistration handle
  • registration still referenced by local/graph users

As a result, ncclCommDeregister() could leave the caller's thread with
comm->cudaDev as the current CUDA device.

This change restores the saved CUDA device on all paths after switching
devices. It also fixes the CommCheck API name used by the deregister paths.

Testing

  • Ran git diff --check

Copilot AI review requested due to automatic review settings June 5, 2026 03:18

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR improves commDeregister to return meaningful error codes (instead of always succeeding) and corrects the CommCheck call-site name based on whether the deregister path is graph-based or not.

Changes:

  • Replace unconditional ncclSuccess returns with an explicit ret propagated through a shared exit path.
  • Use NCCLCHECKGOTO for regCleanup and add a restore: label to centralize device restoration.
  • Fix CommCheck API name string for graph vs non-graph deregistration.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/register/register.cc
Comment on lines +199 to +202
restore:
CUDACHECK(cudaSetDevice(saveDev));
exit:
return ncclSuccess;
return ret;
commDeregister() saves the caller's current CUDA device and switches to comm->cudaDev before looking up and releasing the registration handle.

Two early return paths skipped restoring the saved device: invalid handles and registrations that are still referenced. This could leave callers of ncclCommDeregister() with a different current CUDA device after the API returned.

Restore the saved CUDA device on all paths after switching devices, and fix the CommCheck API name used by the deregister paths.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Jonathan-0256 <1695830673@qq.com>
@Jonathan-0256 Jonathan-0256 force-pushed the fix-deregister-device-restore branch from 34dcbca to 1a9a40e Compare June 5, 2026 03:25
@xiaofanl-nvidia

Copy link
Copy Markdown
Collaborator

++ @KaimingOuyang to review and mirror. CC @AddyLaddy

@KaimingOuyang

Copy link
Copy Markdown
Contributor

/mirror

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants