register: restore CUDA device on deregister early exits#2215
Open
Jonathan-0256 wants to merge 1 commit into
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR improves commDeregister to return meaningful error codes (instead of always succeeding) and corrects the CommCheck call-site name based on whether the deregister path is graph-based or not.
Changes:
- Replace unconditional `ncclSuccess` returns with an explicit `ret` propagated through a shared exit path.
- Use `NCCLCHECKGOTO` for `regCleanup` and add a `restore:` label to centralize device restoration.
- Fix the `CommCheck` API name string for graph vs non-graph deregistration.
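The shared exit path described in the list above can be sketched in isolation. This is a minimal illustration, not the NCCL source: `mockGetDevice`, `mockSetDevice`, `deregisterSketch`, `Handle`, and the `Result` enum are hypothetical stand-ins for the CUDA runtime calls and NCCL types, and treating a still-referenced handle as success is an assumption made for the example.

```cpp
// Hypothetical stand-ins for the CUDA runtime's current-device state.
static int g_currentDev = 0;
static void mockGetDevice(int* dev) { *dev = g_currentDev; }
static void mockSetDevice(int dev) { g_currentDev = dev; }

// Hypothetical result codes, loosely modelled on an ncclResult_t-style enum.
enum Result { kSuccess = 0, kInvalidArgument = 1 };

// Hypothetical registration handle with a simple reference count.
struct Handle { int refCount; };

// Sketch: every path taken after the device switch leaves through `restore`,
// and the function returns `ret` instead of an unconditional success code.
static Result deregisterSketch(int commDev, Handle* handle) {
  Result ret = kSuccess;
  int saveDev = 0;
  mockGetDevice(&saveDev);   // remember the caller's device
  mockSetDevice(commDev);    // switch to the communicator's device

  if (handle == nullptr) {   // early exit 1: invalid handle
    ret = kInvalidArgument;
    goto restore;
  }
  if (handle->refCount > 1) {  // early exit 2: still referenced (assumed success)
    handle->refCount--;
    goto restore;
  }
  handle->refCount = 0;      // stands in for the real cleanup work

restore:
  mockSetDevice(saveDev);    // runs on all three paths
  return ret;
}
```

With both early exits routed through `restore`, the caller's device is the same before and after the call regardless of which branch was taken.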
Comment on lines +199 to +202
    restore:
      CUDACHECK(cudaSetDevice(saveDev));
    exit:
    -  return ncclSuccess;
    +  return ret;
commDeregister() saves the caller's current CUDA device and switches to comm->cudaDev before looking up and releasing the registration handle. Two early return paths skipped restoring the saved device: invalid handles and registrations that are still referenced. This could leave callers of ncclCommDeregister() with a different current CUDA device after the API returned.

Restore the saved CUDA device on all paths after switching devices, and fix the CommCheck API name used by the deregister paths.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Jonathan-0256 <1695830673@qq.com>
34dcbca to 1a9a40e
Collaborator:
++ @KaimingOuyang to review and mirror. CC @AddyLaddy

Contributor:
/mirror
Summary

`commDeregister()` saves the caller's current CUDA device and switches to `comm->cudaDev` before looking up and releasing the registration handle.

Two early return paths skipped restoring the saved device:
- invalid handles
- registrations that are still referenced

As a result, `ncclCommDeregister()` could leave the caller's thread with `comm->cudaDev` as the current CUDA device.

This change restores the saved CUDA device on all paths after switching devices. It also fixes the `CommCheck` API name used by the deregister paths.

Testing

`git diff --check`
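
`git diff --check` reports whitespace errors and leftover conflict markers, so it does not exercise the restored-device behaviour itself. A behavioural check might compare the current device before and after the call on each path. The sketch below shows that idea against a mocked device state; `g_currentDev`, `deviceUnchangedAfter`, `leakyEarlyExit`, and `restoringEarlyExit` are hypothetical names for illustration, and a real test would call `cudaGetDevice` and `ncclCommDeregister` instead.

```cpp
#include <functional>

// Hypothetical stand-in for the CUDA runtime's per-thread current device.
static int g_currentDev = 0;

// Returns true when `call` leaves the current device as it found it.
static bool deviceUnchangedAfter(const std::function<void()>& call) {
  const int before = g_currentDev;
  call();
  return g_currentDev == before;
}

// Hypothetical pre-fix callee: switches device, then returns early
// without restoring the caller's device.
static void leakyEarlyExit(int commDev) {
  g_currentDev = commDev;
}

// Hypothetical post-fix callee: restores the saved device before returning.
static void restoringEarlyExit(int commDev) {
  const int saveDev = g_currentDev;
  g_currentDev = commDev;
  g_currentDev = saveDev;
}
```

A check of this shape fails for the leaky variant whenever the communicator's device differs from the caller's, which is the situation the PR description identifies.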