fix: null comm->groupJob after abort to avoid use-after-free#2251
Open
EylonKrause wants to merge 1 commit into
Open
fix: null comm->groupJob after abort to avoid use-after-free#2251EylonKrause wants to merge 1 commit into
EylonKrause wants to merge 1 commit into
Conversation
ncclCommEnsureReady's abort branch calls ncclGroupJobAbort(comm->groupJob), which decrements the job's refcount and deletes it when it reaches zero, but unlike the success path in ncclCommGetAsyncError it leaves comm->groupJob pointing at the freed object. A later abort on the same comm (e.g. ncclCommRevoke followed by ncclCommDestroy, both of which route through ncclCommEnsureReady's abort branch) then re-reads the dangling pointer, a heap use-after-free. Null comm->groupJob after aborting it, mirroring the success path which already sets comm->groupJob = NULL after ncclGroupJobComplete. Confirmed with AddressSanitizer (ASAN=1) on a single rank with blocking=0: ncclCommInitRankConfig -> ncclCommRevoke -> ncclCommDestroy reports heap-use-after-free in ncclGroupJobAbort before the fix and is clean after. Signed-off-by: EylonKrause <eylon1909@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
ncclCommEnsureReady's abort branch (src/init.cc) callsncclGroupJobAbort(comm->groupJob), which decrements the group job's refcount anddeletes it when the count reaches zero — but, unlike the success path inncclCommGetAsyncError(which setscomm->groupJob = NULLafterncclGroupJobComplete), it leavescomm->groupJobpointing at the freed object.A second abort on the same communicator then re-reads that dangling pointer. This occurs on a normal, supported sequence —
ncclCommRevokefollowed byncclCommDestroy, both of which route throughncclCommEnsureReady's abort branch: revoke frees the job, destroy re-enters the branch on the danglingcomm->groupJob→ heap use-after-free.Related Issues
None.
Changes & Impact
src/init.cc: setcomm->groupJob = NULLimmediately afterncclGroupJobAbort(comm->groupJob), restoring the same invariant the success path already enforces. One line; no behavior change on any path that does not take the abort branch. Non-breaking.Performance Impact
None.
Testing
Reproduced and verified with AddressSanitizer (
make src.build ASAN=1), single rank,blocking=0, sequencencclCommInitRankConfig→ncclCommRevoke→ncclCommDestroy:Before (use-after-free):
After: no use-after-free. Builds clean (
ASAN=1and normal).