Skip to content

fix: null comm->groupJob after abort to avoid use-after-free#2251

Open
EylonKrause wants to merge 1 commit into
NVIDIA:masterfrom
EylonKrause:fix/groupjob-uaf
Open

fix: null comm->groupJob after abort to avoid use-after-free#2251
EylonKrause wants to merge 1 commit into
NVIDIA:masterfrom
EylonKrause:fix/groupjob-uaf

Conversation

@EylonKrause

Copy link
Copy Markdown

Description

ncclCommEnsureReady's abort branch (src/init.cc) calls ncclGroupJobAbort(comm->groupJob), which decrements the group job's refcount and deletes it when the count reaches zero — but, unlike the success path in ncclCommGetAsyncError (which sets comm->groupJob = NULL after ncclGroupJobComplete), it leaves comm->groupJob pointing at the freed object.

A second abort on the same communicator then re-reads that dangling pointer. This occurs on a normal, supported sequence — ncclCommRevoke followed by ncclCommDestroy, both of which route through ncclCommEnsureReady's abort branch: revoke frees the job, destroy re-enters the branch on the dangling comm->groupJobheap use-after-free.

Related Issues

None.

Changes & Impact

  • src/init.cc: set comm->groupJob = NULL immediately after ncclGroupJobAbort(comm->groupJob), restoring the same invariant the success path already enforces. One line; no behavior change on any path that does not take the abort branch. Non-breaking.

Performance Impact

None.

Testing

Reproduced and verified with AddressSanitizer (make src.build ASAN=1), single rank, blocking=0, sequence ncclCommInitRankConfigncclCommRevokencclCommDestroy:

Before (use-after-free):

ERROR: AddressSanitizer: heap-use-after-free
READ of size 1 ... in ncclGroupJobAbort(ncclGroupJob*)   src/group.cc:903
  #1 ncclCommEnsureReady(ncclComm*)                      src/init.cc:426
  #2 ncclCommDestroy                                     src/init.cc:2901
freed by thread T0 here:
  #1 ncclGroupJobAbort(ncclGroupJob*)                    src/group.cc:909
  #2 ncclCommEnsureReady(ncclComm*)                      src/init.cc:426   (during ncclCommRevoke)
SUMMARY: AddressSanitizer: heap-use-after-free src/group.cc:903 in ncclGroupJobAbort

After: no use-after-free. Builds clean (ASAN=1 and normal).

ncclCommEnsureReady's abort branch calls ncclGroupJobAbort(comm->groupJob), which decrements the job's refcount and deletes it when it reaches zero, but unlike the success path in ncclCommGetAsyncError it leaves comm->groupJob pointing at the freed object. A later abort on the same comm (e.g. ncclCommRevoke followed by ncclCommDestroy, both of which route through ncclCommEnsureReady's abort branch) then re-reads the dangling pointer, a heap use-after-free. Null comm->groupJob after aborting it, mirroring the success path which already sets comm->groupJob = NULL after ncclGroupJobComplete. Confirmed with AddressSanitizer (ASAN=1) on a single rank with blocking=0: ncclCommInitRankConfig -> ncclCommRevoke -> ncclCommDestroy reports heap-use-after-free in ncclGroupJobAbort before the fix and is clean after.

Signed-off-by: EylonKrause <eylon1909@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant