
Avoid RMA proxy when cuMem host is unavailable #2253

Open
jm99k56 wants to merge 1 commit into NVIDIA:master from jm99k56:guard-rma-proxy-cumem-host


jm99k56 commented Jun 25, 2026


Guard CPU-accessible allocations with GDRCopy or cuMem host support.

Description

RMA proxy control buffers are allocated through allocMemCPUAccessible().
That helper uses GDRCopy when available; otherwise, it falls back to
ncclCuMemHostAlloc().

When cuMem host allocations fail the runtime self-test, the SHM path falls
back to /dev/shm, but the RMA proxy path could still call
ncclCuMemHostAlloc() and later fail in cuMemCreate(HOST_NUMA).

This PR adds a shared CPU-accessible memory capability check and uses it to
avoid enabling multi-LSA RMA proxy when neither GDRCopy nor cuMem host
allocations are available. It also adds a defensive guard in
allocMemCPUAccessible().
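The capability check and the gating it drives can be sketched roughly as follows. This is an illustration only, not the code in this PR: the struct, its fields, and both function names are hypothetical stand-ins, and the real ncclCpuAccessibleMemSupported() may take different inputs.

```cpp
// Illustrative sketch only. All names here are hypothetical, not NCCL's.
struct CpuAccessibleBackends {
  bool gdrCopyAvailable;    // assumption: GDRCopy was set up successfully
  bool cuMemHostSupported;  // assumption: cuMem host allocation passed the runtime self-test
};

// CPU-accessible control memory is usable if at least one backend works.
inline bool cpuAccessibleMemSupported(const CpuAccessibleBackends& b) {
  return b.gdrCopyAvailable || b.cuMemHostSupported;
}

// Multi-LSA RMA proxy is reported as supported only when its control
// buffers can actually be allocated. `otherRequirementsMet` stands in for
// whatever other conditions the real code checks.
inline bool multiLsaRmaProxySupported(bool otherRequirementsMet,
                                      const CpuAccessibleBackends& b) {
  return otherRequirementsMet && cpuAccessibleMemSupported(b);
}
```

With both backends unavailable, multiLsaRmaProxySupported(true, {false, false}) evaluates to false, so the proxy is never enabled and the allocation path is never reached.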

Related Issues

none

Changes & Impact

  • Add ncclCpuAccessibleMemSupported() for GDRCopy/cuMem-host-backed
    CPU-accessible control memory.
  • Disable multi-LSA RMA proxy support when CPU-accessible control memory is
    unavailable.
  • Prevent allocMemCPUAccessible() from falling through to
    ncclCuMemHostAlloc() after cuMem host support has been disabled.
  • No public API changes.
  • No ABI changes.

In unsupported environments, host RMA proxy is now reported as unsupported
instead of failing later with a CUDA invalid argument error.
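The defensive guard in the allocation helper could look roughly like the sketch below. It is a self-contained mock under stated assumptions: the result enum, the backend struct, and the malloc-based stub allocators are all hypothetical and only stand in for GDRCopy and ncclCuMemHostAlloc(); the actual error code and control flow in allocMemCPUAccessible() may differ.

```cpp
#include <cstddef>
#include <cstdlib>

// Illustrative sketch only. All names here are hypothetical, not NCCL's.
enum class AllocResult { Success, Unsupported, SystemError };

struct ControlMemBackends {
  bool gdrCopyAvailable;    // assumption: GDRCopy was set up successfully
  bool cuMemHostSupported;  // assumption: cuMem host self-test passed
};

// Stub standing in for a GDRCopy-backed allocation.
inline AllocResult gdrAllocStub(void** ptr, std::size_t bytes) {
  *ptr = std::malloc(bytes);
  return *ptr ? AllocResult::Success : AllocResult::SystemError;
}

// Stub standing in for a cuMem host allocation.
inline AllocResult cuMemHostAllocStub(void** ptr, std::size_t bytes) {
  *ptr = std::malloc(bytes);
  return *ptr ? AllocResult::Success : AllocResult::SystemError;
}

inline AllocResult allocCpuAccessibleSketch(const ControlMemBackends& b,
                                            void** ptr, std::size_t bytes) {
  *ptr = nullptr;
  if (b.gdrCopyAvailable) return gdrAllocStub(ptr, bytes);
  // Defensive guard: do not fall through to the cuMem host path once it
  // has been marked unsupported; fail early with a clear result instead.
  if (!b.cuMemHostSupported) return AllocResult::Unsupported;
  return cuMemHostAllocStub(ptr, bytes);
}
```

The point of the guard is that a caller sees an explicit "unsupported" result at allocation time rather than an opaque driver error from a later call.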

Performance Impact

No performance impact is expected for supported paths.

Existing behavior is preserved when either GDRCopy or cuMem host allocations
are available. The change only gates unsupported RMA proxy configurations and
adds a defensive allocation check.

Guard CPU-accessible allocations with GDRCopy or cuMem host support.

Signed-off-by: Lucas Wong <lucaswongchn@gmail.com>