Skip to content

Conversation

@aingerson
Copy link

In OFI, the FI_HMEM_DEVICE_ONLY registration flag signals to the provider that the memory is only on the device and is not unified memory (which can migrate between the GPU and host). IPC is only usable with device only memory and is not valid for unified memory. Without this flag, providers cannot provide optimizations like IPC. Set the flag if the address was found to be non-unified memory. This enables IPC copies in OFI.

This also includes an indentation fix within the same function

@aingerson
Copy link
Author

@angainor

@hppritcha hppritcha requested a review from bwbarrett December 18, 2025 17:52
Copy link
Member

@bwbarrett bwbarrett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a similar patch in the ofi btl?

bwbarrett
bwbarrett previously approved these changes Dec 18, 2025
@aingerson aingerson changed the title mtl/ofi: set device only flag btl,mtl/ofi: set device only flag Dec 18, 2025
}
#if OPAL_OFI_HAVE_FI_HMEM_DEVICE_ONLY
mr_flags = flags & MCA_ACCELERATOR_FLAGS_UNIFIED_MEMORY ? 0 :
FI_HMEM_DEVICE_ONLY

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aingerson you missed a ; after FI_HMEM_DEVICE_ONLY here...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@angainor Eek! Thank you! I'm concerned the CI didn't fail this... @bwbarrett Do you happen to know why this path is never built in the CI?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The primary reason is because none of the OMPI devs approved the workflows (since you're not part of the OMPI org).

The secondary reason is that it looks like something has broken in CI and OMPI is no longer building with Libfabric support in the CI. I'll try to get this fixed next week while I'm OOTO. We haven't been great at running with latest Libfabric, so I should fix that, too.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah gotcha. Let me know if I can support that at all. Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kicked off a PR CI for this PR in our internal jenkins which should test with Libfabric ... I will report back for the latest push

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest push of this PR passed AWS CI that runs with Libfabric built

}
#if OPAL_OFI_HAVE_FI_HMEM_DEVICE_ONLY
mr_flags = flags & MCA_ACCELERATOR_FLAGS_UNIFIED_MEMORY ? 0 :
FI_HMEM_DEVICE_ONLY

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aingerson and here :)

@angainor
Copy link

Otherwise I tested the mtl on our GH200 system, works as it should.

In OFI, the FI_HMEM_DEVICE_ONLY registration flag signals to the
provider that the memory is only on the device and is not unified
memory (which can migrate between the GPU and host). IPC is only
usable with device only memory and is not valid for unified memory.
Without this flag, providers cannot provide optimizations like IPC.
Set the flag if the address was found to be non-unified memory.
This enables IPC copies in OFI.

The flag is available starting in v1.13.0 so this adds a configure
check to make sure we only use it if available.

This also includes an indentation fix and typos within the scope of
the patch

Signed-off-by: Zach Dworkin <[email protected]>
Signed-off-by: Alexia Ingerson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants