-
Notifications
You must be signed in to change notification settings - Fork 936
btl,mtl/ofi: set device only flag #13582
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
bwbarrett
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a similar patch in the ofi btl?
ompi/mca/mtl/ofi/mtl_ofi_mr.c
Outdated
| } | ||
| #if OPAL_OFI_HAVE_FI_HMEM_DEVICE_ONLY | ||
| mr_flags = flags & MCA_ACCELERATOR_FLAGS_UNIFIED_MEMORY ? 0 : | ||
| FI_HMEM_DEVICE_ONLY |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aingerson you missed a ; after FI_HMEM_DEVICE_ONLY here...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@angainor Eek! Thank you! I'm concerned the CI didn't fail this... @bwbarrett Do you happen to know why this path is never built in the CI?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The primary reason is because none of the OMPI devs approved the workflows (since you're not part of the OMPI org).
The secondary reason is that it looks like something has broken in CI and OMPI is no longer building with Libfabric support in the CI. I'll try to get this fixed next week while I'm OOTO. We haven't been great at running with latest Libfabric, so I should fix that, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah gotcha. Let me know if I can support that at all. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kicked off a PR CI for this PR in our internal jenkins which should test with Libfabric ... I will report back for the latest push
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The latest push of this PR passed AWS CI that runs with Libfabric built
opal/mca/btl/ofi/btl_ofi_module.c
Outdated
| } | ||
| #if OPAL_OFI_HAVE_FI_HMEM_DEVICE_ONLY | ||
| mr_flags = flags & MCA_ACCELERATOR_FLAGS_UNIFIED_MEMORY ? 0 : | ||
| FI_HMEM_DEVICE_ONLY |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aingerson and here :)
|
Otherwise I tested the |
In OFI, the FI_HMEM_DEVICE_ONLY registration flag signals to the provider that the memory is only on the device and is not unified memory (which can migrate between the GPU and host). IPC is only usable with device only memory and is not valid for unified memory. Without this flag, providers cannot provide optimizations like IPC. Set the flag if the address was found to be non-unified memory. This enables IPC copies in OFI. The flag is available starting in v1.13.0 so this adds a configure check to make sure we only use it if available. This also includes an indentation fix and typos within the scope of the patch Signed-off-by: Zach Dworkin <[email protected]> Signed-off-by: Alexia Ingerson <[email protected]>
In OFI, the FI_HMEM_DEVICE_ONLY registration flag signals to the provider that the memory is only on the device and is not unified memory (which can migrate between the GPU and host). IPC is only usable with device only memory and is not valid for unified memory. Without this flag, providers cannot provide optimizations like IPC. Set the flag if the address was found to be non-unified memory. This enables IPC copies in OFI.
This also includes an indentation fix within the same function