fix gin proxy load balance problem when nic is a lag port #2230

Open
YeSho-cpp wants to merge 1 commit into
NVIDIA:v2.30u1 from
YeSho-cpp:dev

Conversation

@YeSho-cpp

Description

Fix a QP load-balancing problem in the GIN proxy when using RoCE LAG (mode 4, 802.3ad) with the queue affinity policy.

Related Issues

#2136

Changes & Impact

Performance Impact

Allows GIN proxy QPs to make full use of both uplink ports of a bonded IB device.
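
To make concrete what "load balance" means here, the following is a toy sketch, not the PR's actual change. It assumes a dual-uplink bond and two purely hypothetical port-assignment policies (`assign_round_robin` and `assign_with_stride` are illustrative names, not NCCL or driver functions). It shows how a QP-to-port mapping can leave one uplink idle, and how an even mapping spreads QPs across both.

```python
from collections import Counter

NUM_PORTS = 2  # dual-uplink bond, as in the PR description


def port_counts(qp_ports):
    """Count how many QPs land on each physical port (ports are 1-based)."""
    counts = Counter(qp_ports)
    return [counts.get(p, 0) for p in range(1, NUM_PORTS + 1)]


def assign_round_robin(num_qps):
    """Hypothetical balanced policy: QP i goes to port (i % NUM_PORTS) + 1."""
    return [(i % NUM_PORTS) + 1 for i in range(num_qps)]


def assign_with_stride(num_qps, stride):
    """Hypothetical skewed case: QP i uses slot i * stride of a round-robin
    allocator, so an even stride maps every QP to the same port."""
    return [((i * stride) % NUM_PORTS) + 1 for i in range(num_qps)]


balanced = port_counts(assign_round_robin(8))
skewed = port_counts(assign_with_stride(8, stride=2))

print("balanced:", balanced)  # [4, 4]
print("skewed:  ", skewed)    # [8, 0]
```

In the skewed case all eight QPs share one uplink, so at most half of the bond's bandwidth is usable. The stated goal of this PR is to reach the balanced distribution.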

@YeSho-cpp YeSho-cpp force-pushed the dev branch 4 times, most recently from 9f9b8e4 to 8342cdb June 9, 2026 12:46
…y gin proxy

Signed-off-by: chenjiahui9 <chenjiahui9@xiaomi.com>
@YeSho-cpp YeSho-cpp changed the title fix load balance problem when nic is a lag port with queue … fix gin proxy load balance problem when nic is a lag port Jun 10, 2026
@xiaofanl-nvidia

Collaborator

@YeSho-cpp Did the recent rankStride change regress this case, and does your PR fix that regression?

++ @sjeaugey, @jynv to mirror if the fix is reasonable.

@xiaofanl-nvidia xiaofanl-nvidia requested review from jynv and sjeaugey June 23, 2026 00:39
@YeSho-cpp

Author

> @YeSho-cpp Did the recent rankStride change regress this case, and does your PR fix that regression?
>
> ++ @sjeaugey, @jynv to mirror if the fix is reasonable.

These are two separate issues. The recent rankStride fix addresses connectionType in ncclDevCommCreate not taking effect in the GIN proxy (#2137). The previous PR (a93cc03) fixed the issue in GDAKI mode, while this new PR resolves the load-balancing problem in proxy mode.

2 participants