Skip to content

init: do not reject a config that sets only minCTAs or only maxCTAs#2256

Open
EylonKrause wants to merge 1 commit into
NVIDIA:masterfrom
EylonKrause:fix/commconfig-min-only-cta
Open

init: do not reject a config that sets only minCTAs or only maxCTAs#2256
EylonKrause wants to merge 1 commit into
NVIDIA:masterfrom
EylonKrause:fix/commconfig-min-only-cta

Conversation

@EylonKrause

Copy link
Copy Markdown

Description

parseCommConfig validates the minCTAs/maxCTAs config fields with three clauses. The first two guard the unset sentinel NCCL_CONFIG_UNDEF_INT before their <= 0 checks, but the third clause — internalConfigPtr->minCTAs > internalConfigPtr->maxCTAs — does not. Unset config fields are NCCL_CONFIG_UNDEF_INT (INT_MIN), so a config that sets only minCTAs (leaving maxCTAs unset) makes minCTAs > INT_MIN true, and ncclCommInitRankConfig / ncclCommSplit reject an otherwise-valid configuration with ncclInvalidArgument. minCTAs and maxCTAs are documented as independently settable, so setting just one should be allowed.

Related Issues

None.

Changes & Impact

  • src/init.cc (parseCommConfig): guard the minCTAs > maxCTAs comparison with the same NCCL_CONFIG_UNDEF_INT checks the two sibling clauses already use, so it only fires when both bounds are actually set. A genuinely inverted pair (e.g. minCTAs=16, maxCTAs=4) is still rejected. Non-breaking.

Performance Impact

None.

Testing

  • Builds clean with make src.build.
  • Single-GPU before/after with ncclCommInitRankConfig (1 rank), config.minCTAs=4, maxCTAs left unset:
    • Before: init.cc NCCL WARN Invalid config min/max channels attribute value 4/-2147483648 → returns ncclInvalidArgument (-2147483648 is the unset maxCTAs sentinel).
    • After: init returns ncclSuccess.

The min/maxCTAs validation in parseCommConfig guards the <=0 checks with
NCCL_CONFIG_UNDEF_INT but the minCTAs>maxCTAs check does not. Since unset
fields are NCCL_CONFIG_UNDEF_INT (INT_MIN), setting only minCTAs (maxCTAs
left unset) makes minCTAs>maxCTAs (>INT_MIN) true, so a valid config is
rejected with ncclInvalidArgument. Guard the comparison with the same
UNDEF checks used by the sibling clauses.

Signed-off-by: EylonKrause <eylon1909@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant