Skip to content

feature: ncp and nnp conditions#2628

Merged
dahalperin merged 1 commit into
Mellanox:masterfrom
dahalperin:master
Jun 8, 2026
Merged

feature: ncp and nnp conditions#2628
dahalperin merged 1 commit into
Mellanox:masterfrom
dahalperin:master

Conversation

@dahalperin

Copy link
Copy Markdown
Collaborator

No description provided.

@copy-pr-bot

copy-pr-bot Bot commented Jun 2, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@dahalperin dahalperin requested review from almaslennikov, e0ne, heyvister1 and rollandf and removed request for almaslennikov, e0ne and rollandf June 2, 2026 07:08
@greptile-apps

greptile-apps Bot commented Jun 2, 2026

Copy link
Copy Markdown

Greptile Summary

This PR adds status.conditions arrays to both NicClusterPolicy and NicNodePolicy, implementing the standard Kubernetes conditions pattern for both CRDs. A new computePolicyConditions helper derives per-component and aggregate Ready conditions from reconcile results, preserving lastTransitionTime across reconcile cycles via apimeta.SetStatusCondition.

  • API change: Conditions []metav1.Condition is added to both status structs with proper SSA list-map markers; ConditionHolder interface, per-component condition-type constants, and StateNameToConditionType mapping are introduced in a new file.
  • Controller wiring: updatePolicyCRStatus type-asserts cr to ConditionHolder and calls computePolicyConditions when the assertion succeeds, keeping the change backward-compatible with any CR that doesn't implement the interface.
  • Condition lifecycle: stale conditions (states absent from the current result snapshot) are pruned; SyncStateIgnore components are fully omitted from status.conditions; error takes priority over not-ready in the aggregate Ready condition.

Confidence Score: 5/5

This PR is safe to merge; it adds a new, additive status field and keeps all existing status fields intact.

The conditions implementation follows the standard Kubernetes apimeta.SetStatusCondition pattern correctly, stale-condition pruning is exercised by dedicated tests, and the change is purely additive — no existing status fields or reconcile paths are modified. The ConditionHolder type assertion in updatePolicyCRStatus is guarded and backward-compatible.

No files require special attention; the core logic in controllers/conditions.go is well-tested and straightforward.

Important Files Changed

Filename Overview
controllers/conditions.go New file implementing computePolicyConditions; logic for SyncStateIgnore removal, stale-condition pruning, and aggregate Ready derivation is correct and well-commented.
controllers/conditions_test.go Good unit-test coverage for the main happy and error paths; SyncStateReset (default branch) has no dedicated test case.
api/v1alpha1/nicclusterpolicyconditions.go New file defining condition-type constants, reason constants, StateNameToConditionType map, and ConditionHolder interface; the comment about NicNodePolicy states (previously flagged) is the only concern.
controllers/nic_policy_helpers.go Wires computePolicyConditions into updatePolicyCRStatus via a safe ConditionHolder type-assertion; no logic issues.
api/v1alpha1/nicclusterpolicy_types.go Adds Conditions field with correct SSA/patch markers and implements GetConditions/SetConditions for ConditionHolder.
api/v1alpha1/nicnodepolicy_types.go Mirrors NicClusterPolicy changes; Conditions field and ConditionHolder implementation look correct.
api/v1alpha1/zz_generated.deepcopy.go Auto-generated deep-copy for Conditions slice using DeepCopyInto per element; correct.
config/crd/bases/mellanox.com_nicclusterpolicies.yaml CRD schema updated with standard metav1.Condition fields and correct x-kubernetes-list-type/map-keys markers.
config/crd/bases/mellanox.com_nicnodepolicies.yaml Identical conditions schema addition as NicClusterPolicy CRD; looks correct.
deployment/network-operator/crds/mellanox.com_nicclusterpolicies.yaml Deployment CRD kept in sync with config/crd/bases counterpart; changes are identical.
deployment/network-operator/crds/mellanox.com_nicnodepolicies.yaml Deployment CRD kept in sync with config/crd/bases counterpart; changes are identical.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[updatePolicyCRStatus] --> B[SetAppliedStates and SetPolicyState]
    B --> C{cr implements ConditionHolder?}
    C -->|No| G[Status Update]
    C -->|Yes| D[computePolicyConditions]
    D --> E[Iterate StatesStatus]
    E --> F{Sync Status}
    F -->|SyncStateReady| R[condition True / ComponentReady]
    F -->|SyncStateIgnore| S[RemoveStatusCondition + mark seen]
    F -->|SyncStateError| T[condition False / ComponentError, hasError=true]
    F -->|default NotReady or Reset| U[condition False / ComponentNotReady, hasNotReady=true]
    R --> V[SetStatusCondition]
    T --> V
    U --> V
    S --> W[Prune unseen mapped conditions]
    V --> W
    W --> X{hasError?}
    X -->|Yes| Y[Ready=False / ComponentError]
    X -->|No| Z{hasNotReady?}
    Z -->|Yes| AA[Ready=False / ComponentsNotReady]
    Z -->|No| BB[Ready=True / AllComponentsReady]
    Y --> CC[SetStatusCondition Ready]
    AA --> CC
    BB --> CC
    CC --> DD[SetConditions on CR]
    DD --> G
Loading

Reviews (2): Last reviewed commit: "feature: ncp and nnp conditions" | Re-trigger Greptile

Comment on lines +63 to +66
// StateNameToConditionType maps the Name() of a pkg/state.State to the
// corresponding NicClusterPolicy condition type constant.
// States absent from this map (e.g. states used only by NicNodePolicy) produce
// no condition entry.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The comment is factually incorrect. Looking at pkg/state/factory.go, newNicNodePolicyStates returns only {ofedState, sharedDpState, sriovDpState} — all three of those state names ("state-OFED", "state-RDMA-device-plugin", "state-SRIOV-device-plugin") are already present in this map. There are no NicNodePolicy states that are absent from the map, so the parenthetical example misleads future readers into thinking NicNodePolicy has its own unmapped states.

Suggested change
// StateNameToConditionType maps the Name() of a pkg/state.State to the
// corresponding NicClusterPolicy condition type constant.
// States absent from this map (e.g. states used only by NicNodePolicy) produce
// no condition entry.
// StateNameToConditionType maps the Name() of a pkg/state.State to the
// corresponding condition type constant used by both NicClusterPolicy and NicNodePolicy.
// States absent from this map produce no condition entry and do not affect the
// aggregate Ready condition.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dahalperin Please fix comment as suggested

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread controllers/conditions.go
Comment on lines +103 to +116
switch {
case hasError:
ready.Status = metav1.ConditionFalse
ready.Reason = mellanoxv1alpha1.ConditionReasonComponentError
ready.Message = "One or more components are in error state"
case hasNotReady:
ready.Status = metav1.ConditionFalse
ready.Reason = mellanoxv1alpha1.ConditionReasonComponentsNotReady
ready.Message = "One or more components are not yet ready"
default:
ready.Status = metav1.ConditionTrue
ready.Reason = mellanoxv1alpha1.ConditionReasonAllComponentsReady
ready.Message = ""
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Aggregate Ready=True when every component is in SyncStateIgnore

When a policy has no configured components (all states return SyncStateIgnore), neither hasError nor hasNotReady is set, so the Ready condition is emitted as True / AllComponentsReady. A freshly created NicNodePolicy or NicClusterPolicy with an empty spec would immediately surface Ready=True even though nothing has been deployed. The test TestComputePolicyConditionsIgnoredComponentsOmitted codifies this choice, but it is worth confirming it is intentional — an Unknown status for the initial/empty case is a common Kubernetes pattern and may be less surprising to consumers of this condition.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentional. Ready=True means all configured components are ready

Signed-off-by: Dana Halperin <dahalperin@nvidia.com>
@dahalperin

Copy link
Copy Markdown
Collaborator Author

/retest-nic_operator_kind

1 similar comment
@dahalperin

Copy link
Copy Markdown
Collaborator Author

/retest-nic_operator_kind

@dahalperin dahalperin merged commit cdcc754 into Mellanox:master Jun 8, 2026
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants