Skip to content

APIGOV-31191 validate cache#1015

Open
alrosca wants to merge 4 commits intomainfrom
APIGOV-31191
Open

APIGOV-31191 validate cache#1015
alrosca wants to merge 4 commits intomainfrom
APIGOV-31191

Conversation

@alrosca
Copy link
Collaborator

@alrosca alrosca commented Mar 10, 2026

Cache validation on start up and reconnect

jcollins-axway
jcollins-axway previously approved these changes Mar 11, 2026
Copy link
Collaborator

@jcollins-axway jcollins-axway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good from a high level, needs pounded testing wise

@jcollins-axway
Copy link
Collaborator

copilots review

Code Review: APIGOV-31191 vs main

Summary

  • Scope: reviewed 8 files (314 insertions, 4 deletions), focused on new cache validation flow, reconnect hooks, and cache manager interfaces.
  • Verdict: request changes
  • Severity: medium (2), low (0), high (0), blocker (0)

Findings

  • Medium — Cache validation uses non-unique key (name) and can collide across scoped resources

  • Medium — Cache validator hard-codes API version to v1alpha1, which can skip/incorrectly query some filters

    • Where: pkg/agent/cachevalidationjob.go, pkg/agent/cachevalidationjob.go
    • Description: validation URL is built from a synthetic ResourceInstance with fixed APIVersion: "v1alpha1". For filters whose resource version differs, GetKindLink() may produce the wrong endpoint or empty link, and validation is skipped (return true).
    • Possible impacts to the codebase: out-of-sync persisted cache may pass validation and remain in use after startup/reconnect for affected resource kinds.

Positives

  • Cache validation logic is cleanly encapsulated (cacheValidator) and integrated into both startup and reconnect paths.
  • Reconnect hooks were added consistently to poll and stream clients with minimal API surface change (WithOnReconnect options).
  • Logging around validation failures includes useful operational context (resource counts, resource names, timestamps).

Recommendations

  • Use a stable unique key for comparisons (for example selfLink or a composite of group/kind/scope/name) instead of name alone.
  • Derive API version from filter/resource metadata or from server-discovered GVK mappings rather than hard-coding "v1alpha1".
  • Add targeted unit tests for: duplicate names across scopes, non-v1alpha1 resources, and reconnect-triggered validation behavior.

@alrosca alrosca closed this Mar 12, 2026
@alrosca alrosca reopened this Mar 12, 2026
@alrosca
Copy link
Collaborator Author

alrosca commented Mar 12, 2026

Not a bad review at all

// deleting state with success status - should NOT be cached
err := handler.Handle(NewEventContext(proto.Event_CREATED, nil, ri.Kind, ri.Name), nil, ri)
assert.Nil(t, err)
assert.Equal(t, []string{}, cm.GetAccessRequestCacheKeys())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as we spoke of we prefer test cases over single tests

jcollins-axway
jcollins-axway previously approved these changes Mar 12, 2026
jcollins-axway
jcollins-axway previously approved these changes Mar 13, 2026
@jcollins-axway
Copy link
Collaborator

caching changes require an extra amount of testing as its usually to blame for our duplicate service issues

@sbolosan
Copy link
Collaborator

Give me time to look at this closer

return false
}

cachedResources := cv.cacheMan.GetCachedResourcesByKind(filter.Group, filter.Kind)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we talked about this scope mismatch in sync. The server query being used is scoped to filter.Scope.Name because of GetKindLink. So its returning resources for that one envionrment,right?

but GetCachedResourcesByKind is returning every cached resource for this kind across all scopes. Different scopes - if len(serverMap) != len(cachedResources) { will not work because the cache side will be bigger than the server side.


// validateAndRebuildCache validates the cache and rebuilds it if out of sync.
// Called when connection to Engage is restored.
func (es *EventSync) validateAndRebuildCache() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this is being called by onReconnect() while the event listener is already live. PublisingLock doesn't look to be guarding any event consumption. So Flush and SetSequence could possibly race with the listener writing to cache, no? And maybe messing up the sequence. I ran into this with my mulesoft stuff. Like the listener should be paused or wait before rebuilding. I despise race conditions. ugh

}

func (es *EventSync) RebuildCache() {
// SDB - NOTE : Do we need to pause jobs.
Copy link
Collaborator

@sbolosan sbolosan Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dang! @vivekschauhan @jcollins-axway I put this note in a very long time ago ;).
I was looking at another race condition down stream, but decided to see who called RebuildCache().

So RebuildCache is reachable by WithOnClientStop (poller) and With EventSyncError (stream) while the event listener go routine is alive, well technically alive.

In those paths the race is meh, maybe not a big deal because event flow has stopped like a harvester error or when the stream is not yet producing. But PublishingLock still doesn't coordinate with the listener. Maybe we should take a look at this again?

agentStatus := newDAStatus(agentInstance.ResourceMeta, status, prevStatus, message)

// See if we need to rebuildCache
timeToRebuild, _ := a.shouldRebuildCache()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we eat the error?
At least log the error/warn and then if requirement is to rebuild, then rebuild?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe something like

    timeToRebuild, err := a.shouldRebuildCache()
    if err != nil {
        a.logger.WithError(err).Warn("unable to determine cache rebuild state, triggering rebuild")
        timeToRebuild = true
    }
    if timeToRebuild && a.rebuildCache != nil {
        a.rebuildCache.RebuildCache()
    }

if value != nil {
logger := a.logger.WithField("cacheUpdateTime", value)
// get current cacheUpdateTime from x-agent-details
convToTimestamp, err := strconv.ParseInt(value.(string), 10, 64)
Copy link
Collaborator

@sbolosan sbolosan Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is old and not part of your changes, but can we do a safe guard here

strVal, ok := value.(string)
            if !ok {
                logger.Warn("cacheUpdateTime is not a string, triggering rebuild")
                return true, nil
            }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants