APIGOV-31191 validate cache by alrosca · Pull Request #1015 · Axway/agent-sdk

alrosca · 2026-03-10T12:10:10Z

Cache validation on start up and reconnect

jcollins-axway

Looks good at a high level, but it needs to be pounded on testing-wise

jcollins-axway · 2026-03-11T23:43:02Z

Copilot's review

Code Review: APIGOV-31191 vs main

Summary

Scope: reviewed 8 files (314 insertions, 4 deletions), focused on new cache validation flow, reconnect hooks, and cache manager interfaces.
Verdict: request changes
Severity: medium (2), low (0), high (0), blocker (0)

Findings

Medium — Cache validation uses non-unique key (name) and can collide across scoped resources
- Where: pkg/agent/cache/cachevalidation.go, pkg/agent/cachevalidationjob.go
- Description: both server and cache summaries are keyed only by resource.Name. If multiple resources share the same name under different scopes, entries overwrite each other, making the comparison unreliable.
- Possible impacts to the codebase: false negatives (stale cache undetected) or false positives (unnecessary full cache rebuilds), especially in multi-scope environments.
Medium — Cache validator hard-codes API version to v1alpha1, which can skip/incorrectly query some filters
- Where: pkg/agent/cachevalidationjob.go
- Description: validation URL is built from a synthetic ResourceInstance with fixed APIVersion: "v1alpha1". For filters whose resource version differs, GetKindLink() may produce the wrong endpoint or empty link, and validation is skipped (return true).
- Possible impacts to the codebase: out-of-sync persisted cache may pass validation and remain in use after startup/reconnect for affected resource kinds.

Positives

Cache validation logic is cleanly encapsulated (cacheValidator) and integrated into both startup and reconnect paths.
Reconnect hooks were added consistently to poll and stream clients with minimal API surface change (WithOnReconnect options).
Logging around validation failures includes useful operational context (resource counts, resource names, timestamps).

Recommendations

Use a stable unique key for comparisons (for example selfLink or a composite of group/kind/scope/name) instead of name alone.
Derive API version from filter/resource metadata or from server-discovered GVK mappings rather than hard-coding "v1alpha1".
Add targeted unit tests for: duplicate names across scopes, non-v1alpha1 resources, and reconnect-triggered validation behavior.
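The first recommendation can be illustrated with a minimal sketch. It assumes a trimmed-down, hypothetical resourceRef type and helper names (compositeKey, summarize) that are not part of the SDK; the real ResourceInstance fields and the exact key format the validator should use may differ.

```go
package main

import "fmt"

// resourceRef is a hypothetical stand-in for the fields a cache summary
// would need. It is NOT the SDK's ResourceInstance type.
type resourceRef struct {
	Group     string
	Kind      string
	ScopeKind string
	ScopeName string
	Name      string
}

// compositeKey builds a key that stays unique when two resources share a
// name under different scopes.
func compositeKey(r resourceRef) string {
	return fmt.Sprintf("%s/%s/%s/%s/%s", r.Group, r.Kind, r.ScopeKind, r.ScopeName, r.Name)
}

// summarize indexes resources by composite key instead of by name alone,
// so same-named resources in different scopes do not overwrite each other.
func summarize(resources []resourceRef) map[string]resourceRef {
	out := make(map[string]resourceRef, len(resources))
	for _, r := range resources {
		out[compositeKey(r)] = r
	}
	return out
}

func main() {
	resources := []resourceRef{
		{Group: "management", Kind: "APIService", ScopeKind: "Environment", ScopeName: "env-a", Name: "petstore"},
		{Group: "management", Kind: "APIService", ScopeKind: "Environment", ScopeName: "env-b", Name: "petstore"},
	}
	// Keyed by name alone these two would collapse into one entry;
	// keyed by the composite both survive.
	fmt.Println(len(summarize(resources)))
}
```

If the server already returns a selfLink for every resource, using that value directly as the key would avoid building the composite by hand.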

alrosca · 2026-03-12T11:30:47Z

Not a bad review at all

jcollins-axway · 2026-03-12T20:13:35Z

pkg/agent/handler/discoveryaccessrequest_test.go

+	// deleting state with success status - should NOT be cached
+	err := handler.Handle(NewEventContext(proto.Event_CREATED, nil, ri.Kind, ri.Name), nil, ri)
+	assert.Nil(t, err)
+	assert.Equal(t, []string{}, cm.GetAccessRequestCacheKeys())


As we discussed, we prefer test cases over single tests

jcollins-axway · 2026-03-13T13:52:06Z

Caching changes require extra testing, as caching is usually to blame for our duplicate service issues

sbolosan · 2026-03-25T02:36:24Z

Give me time to look at this closer

sbolosan · 2026-03-25T21:52:08Z

pkg/agent/cachevalidationjob.go

+		return false
+	}
+
+	cachedResources := cv.cacheMan.GetCachedResourcesByKind(filter.Group, filter.Kind)


I think we talked about this scope mismatch in sync. The server query being used is scoped to filter.Scope.Name because of GetKindLink, so it's returning resources for that one environment, right?

But GetCachedResourcesByKind returns every cached resource of this kind across all scopes. With different scopes, the check if len(serverMap) != len(cachedResources) { will not work, because the cache side will be larger than the server side.
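One possible way to line the two sides up is to narrow the cached set to the filter's scope before comparing counts. The sketch below uses hypothetical types and helpers (cachedResource, filterByScope, countsMatch) rather than the SDK's cache manager API, and it assumes each cached resource carries its scope name.

```go
package main

import "fmt"

// cachedResource is a hypothetical, simplified view of a cached entry.
type cachedResource struct {
	Name      string
	ScopeName string
}

// filterByScope keeps only the cached resources that live under scopeName,
// mirroring the scope the server query is restricted to.
func filterByScope(cached []cachedResource, scopeName string) []cachedResource {
	out := make([]cachedResource, 0, len(cached))
	for _, r := range cached {
		if r.ScopeName == scopeName {
			out = append(out, r)
		}
	}
	return out
}

// countsMatch compares the server-side count with the cached count for the
// same scope only, so resources from other scopes cannot skew the result.
func countsMatch(serverNames map[string]struct{}, cached []cachedResource, scopeName string) bool {
	return len(serverNames) == len(filterByScope(cached, scopeName))
}

func main() {
	server := map[string]struct{}{"petstore": {}}
	cached := []cachedResource{
		{Name: "petstore", ScopeName: "env-a"},
		{Name: "petstore", ScopeName: "env-b"}, // other scope, must be ignored
	}
	fmt.Println(countsMatch(server, cached, "env-a"))
}
```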

sbolosan · 2026-03-25T22:02:27Z

pkg/agent/eventsync.go

+
+// validateAndRebuildCache validates the cache and rebuilds it if out of sync.
+// Called when connection to Engage is restored.
+func (es *EventSync) validateAndRebuildCache() {


So this is being called by onReconnect() while the event listener is already live. PublishingLock doesn't appear to guard any event consumption, so Flush and SetSequence could race with the listener writing to the cache, no? That could also mess up the sequence. I ran into this with my MuleSoft work. The listener should be paused, or the rebuild should wait for it, before rebuilding. I despise race conditions. Ugh.

sbolosan · 2026-03-25T22:11:21Z

pkg/agent/eventsync.go

 }

 func (es *EventSync) RebuildCache() {
 	// SDB - NOTE : Do we need to pause jobs.


Dang! @vivekschauhan @jcollins-axway I put this note in a very long time ago ;).
I was looking at another race condition downstream, but decided to see who calls RebuildCache().

So RebuildCache is reachable via WithOnClientStop (poller) and WithEventSyncError (stream) while the event listener goroutine is alive, well, technically alive.

In those paths the race is meh, maybe not a big deal, because event flow has stopped (for example after a harvester error) or the stream is not yet producing. But PublishingLock still doesn't coordinate with the listener. Maybe we should take a look at this again?
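As a starting point for that discussion, here is a minimal sketch of one way to serialize a rebuild against event handling. Everything in it is hypothetical (gatedCache, applyEvent, rebuild); it does not use the SDK's PublishingLock or cache manager, and it assumes event sequence IDs increase monotonically so that events older than the rebuilt snapshot can be dropped.

```go
package main

import (
	"fmt"
	"sync"
)

// gatedCache is a hypothetical cache whose event writes and rebuilds share
// one mutex, so a rebuild can never interleave with a listener write.
type gatedCache struct {
	mu       sync.Mutex
	items    map[string]string
	sequence int64
}

func newGatedCache() *gatedCache {
	return &gatedCache{items: map[string]string{}}
}

// applyEvent is what the listener would call per event. Events at or below
// the current sequence are dropped, so a stale event that arrives after a
// rebuild cannot overwrite fresher data. It reports whether it was applied.
func (c *gatedCache) applyEvent(seq int64, key, value string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if seq <= c.sequence {
		return false
	}
	c.items[key] = value
	c.sequence = seq
	return true
}

// rebuild flushes the cache, loads the snapshot, and sets the sequence as
// one atomic step with respect to applyEvent.
func (c *gatedCache) rebuild(snapshot map[string]string, seq int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items = make(map[string]string, len(snapshot))
	for k, v := range snapshot {
		c.items[k] = v
	}
	c.sequence = seq
}

// get reads a single entry under the same lock.
func (c *gatedCache) get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.items[key]
	return v, ok
}

func main() {
	c := newGatedCache()
	var wg sync.WaitGroup
	// Simulate a live listener delivering older events during a rebuild.
	for i := int64(1); i <= 50; i++ {
		wg.Add(1)
		go func(seq int64) {
			defer wg.Done()
			c.applyEvent(seq, fmt.Sprintf("svc-%d", seq), "from-event")
		}(i)
	}
	c.rebuild(map[string]string{"svc-1": "from-server"}, 100)
	wg.Wait()
	v, _ := c.get("svc-1")
	fmt.Println(v)
}
```

Whether a single lock is acceptable depends on how long a real rebuild takes; pausing the listener or buffering events during the rebuild are alternatives with different trade-offs.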

sbolosan · 2026-03-25T22:17:41Z

pkg/agent/resource/manager.go

 	agentStatus := newDAStatus(agentInstance.ResourceMeta, status, prevStatus, message)

 	// See if we need to rebuildCache
 	timeToRebuild, _ := a.shouldRebuildCache()


Should we swallow the error here?
At least log it as an error or warning, and then rebuild if that is the requirement?

Maybe something like:

timeToRebuild, err := a.shouldRebuildCache()
if err != nil {
	a.logger.WithError(err).Warn("unable to determine cache rebuild state, triggering rebuild")
	timeToRebuild = true
}
if timeToRebuild && a.rebuildCache != nil {
	a.rebuildCache.RebuildCache()
}

sbolosan · 2026-03-25T22:24:03Z

pkg/agent/resource/manager.go

 		if value != nil {
 			logger := a.logger.WithField("cacheUpdateTime", value)
 			// get current cacheUpdateTime from x-agent-details
 			convToTimestamp, err := strconv.ParseInt(value.(string), 10, 64)


I know this is old and not part of your changes, but can we add a safeguard here?

strVal, ok := value.(string)
if !ok {
	logger.Warn("cacheUpdateTime is not a string, triggering rebuild")
	return true, nil
}
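To show the effect of the safeguard in isolation, here is a small self-contained sketch. The helper name parseCacheUpdateTime is hypothetical and the sample values are made up; it only demonstrates that a checked type assertion followed by strconv.ParseInt avoids a panic on non-string input.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseCacheUpdateTime is a hypothetical helper. It returns the parsed
// timestamp and true only when value is a string holding a base-10 integer.
// Any other input returns (0, false) instead of panicking, which lets the
// caller decide to trigger a rebuild.
func parseCacheUpdateTime(value interface{}) (int64, bool) {
	strVal, ok := value.(string)
	if !ok {
		return 0, false
	}
	ts, err := strconv.ParseInt(strVal, 10, 64)
	if err != nil {
		return 0, false
	}
	return ts, true
}

func main() {
	// A valid string, a non-string number, a malformed string, and nil.
	for _, v := range []interface{}{"1711400000", 1711400000, "not-a-number", nil} {
		ts, ok := parseCacheUpdateTime(v)
		fmt.Println(ts, ok)
	}
}
```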

APIGOV-31191 validate cache

2ca1d4f

alrosca requested review from dfeldick, dgghinea, jcollins-axway, sbolosan and vivekschauhan as code owners March 10, 2026 12:10

jcollins-axway previously approved these changes Mar 11, 2026

alrosca closed this Mar 12, 2026

alrosca reopened this Mar 12, 2026

APIGOV-31191 improvements and unit tests

6ae0ee4

alrosca dismissed jcollins-axway’s stale review via 6ae0ee4 March 12, 2026 11:45

jcollins-axway reviewed Mar 12, 2026

jcollins-axway previously approved these changes Mar 12, 2026

APIGOV-31191 refactor tests

5a8e3d8

alrosca dismissed jcollins-axway’s stale review via 5a8e3d8 March 13, 2026 09:59

jcollins-axway previously approved these changes Mar 13, 2026

APIGOV-31191 validate instead of rebuild after 7 days

10c1aca

alrosca dismissed jcollins-axway’s stale review via 10c1aca March 24, 2026 13:38

sbolosan reviewed Mar 25, 2026

alrosca wants to merge 4 commits into main from APIGOV-31191

3 participants
