Summary
DataCache currently exposes isStale: Boolean via /api/health, which answers "is the cache currently fresh?" but doesn't distinguish "one transient failure 35 minutes ago" from "5 consecutive failures over the last 2.5 hours". For alerting we want the latter.
Proposed change
Add a counter to DataCache that:
- Increments on each refresh failure (initial or periodic, station or price)
- Resets to 0 on each successful refresh
- Is exposed on /api/health as a new field, e.g. consecutiveRefreshFailures: Int
- Is logged at WARN level on each failure: Refresh failed (N consecutive failures)
HealthResponse gets one new field; Routes.kt plumbs it through; DataCache.start()'s try/catch blocks increment / reset it appropriately.
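A minimal sketch of the counter semantics listed above, not the actual DataCache code. RefreshFailureTracker, track, and the warn callback are hypothetical names introduced for illustration. The real change would live inside DataCache.start()'s existing try/catch blocks and use the project's own logger. The sketch assumes a single counter shared by station and price refreshes.

```kotlin
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for the failure-tracking part of DataCache.
// `warn` is an assumed logging hook; swap in the project's real WARN logger.
class RefreshFailureTracker(
    private val warn: (String) -> Unit = { System.err.println("WARN $it") }
) {
    // AtomicInteger so the health route can read the value from another thread.
    private val failures = AtomicInteger(0)

    // Value that /api/health would expose as consecutiveRefreshFailures.
    val consecutiveRefreshFailures: Int
        get() = failures.get()

    // Runs one refresh attempt: resets the counter on success,
    // increments it and logs on failure. Returns null when the refresh failed.
    fun <T> track(refresh: () -> T): T? =
        try {
            val result = refresh()
            failures.set(0)
            result
        } catch (e: Exception) {
            val n = failures.incrementAndGet()
            warn("Refresh failed ($n consecutive failures)")
            null
        }
}
```

Wrapping each refresh call in track { ... } keeps the increment and reset logic in one place. If the refresh runs inside a coroutine, the catch block should probably rethrow CancellationException rather than count it as a failure.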
Why it matters
Pairs with #18 (UptimeRobot poller). With the counter exposed, the monitor can alert on consecutiveRefreshFailures >= 3 rather than dataLoaded: false, which:
- Avoids false positives during the periodic window. The default refresh interval is 30 min and the stale threshold is 90 min, so a single failed cycle does NOT make the cache stale. Alerting on isStale only fires after 90 min of consecutive failures. Alerting on >= 3 consecutive failures fires after ~90 min too, but with explicit semantics.
- Distinguishes blips from outages. A single 5xx that the next cycle recovers from is normal. Three in a row is an incident.
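The alert rule and its timing can be sketched as follows. HealthSnapshot, ALERT_THRESHOLD, shouldAlert, and minutesUntilAlert are hypothetical names for illustration only. The actual check would be configured in the external monitor, and the real HealthResponse may carry more fields. The timing helper assumes exactly one refresh attempt per interval, measured from the last successful refresh.

```kotlin
// Hypothetical subset of the /api/health payload: only fields named in this issue.
data class HealthSnapshot(
    val isStale: Boolean,
    val consecutiveRefreshFailures: Int
)

// Threshold taken from the proposed rule "consecutiveRefreshFailures >= 3".
const val ALERT_THRESHOLD = 3

// A blip (1 or 2 failures) does not alert; three in a row does.
fun shouldAlert(health: HealthSnapshot): Boolean =
    health.consecutiveRefreshFailures >= ALERT_THRESHOLD

// Minutes from the last successful refresh until the threshold is reached,
// assuming one failed attempt per refresh interval.
fun minutesUntilAlert(refreshIntervalMin: Int, threshold: Int = ALERT_THRESHOLD): Int =
    refreshIntervalMin * threshold
```

With the default 30 min interval, minutesUntilAlert(30) evaluates to 90, which matches the ~90 min figure above.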
Out of scope
Full metrics export (Prometheus, CloudWatch). This issue is just a counter on /api/health so external monitors can read it. Real metrics is a separate, larger workstream.
Related