Emit consecutive-refresh-failure counter on /api/health #20

@GavT

Summary

DataCache currently exposes isStale: Boolean via /api/health, which answers "is the cache currently fresh?" but doesn't distinguish "one transient failure 35 minutes ago" from "5 consecutive failures over the last 2.5 hours". For alerting we want the latter.

Proposed change

Add a counter to DataCache that:

  • Increments on each refresh failure (initial or periodic, station or price)
  • Resets to 0 on each successful refresh
  • Is exposed on /api/health as a new field, e.g. consecutiveRefreshFailures: Int
  • Is logged at WARN level on each failure: Refresh failed (N consecutive failures)

HealthResponse gets one new field; Routes.kt plumbs it through; DataCache.start()'s try/catch blocks increment / reset it appropriately.
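
A minimal sketch of what the counter could look like, to make the increment / reset semantics concrete. `RefreshFailureCounter`, `track`, and the `warn` callback are hypothetical names used for illustration; the real change would more likely live directly inside DataCache.start()'s existing try/catch blocks and use the project's logger. The `HealthResponse` shape is also assumed: only the field names `isStale` and `consecutiveRefreshFailures` come from this issue.

```kotlin
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical helper, not existing code. `warn` stands in for the
// project's logger at WARN level.
class RefreshFailureCounter(
    private val warn: (String) -> Unit = { System.err.println("WARN $it") },
) {
    private val failures = AtomicInteger(0)

    /** Number of refreshes that have failed in a row; 0 after any success. */
    val consecutiveRefreshFailures: Int
        get() = failures.get()

    /** Runs one refresh attempt and updates the counter from its outcome. */
    fun <T> track(refresh: () -> T): T? =
        try {
            val result = refresh()
            failures.set(0) // any successful refresh resets the streak
            result
        } catch (e: Exception) {
            // NOTE: if the refresh runs in a coroutine, CancellationException
            // should be rethrown rather than counted as a failure.
            val n = failures.incrementAndGet()
            warn("Refresh failed ($n consecutive failures)")
            null
        }
}

// Assumed shape; the existing HealthResponse may carry other fields.
data class HealthResponse(
    val isStale: Boolean,
    val consecutiveRefreshFailures: Int,
)
```

One thing to decide during implementation: if station and price refreshes run as separate try/catch blocks sharing a single counter, a station success would reset the counter even while price refreshes keep failing.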

Why it matters

Pairs with #18 (UptimeRobot poller). With the counter exposed, the monitor can alert on consecutiveRefreshFailures >= 3 rather than dataLoaded: false, which:

  1. Avoids false positives during the periodic window. The default refresh interval is 30 min and the stale threshold is 90 min, so a single failed cycle does NOT make the cache stale. Alerting on isStale fires only after about 90 min without a successful refresh. Alerting on >= 3 consecutive failures fires at roughly the same point (the third failed 30-min cycle lands about 90 min after the last success), but with explicit semantics.
  2. Distinguishes blips from outages. A single 5xx that the next cycle recovers from is normal. Three in a row is an incident.
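
To illustrate the blip vs. outage distinction, here is a small simulation of the counter over a sequence of refresh cycles. It is only a sketch: `consecutiveFailures` is a hypothetical helper, `true` stands for a successful cycle, and the threshold of 3 is the alert rule suggested above.

```kotlin
/** Counter value after each refresh cycle; `true` means the cycle succeeded. */
fun consecutiveFailures(outcomes: List<Boolean>): List<Int> =
    outcomes
        .runningFold(0) { streak, ok -> if (ok) 0 else streak + 1 }
        .drop(1) // drop the initial 0 that runningFold emits before any cycle

fun main() {
    val threshold = 3 // alert when consecutiveRefreshFailures >= 3

    val blip = listOf(true, false, true, true)      // one failure, then recovery
    val outage = listOf(true, false, false, false)  // three failures in a row

    println(consecutiveFailures(blip))    // [0, 1, 0, 0]
    println(consecutiveFailures(outage))  // [0, 1, 2, 3]

    println(consecutiveFailures(blip).any { it >= threshold })    // false
    println(consecutiveFailures(outage).any { it >= threshold })  // true
}
```

With the default 30 min interval, the counter in the outage case reaches 3 about 90 min after the last successful refresh, assuming each refresh attempt takes negligible time.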

Out of scope

Full metrics export (Prometheus, CloudWatch). This issue is just a counter on /api/health so external monitors can read it. Real metrics are a separate, larger workstream.

Labels: P3 (Hardening / refinement / nice-to-have; not blocking launch), enhancement (New feature or request)