Open
Description
We should introduce a circuit breaker per subgraph endpoint to protect the router from cascading failures when a subgraph service starts misbehaving (timeouts, 5xx errors, etc.).
When a subgraph's failures or timeouts exceed a configured threshold, the router should trip the circuit for that target and fast-fail requests to it for a short cooldown period, instead of queuing or retrying forever.
After the cooldown, the circuit goes half-open and sends a few probe requests; it closes again if they succeed and re-opens if they fail.
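To make the closed / open / half-open cycle above concrete, here is a rough sketch of the state machine. It is only an illustration, not a proposed implementation: the class name, parameter names, and default values are all made up, and it simplifies half-open by letting every request through as a probe rather than capping the number in flight.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    """Hypothetical per-subgraph breaker; names and defaults are illustrative."""

    def __init__(self, failure_threshold=5, cooldown_seconds=10.0,
                 probe_successes=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.cooldown_seconds = cooldown_seconds    # how long to fast-fail once open
        self.probe_successes = probe_successes      # successes needed to close again
        self.clock = clock                          # injectable for testing
        self.state = State.CLOSED
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def allow_request(self):
        """Return True if a request may be sent to the subgraph."""
        if self.state is State.OPEN:
            if self.clock() - self.opened_at >= self.cooldown_seconds:
                # Cooldown elapsed: start probing.
                self.state = State.HALF_OPEN
                self.successes = 0
                return True
            return False  # fast-fail while the circuit is open
        return True

    def record_success(self):
        if self.state is State.HALF_OPEN:
            self.successes += 1
            if self.successes >= self.probe_successes:
                self.state = State.CLOSED
                self.failures = 0
        else:
            self.failures = 0  # any success resets the consecutive-failure count

    def record_failure(self):
        if self.state is State.OPEN:
            return  # late failure from a request sent before tripping
        if self.state is State.HALF_OPEN:
            self._trip()  # a failed probe re-opens the circuit
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._trip()

    def _trip(self):
        self.state = State.OPEN
        self.opened_at = self.clock()
        self.failures = 0
```

The router would call `allow_request()` before dispatching to a subgraph and `record_success()` / `record_failure()` once the response (or timeout) comes back. Counting consecutive failures is just one possible trip condition; an error-rate window would work as well.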
This should be configurable per subgraph and integrated into the router's config system and metrics (Prometheus metrics, so users can observe the breaker's behavior).
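As a sketch of what per-subgraph configuration and observability could look like: global defaults with optional per-subgraph overrides, plus a gauge exposing each circuit's state in the Prometheus text format. The field names, the metric name `router_circuit_breaker_state`, and the numeric state encoding are all assumptions for illustration, not an existing router API.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class BreakerConfig:
    # Hypothetical settings; the real option names are up for discussion.
    failure_threshold: int = 5
    cooldown_seconds: float = 10.0
    probe_requests: int = 3


def resolve_config(defaults, overrides, subgraph):
    """Merge the per-subgraph overrides (if any) onto the global defaults."""
    return replace(defaults, **overrides.get(subgraph, {}))


# Assumed encoding of circuit state as a gauge value.
STATE_VALUES = {"closed": 0, "half_open": 1, "open": 2}


def render_state_metrics(states):
    """Render one gauge sample per subgraph in Prometheus text exposition format."""
    lines = ["# TYPE router_circuit_breaker_state gauge"]
    for subgraph, state in sorted(states.items()):
        lines.append(
            f'router_circuit_breaker_state{{subgraph="{subgraph}"}} {STATE_VALUES[state]}'
        )
    return "\n".join(lines)
```

For example, `resolve_config(BreakerConfig(), {"products": {"cooldown_seconds": 30.0}}, "products")` keeps the default threshold but uses a 30 second cooldown for the `products` subgraph only.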
Benefits:
- Prevent a single bad subgraph from tanking overall request latency
- Protect connection pools from overload
dotansimha