Currently we do:
1. Require more than X servers to report down before we open an incident.
2. To close the incident, those exact same X servers have to report up.

This causes issues when we have an outlier, such as a server with a bad network connection: if that server never reports up, the incident stays open.

What we should do:
1. Require more than X servers to report down before we open an incident.
2. Require more than X servers to report up before we close the incident.
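
To make the difference between the two closing rules concrete, here is a minimal Python sketch. The function names, the server names, and the use of sets of server IDs are all assumptions made for illustration; they are not taken from the existing code.

```python
from typing import AbstractSet


def should_open(down: AbstractSet[str], threshold: int) -> bool:
    """Open an incident when more than `threshold` (X) servers report down."""
    return len(down) > threshold


def should_close_current(opened_by: AbstractSet[str], up: AbstractSet[str]) -> bool:
    """Current rule: every server that reported down at open time must report up."""
    return opened_by <= up


def should_close_proposed(up: AbstractSet[str], threshold: int) -> bool:
    """Proposed rule: any set of more than `threshold` (X) servers reporting up is enough."""
    return len(up) > threshold


# Hypothetical scenario: X = 2, and server "c" has a bad network.
threshold = 2
opened_by = {"a", "b", "c"}   # servers that reported down when the incident opened
now_up = {"a", "b", "d"}      # "c" never manages to report up

opens = should_open(opened_by, threshold)
closes_current = should_close_current(opened_by, now_up)
closes_proposed = should_close_proposed(now_up, threshold)
```

In this scenario the incident opens under both rules, but only the proposed rule closes it: the current rule keeps waiting for "c", while the proposed rule is satisfied by three servers reporting up.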