keystore: add active health checker #61962
base: master
Conversation
The active health checker performs signing requests against CAs. Failed signing requests are reported back to a callback which will trip Teleport's readiness state.
The health checker watches CA events to ensure that the keys being tested are kept up to date.
This can be enabled by configuring:
```yaml
keystore:
  health_check:
    active:
      enabled: true
```
This config format leaves room for adding alternative health checking methods and additional parameters for the active health checker.
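For illustration, a rough sketch of how the failure callback might be wired into readiness. Only `NewActiveHealthChecker`, `ActiveHealthCheckConfig`, and `Callback` appear in this PR's tests; the package layout, readiness hook, and `Run` entrypoint are assumptions.

```go
package readiness // hypothetical wiring location, for illustration only

import (
	"context"

	"github.com/gravitational/teleport/lib/auth/keystore"
	"github.com/gravitational/trace"
)

// reporter is a hypothetical stand-in for however Teleport's readiness state
// is actually tripped; it is not part of this PR.
type reporter interface {
	ReportUnhealthy(error)
	ReportHealthy()
}

func startKeystoreHealthCheck(ctx context.Context, r reporter) error {
	hc, err := keystore.NewActiveHealthChecker(keystore.ActiveHealthCheckConfig{
		// Callback receives the outcome of a signing probe; a non-nil error
		// should trip readiness. Exact semantics are an assumption here.
		Callback: func(err error) {
			if err != nil {
				r.ReportUnhealthy(err)
				return
			}
			r.ReportHealthy()
		},
	})
	if err != nil {
		return trace.Wrap(err)
	}
	go hc.Run(ctx) // hypothetical run entrypoint
	return nil
}
```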
```go
hc, err := NewActiveHealthChecker(ActiveHealthCheckConfig{
    Callback: func(err error) {
        calls = append(calls, err)
        wg.Done()
```
Seems this may be called twice; the flaky test detector is catching a negative waitgroup counter.
You could probably try using synctest here instead of a fake clock (https://pkg.go.dev/testing/synctest). Then you could just "sleep" instead of advancing the clock, and if done right it should guarantee the health check runs before your test continues.
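For reference, a minimal sketch of the synctest approach being suggested (Go 1.25's `testing/synctest`). The ticking goroutine below is a stand-in for the real health checker, not this PR's code.

```go
package keystore_test

import (
	"testing"
	"testing/synctest" // Go 1.25+
	"time"
)

func TestActiveHealthCheckerTick(t *testing.T) {
	synctest.Test(t, func(t *testing.T) {
		var calls []error
		callback := func(err error) { calls = append(calls, err) }

		// Stand-in for the health checker's run loop.
		go func() {
			ticker := time.NewTicker(time.Minute)
			defer ticker.Stop()
			for {
				select {
				case <-t.Context().Done():
					return
				case <-ticker.C:
					callback(nil)
				}
			}
		}()

		// Inside the bubble the clock is virtual: Sleep advances it only when
		// every other goroutine is durably blocked, so exactly one tick has
		// fired before the assertions below run.
		time.Sleep(time.Minute + time.Second)
		synctest.Wait()

		if len(calls) != 1 {
			t.Fatalf("expected exactly one health check, got %d", len(calls))
		}
	})
}
```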
👍 I'll give it a shot; I haven't messed with synctest before.
Good to know this is an option, thanks. Updated in 214f551.
lib/auth/keystore/manager.go (outdated)
```go
func (m *Manager) GetTLSSigner(ctx context.Context, ca types.CertAuthority) (crypto.Signer, error) {
    _, signer, err := m.getTLSCertAndSigner(ctx, ca.GetActiveKeys())
    return signer, trace.Wrap(err)
}
```
Couldn't you just call GetTLSCertAndSigner instead of adding this method?
Yeah, I was trying to make the API more consistent, but I could also just drop the cert in the health checker.
Doesn't really matter, but if you prefer to add the GetTLSSigner method, can you just move it right next to GetTLSCertAndSigner?
Moved in 3b14f71.
lib/services/watcher_test.go (outdated)
```go
}

caService := local.NewCAService(bk)
//nolint:staticcheck // SA1019 This should be updated to use [services.NewCertAuthorityWatcher]
```
This test of the deprecated watcher should not be updated, though.
I think this is just a bad diff; the deprecated watcher test is not modified.
I think it's right that this test should exercise the deprecated watcher; I just don't think the comment should say it should be updated (I assume it's a copy/paste error).
I misunderstood. I will update it to say something along the lines of "remove the test when the deprecated method is no longer used".
Updated the lint comment in 6b773ab.
```go
select {
case <-ctx.Done():
    return ctx.Err()
case <-ticker.Chan():
}
```
Why do you need the CA watcher instead of just reading all the CAs on every loop? They should all be able to be read from the cache.
I would expect a watcher to react to changes faster than polling, and it avoids the unnecessary work of polling at an interval.
Isn't the watcher just an abstraction over the cache?
Not against changing it, just trying to understand why we would prefer polling over receiving events.
Sorry, I didn't consider where you were leaving this comment. Doing the polling within the run loop like you are suggesting wouldn't have the timing issue I was concerned about.
How would polling the cache behave vs. a watcher if the backend is unreachable/unhealthy?
Well, at first I thought you were using the watcher so that you could react faster to changes, but then I saw that it still only checks on the ticker interval here and wondered what the watcher was for. On second thought, it's actually not clear when you would poll for CAs with this model where you check just one key every tick, and what you have probably works better. I was just surprised to see a whole new CA watcher; I guess it has some API the old one didn't have?
Yeah, the main reason was to get the resource channel. With the old watcher you get raw events and have to handle tracking resources based on the event types (put/delete/etc.).
I figured we want to move to the new generic watcher anyway, and this will make it more obvious/accessible for future use.
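As an aside, a sketch of the loop shape described in this thread: CA updates arrive on the generic watcher's resources channel while signing probes happen once per tick. The `activeChecker` type and its method names are illustrative, not the PR's actual API.

```go
package keystore

import (
	"context"

	"github.com/gravitational/teleport/api/types"
	"github.com/jonboulle/clockwork"
)

// activeChecker is a hypothetical stand-in for the PR's checker type;
// updateSigners and checkNextSigner are illustrative names only.
type activeChecker struct{ /* ... */ }

func (c *activeChecker) updateSigners(cas []types.CertAuthority)  { /* refresh the tested keys */ }
func (c *activeChecker) checkNextSigner(ctx context.Context)      { /* one signing probe */ }

// run drives the checker: CA changes are consumed from the watcher's
// resources channel, while signing probes are rate-limited to one per tick.
func (c *activeChecker) run(ctx context.Context, ticker clockwork.Ticker, resources <-chan []types.CertAuthority) error {
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case cas := <-resources:
			c.updateSigners(cas) // react to CA changes as they happen
		case <-ticker.Chan():
			c.checkNextSigner(ctx) // keep KMS load to one request per tick
		}
	}
}
```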
```go
if last == nil {
    return c.signers[0], nil
}
for i := range c.signers {
    if c.signers[i].Equal(last) {
        return c.signers[(i+1)%n], nil
    }
}
return c.signers[0], nil
```
The process of selecting the next signer is a bit funky. The local signer store isn't sorted, and the API doesn't specify that the upstream data source sending the new signer list needs to send them sorted. Additionally, this seeking logic just always restarts at the beginning if it can't find where it left off.
Admittedly, the set of signers for a cluster is small and relatively static, so these may be non-issues. I doubt we'd actually have problems unless something else was going wrong, but if you find the time I think it would be better practice to do one of the following:
- Sort the local signer store by `<domain-name>/<kind>` (or vice versa) and then select the lexicographically next value. That way we don't need to rely on the upstream yielding a sorted sequence, and can continue where we left off in the face of removals.
- Do away with stepping through one signer each "tick" in favor of just health checking all of them each tick. If load is a concern, adding a couple hundred ms of rate limiting to the inner check loop would probably be fine.
Good point, I will update to use the lexical sorting approach.
This is mostly intended for KMS integrations, which have quota and cost implications. For example, AWS KMS asymmetric crypto operations have a default 1,000 req/second limit (per region, not per key) and cost $0.15 per 10,000 requests.
At a rate of 1 req/minute (about 525,600 requests per year) the cost works out to $7.88 per key per year. Or just $7.88 total, since we're doing 1 key at a time.
We don't do it often, but we did add two CAs in v18, and each CA can contain multiple keys, so I'd rather keep the load/cost of this small and predictable.
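A minimal sketch of the lexical-sorting idea under the assumptions above; the `signerEntry` type, its fields, and the `<domain-name>/<kind>` key format are illustrative, not the PR's actual types.

```go
package keystore

import (
	"cmp"
	"crypto"
	"slices"
)

// signerEntry is a hypothetical stand-in for whatever the local signer store holds.
type signerEntry struct {
	domainName string
	kind       string
	signer     crypto.Signer
}

// sortKey builds the "<domain-name>/<kind>" key suggested in the review.
func sortKey(e signerEntry) string {
	return e.domainName + "/" + e.kind
}

// nextSigner returns the lexicographically next entry after lastKey, wrapping
// to the first entry. It does not rely on the upstream sending a sorted list,
// and it keeps its place even if the previously checked signer was removed.
func nextSigner(entries []signerEntry, lastKey string) (signerEntry, bool) {
	if len(entries) == 0 {
		return signerEntry{}, false
	}
	slices.SortFunc(entries, func(a, b signerEntry) int {
		return cmp.Compare(sortKey(a), sortKey(b))
	})
	for _, e := range entries {
		if sortKey(e) > lastKey {
			return e, true
		}
	}
	return entries[0], true // wrap around past the end
}
```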