
Conversation

@dboslee
Contributor

@dboslee dboslee commented Dec 3, 2025


The active health checker performs signing requests against CAs.

Failed signing requests are reported back to a callback which will trip Teleport's readiness state.

The health checker watches CA events to ensure that the keys being tested are kept up to date.

This can be enabled by configuring:

```yaml
keystore:
  health_check:
    active:
      enabled: true
```

This config format leaves room for adding alternative health checking methods and additional parameters for the active health checker.
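Below is a minimal sketch, not the PR's actual wiring, of how a failure callback like the one described above could gate a readiness probe; the `readiness` type and its handler are hypothetical stand-ins.

```go
package example

import (
	"net/http"
	"sync"
)

// readiness is a hypothetical stand-in for Teleport's readiness state; the
// health checker would report signing failures through Callback.
type readiness struct {
	mu      sync.Mutex
	lastErr error
}

// Callback matches the shape described above: a non-nil error trips the
// readiness state, nil clears it once signing succeeds again.
func (r *readiness) Callback(err error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.lastErr = err
}

// ServeHTTP exposes the state as a /readyz-style probe.
func (r *readiness) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.lastErr != nil {
		http.Error(w, "keystore health check failing: "+r.lastErr.Error(), http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}
```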
@dboslee dboslee force-pushed the david/kms-healthcheck branch 3 times, most recently from 10c171e to 4ea3649 Compare December 4, 2025 02:33
@dboslee dboslee force-pushed the david/kms-healthcheck branch from 4ea3649 to 107e524 Compare December 4, 2025 19:08
@dboslee dboslee force-pushed the david/kms-healthcheck branch from 3d8584d to d7e281a Compare December 5, 2025 20:19
@dboslee dboslee force-pushed the david/kms-healthcheck branch from 0771836 to 3e0a81e Compare December 5, 2025 20:38
@dboslee dboslee marked this pull request as ready for review December 5, 2025 20:44
```go
hc, err := NewActiveHealthChecker(ActiveHealthCheckConfig{
	Callback: func(err error) {
		calls = append(calls, err)
		wg.Done()
```
Contributor

Seems this may be called twice; the flaky test detector is catching a negative waitgroup counter.

You could probably try using synctest here instead of a fake clock (https://pkg.go.dev/testing/synctest). Then you could just "sleep" instead of advancing the clock, and if done right it should guarantee the health check runs before your test continues.
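A minimal sketch of the pattern the reviewer is suggesting, assuming Go 1.25's `testing/synctest`; the ticker-driven goroutine is a stand-in for the health checker's loop, not the PR's code.

```go
package example

import (
	"testing"
	"testing/synctest"
	"time"
)

func TestHealthCheckTick(t *testing.T) {
	synctest.Test(t, func(t *testing.T) {
		done := make(chan struct{})
		go func() {
			// Stand-in for the health checker's ticker-driven loop.
			time.Sleep(time.Minute)
			close(done)
		}()
		// Inside the bubble, time.Sleep uses a fake clock that only advances
		// when every goroutine is durably blocked, so this sleep is
		// guaranteed to outlast the tick above.
		time.Sleep(2 * time.Minute)
		synctest.Wait()
		select {
		case <-done:
		default:
			t.Fatal("health check loop never ran")
		}
	})
}
```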

Contributor Author

👍 I'll give it a shot, I haven't messed with synctest before.

Contributor Author

good to know this is an option, thanks. updated in 214f551

Comment on lines 319 to 322
```go
func (m *Manager) GetTLSSigner(ctx context.Context, ca types.CertAuthority) (crypto.Signer, error) {
	_, signer, err := m.getTLSCertAndSigner(ctx, ca.GetActiveKeys())
	return signer, trace.Wrap(err)
}
```
Contributor

couldn't you just call GetTLSCertAndSigner instead of adding this method?

Contributor Author

Yeah I was trying to make the API more consistent but I could also just drop the cert in the health checker

Contributor

doesn't really matter, but if you prefer to add the GetTLSSigner method, can you just move it right next to GetTLSCertAndSigner?

Contributor Author

moved in 3b14f71

```go
}

caService := local.NewCAService(bk)
//nolint:staticcheck // SA1019 This should be updated to use [services.NewCertAuthorityWatcher]
```
Contributor

this test of the deprecated watcher should not be updated though

Contributor Author

I think this is just a bad diff, the deprecated watcher test is not modified.

Contributor

I think it's right, this test should exercise the deprecated watcher; I just don't think the comment should say it should be updated (I assume it's a copy/paste error).

Contributor Author

I misunderstood. I will update to say something along the lines of "remove the test when the deprecated method is no longer used"

Contributor Author

updated the lint comment in 6b773ab

Comment on lines 158 to 162
```go
select {
case <-ctx.Done():
	return ctx.Err()
case <-ticker.Chan():
}
```
Contributor

Why do you need the CA watcher instead of just reading all the CAs on every loop? They should all be able to be read from the cache.

Contributor Author

I would expect a watcher to react to changes faster than polling; it also avoids the unnecessary work of polling at an interval.

Isn't the watcher just an abstraction over the cache?

Not against changing it just trying to understand why we would prefer polling over receiving events.

Contributor Author

Sorry, I didn't consider where you were leaving this comment. Doing the polling within the run loop like you are suggesting wouldn't have the timing issue I was concerned about.

How would polling the cache behave vs a watcher if the backend is unreachable/unhealthy?

Contributor

Well, at first I thought you were using the watcher so that you could react faster to changes, but then I saw here that it still only checks on the ticker interval and wondered what the watcher was for. On second thought, it's actually not clear when you would poll for CAs with this model where you check just one key every tick, and what you have probably works better. I was just surprised to see a whole new CA watcher. I guess it has some API the old one didn't have?

Contributor Author

Yeah, the main reason was to get the resource channel. With the old watcher you get raw events and have to handle tracking resources based on the event types (put/delete/etc.).

I figured we want to move to the new generic watcher anyway, and this will make it more obvious/accessible for future use.
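For contrast, here is a hedged sketch of the bookkeeping a raw-event watcher pushes onto the caller, which a resource-channel watcher handles for you; the Teleport identifiers (`types.Event`, `types.OpPut`, `types.OpDelete`) are recalled from memory, not verified against this PR.

```go
package example

import "github.com/gravitational/teleport/api/types"

// trackCAs folds raw put/delete events into a map keyed by CA name, the
// kind of tracking the generic resource-channel watcher does internally.
func trackCAs(events <-chan types.Event) map[string]types.CertAuthority {
	cas := make(map[string]types.CertAuthority)
	for event := range events {
		switch event.Type {
		case types.OpPut:
			if ca, ok := event.Resource.(types.CertAuthority); ok {
				cas[ca.GetName()] = ca
			}
		case types.OpDelete:
			delete(cas, event.Resource.GetName())
		}
	}
	return cas
}
```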

Comment on lines +206 to +214
```go
if last == nil {
	return c.signers[0], nil
}
for i := range c.signers {
	if c.signers[i].Equal(last) {
		return c.signers[(i+1)%n], nil
	}
}
return c.signers[0], nil
```
Contributor

The process of selecting the next signer is a bit funky. The local signer store isn't sorted, and the API doesn't specify that the upstream data source sending the new signer list needs to send them sorted. Additionally, this seeking logic just always restarts at the beginning if it can't find where it left off.

Admittedly, the set of signers for a cluster is small and relatively static, so these may be non-issues. I doubt we'd actually have problems unless something else was going wrong, but if you find the time I think it would be better practice to do one of the following:

  • Sort the local signer store by <domain-name>/<kind> (or vice versa) and then select the lexicographically next value. That way we don't need to rely on the upstream yielding a sorted sequence, and can continue where we left off in the face of removals.

  • Do away with stepping through one signer each "tick" in favor of just health checking all of them each tick. If load is a concern, adding a couple hundred ms of rate limiting to the inner check loop would probably be fine.

Contributor Author

Good point, I will update to use the lexical sorting approach.

This is mostly intended for KMS integrations, which have quota and cost implications. For example, AWS KMS asymmetric crypto operations have a default 1,000 req/second limit (per region, not per key) and cost $0.15 per 10,000 requests.

At a rate of 1 req/minute (60 × 24 × 365 ≈ 525,600 requests per year), the cost works out to $7.88 per key per year. Or just $7.88 total, since we're only checking one key at a time.

We don't do it often, but we did add two CAs in v18, and each CA can contain multiple keys, so I'd rather keep the load/cost of this small and predictable.
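A hedged sketch of the lexical-selection idea agreed on above: sort keys formatted as `<domain-name>/<kind>`, then pick the entry strictly greater than the last-checked key, wrapping around. The `nextSigner` helper and string keys are hypothetical stand-ins, not the PR's code.

```go
package example

import "sort"

// nextSigner returns the key to health check after last, or the smallest
// key when last is empty or nothing sorts after it.
func nextSigner(keys []string, last string) string {
	if len(keys) == 0 {
		return ""
	}
	sorted := append([]string(nil), keys...)
	sort.Strings(sorted)
	for _, k := range sorted {
		if k > last {
			// Continue from where we left off, even if last itself was removed.
			return k
		}
	}
	// last was the largest key: wrap around to the beginning.
	return sorted[0]
}
```

Calling `nextSigner(keys, "")` yields the smallest key, so a fresh checker starts at the beginning, and removals never reset the cursor because selection depends only on ordering, not on finding the previous key in the list.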
