
Conversation

@dboslee
Contributor

@dboslee dboslee commented Dec 3, 2025


The active health checker performs signing requests against CAs.

Failed signing requests are reported back to a callback which will trip Teleport's readiness state.

The health checker watches CA events to ensure that the keys being tested are kept up to date.

This can be enabled by configuring:

```yaml
keystore:
  health_check:
    active:
      enabled: true
```

This config format leaves room for adding alternative health checking methods and additional parameters for the active health checker.
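Below is a minimal sketch, not the PR's actual wiring, of how a failure callback like the one described above could gate a readiness probe; the `readiness` type and its handler are hypothetical stand-ins.

```go
package example

import (
	"net/http"
	"sync"
)

// readiness is a hypothetical stand-in for Teleport's readiness state; the
// health checker would report signing failures through Callback.
type readiness struct {
	mu      sync.Mutex
	lastErr error
}

// Callback matches the shape described above: a non-nil error trips the
// readiness state, nil clears it once signing succeeds again.
func (r *readiness) Callback(err error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.lastErr = err
}

// ServeHTTP exposes the state as a /readyz-style probe.
func (r *readiness) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.lastErr != nil {
		http.Error(w, "keystore health check failing: "+r.lastErr.Error(), http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}
```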
@dboslee dboslee force-pushed the david/kms-healthcheck branch 3 times, most recently from 10c171e to 4ea3649 Compare December 4, 2025 02:33
@dboslee dboslee force-pushed the david/kms-healthcheck branch from 4ea3649 to 107e524 Compare December 4, 2025 19:08
@dboslee dboslee force-pushed the david/kms-healthcheck branch from 3d8584d to d7e281a Compare December 5, 2025 20:19
@dboslee dboslee force-pushed the david/kms-healthcheck branch from 0771836 to 3e0a81e Compare December 5, 2025 20:38
@dboslee dboslee marked this pull request as ready for review December 5, 2025 20:44
```go
hc, err := NewActiveHealthChecker(ActiveHealthCheckConfig{
	Callback: func(err error) {
		calls = append(calls, err)
		wg.Done()
```
Contributor

Seems this may be called twice; the flaky test detector is catching a negative waitgroup counter.

You could probably try using synctest here instead of a fake clock (https://pkg.go.dev/testing/synctest). Then you could just "sleep" instead of advancing the clock, and if done right it should guarantee the health check runs before your test continues.
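A minimal sketch of the pattern the reviewer is suggesting, assuming Go 1.25's `testing/synctest`; the ticker-driven goroutine is a stand-in for the health checker's loop, not the PR's code.

```go
package example

import (
	"testing"
	"testing/synctest"
	"time"
)

func TestHealthCheckTick(t *testing.T) {
	synctest.Test(t, func(t *testing.T) {
		done := make(chan struct{})
		go func() {
			// Stand-in for the health checker's ticker-driven loop.
			time.Sleep(time.Minute)
			close(done)
		}()
		// Inside the bubble, time.Sleep uses a fake clock that only advances
		// when every goroutine is durably blocked, so this sleep is
		// guaranteed to outlast the tick above.
		time.Sleep(2 * time.Minute)
		synctest.Wait()
		select {
		case <-done:
		default:
			t.Fatal("health check loop never ran")
		}
	})
}
```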

Contributor Author

👍 I'll give it a shot, I haven't messed with synctest before.

Contributor Author

good to know this is an option, thanks. updated in 214f551

Comment on lines 319 to 322
```go
func (m *Manager) GetTLSSigner(ctx context.Context, ca types.CertAuthority) (crypto.Signer, error) {
	_, signer, err := m.getTLSCertAndSigner(ctx, ca.GetActiveKeys())
	return signer, trace.Wrap(err)
}
```
Contributor

couldn't you just call GetTLSCertAndSigner instead of adding this method?

Contributor Author

Yeah I was trying to make the API more consistent but I could also just drop the cert in the health checker

Contributor

doesn't really matter, but if you prefer to add the GetTLSSigner method, can you just move it right next to GetTLSCertAndSigner?

Contributor Author

moved in 3b14f71

```go
}

caService := local.NewCAService(bk)
//nolint:staticcheck // SA1019 This should be updated to use [services.NewCertAuthorityWatcher]
```
Contributor

this test of the deprecated watcher should not be updated though

Contributor Author

I think this is just a bad diff, the deprecated watcher test is not modified.

Contributor

I think it's right, this test should exercise the deprecated watcher; I just don't think the comment should say it should be updated (I assume it's a copy/paste error).

Contributor Author

I misunderstood. I will update to say something along the lines of "remove the test when the deprecated method is no longer used"

Contributor Author

updated the lint comment in 6b773ab

Comment on lines 158 to 162
```go
select {
case <-ctx.Done():
	return ctx.Err()
case <-ticker.Chan():
}
```
Contributor

Why do you need the CA watcher instead of just reading all the CAs on every loop? They should all be able to be read from the cache.

Contributor Author

I would expect a watcher to react to changes faster than polling; it also avoids the unnecessary work of polling at an interval.

Isn't the watcher just an abstraction over the cache?

Not against changing it just trying to understand why we would prefer polling over receiving events.

Contributor Author

Sorry, I didn't consider where you were leaving this comment. Doing the polling within the run loop like you are suggesting wouldn't have the timing issue I was concerned about.

How would polling the cache behave vs a watcher if the backend is unreachable/unhealthy?

Contributor

Well, at first I thought you were using the watcher so that you could react faster to changes, but then I saw here that it still only checks on the ticker interval and wondered what the watcher was for. On second thought, it's actually not clear when you would poll for CAs with this model where you check just one key every tick, and what you have probably works better. I was just surprised to see a whole new CA watcher. I guess it has some API the old one didn't have?

Contributor Author

Yeah, the main reason was to get the resource channel. With the old watcher you get raw events and have to handle tracking resources based on the event types (put/delete/etc.).

I figured we want to move to the new generic watcher anyway, and this will make it more obvious/accessible for future use.
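For contrast, here is a hedged sketch of the bookkeeping a raw-event watcher pushes onto the caller, which a resource-channel watcher handles for you; the Teleport identifiers (`types.Event`, `types.OpPut`, `types.OpDelete`) are recalled from memory, not verified against this PR.

```go
package example

import "github.com/gravitational/teleport/api/types"

// trackCAs folds raw put/delete events into a map keyed by CA name, the
// kind of tracking the generic resource-channel watcher does internally.
func trackCAs(events <-chan types.Event) map[string]types.CertAuthority {
	cas := make(map[string]types.CertAuthority)
	for event := range events {
		switch event.Type {
		case types.OpPut:
			if ca, ok := event.Resource.(types.CertAuthority); ok {
				cas[ca.GetName()] = ca
			}
		case types.OpDelete:
			delete(cas, event.Resource.GetName())
		}
	}
	return cas
}
```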

Comment on lines +206 to +214
```go
if last == nil {
	return c.signers[0], nil
}
for i := range c.signers {
	if c.signers[i].Equal(last) {
		return c.signers[(i+1)%n], nil
	}
}
return c.signers[0], nil
```
Contributor

The process of selecting the next signer is a bit funky. The local signer store isn't sorted, and the API doesn't specify that the upstream data source sending the new signer list needs to send them sorted. Additionally, this seeking logic just always restarts at the beginning if it can't find where it left off.

Admittedly, the set of signers for a cluster is small and relatively static, so these may be non-issues. I doubt we'd actually have problems unless something else was going wrong, but if you find the time I think it would be better practice to do one of the following:

  • Sort the local signer store by <domain-name>/<kind> (or vice versa) and then select the lexicographically next value. That way we don't need to rely on the upstream yielding a sorted sequence, and can continue where we left off in the face of removals.

  • Do away with stepping through one signer each "tick" in favor of just health checking all of them each tick. If load is a concern, adding a couple hundred ms of rate limiting to the inner check loop would probably be fine.

Contributor Author

Good point, I will update to use the lexical sorting approach.

This is mostly intended for KMS integrations, which have quota and cost implications. For example, AWS KMS asymmetric crypto operations have a default 1,000 req/second limit (per region, not per key) and cost $0.15 per 10,000 requests.

At a rate of 1 req/minute (60 × 24 × 365 ≈ 525,600 requests per year), the cost works out to $7.88 per key per year. Or just $7.88 total, since we're only checking one key at a time.

We don't do it often, but we did add two CAs in v18, and each CA can contain multiple keys, so I'd rather keep the load/cost of this small and predictable.
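A hedged sketch of the lexical-selection idea agreed on above: sort keys formatted as `<domain-name>/<kind>`, then pick the entry strictly greater than the last-checked key, wrapping around. The `nextSigner` helper and string keys are hypothetical stand-ins, not the PR's code.

```go
package example

import "sort"

// nextSigner returns the key to health check after last, or the smallest
// key when last is empty or nothing sorts after it.
func nextSigner(keys []string, last string) string {
	if len(keys) == 0 {
		return ""
	}
	sorted := append([]string(nil), keys...)
	sort.Strings(sorted)
	for _, k := range sorted {
		if k > last {
			// Continue from where we left off, even if last itself was removed.
			return k
		}
	}
	// last was the largest key: wrap around to the beginning.
	return sorted[0]
}
```

Calling `nextSigner(keys, "")` yields the smallest key, so a fresh checker starts at the beginning, and removals never reset the cursor because selection depends only on ordering, not on finding the previous key in the list.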
