
monitor lifecycle conductor #2723

Open

benzekrimaha wants to merge 12 commits into development/9.3 from improvement/BB-740-monitor-lifecycle-conductor

Conversation

@benzekrimaha
Contributor

@benzekrimaha benzekrimaha commented Mar 2, 2026

Issue: BB-740

@bert-e
Contributor

bert-e commented Mar 2, 2026

Hello benzekrimaha,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options
  • /after_pull_request: Wait for the given pull request id to be merged before continuing with the current one.
  • /bypass_author_approval: Bypass the pull request author's approval.
  • /bypass_build_status: Bypass the build and test status.
  • /bypass_commit_size: Bypass the check on the size of the changeset. (TBA)
  • /bypass_incompatible_branch: Bypass the check on the source branch prefix.
  • /bypass_jira_check: Bypass the Jira issue check.
  • /bypass_peer_approval: Bypass the pull request peers' approval.
  • /bypass_leader_approval: Bypass the pull request leaders' approval.
  • /approve: Instruct Bert-E that the author has approved the pull request. ✍️
  • /create_pull_requests: Allow the creation of integration pull requests.
  • /create_integration_branches: Allow the creation of integration branches.
  • /no_octopus: Prevent Wall-E from doing any octopus merge and use multiple consecutive merges instead.
  • /unanimity: Change review acceptance criteria from at least one reviewer to all reviewers.
  • /wait: Instruct Bert-E not to run until further notice.

Available commands
  • /help: Print Bert-E's manual in the pull request.
  • /status: Print Bert-E's current status in the pull request. (TBA)
  • /clear: Remove all comments from Bert-E from the history. (TBA)
  • /retry: Re-start a fresh build. (TBA)
  • /build: Re-start a fresh build. (TBA)
  • /force_reset: Delete integration branches & pull requests, and restart the merge process from the beginning.
  • /reset: Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

@bert-e
Contributor

bert-e commented Mar 2, 2026

Incorrect fix version

The Fix Version/s in issue BB-740 contains:

  • 9.3.0

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

  • 9.3.1

Please check the Fix Version/s of BB-740, or the target
branch of this pull request.

@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from 380069a to 25ea9d5 Compare March 2, 2026 16:32
@codecov

codecov bot commented Mar 2, 2026

Codecov Report

❌ Patch coverage is 82.79570% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.45%. Comparing base (145f7a6) to head (c946f41).
⚠️ Report is 25 commits behind head on development/9.3.

Files with missing lines Patch % Lines
extensions/lifecycle/LifecycleMetrics.js 75.67% 9 Missing ⚠️
...ecycle/bucketProcessor/LifecycleBucketProcessor.js 72.72% 3 Missing ⚠️
...sions/lifecycle/tasks/LifecycleDeleteObjectTask.js 66.66% 1 Missing ⚠️
extensions/lifecycle/tasks/LifecycleTask.js 66.66% 1 Missing ⚠️
...s/lifecycle/tasks/LifecycleUpdateExpirationTask.js 66.66% 1 Missing ⚠️
...s/lifecycle/tasks/LifecycleUpdateTransitionTask.js 66.66% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

Files with missing lines Coverage Δ
...tensions/lifecycle/conductor/LifecycleConductor.js 84.30% <100.00%> (+0.59%) ⬆️
extensions/lifecycle/tasks/LifecycleTaskV2.js 88.88% <ø> (ø)
...sions/lifecycle/tasks/LifecycleDeleteObjectTask.js 92.25% <66.66%> (-0.51%) ⬇️
extensions/lifecycle/tasks/LifecycleTask.js 91.41% <66.66%> (-0.14%) ⬇️
...s/lifecycle/tasks/LifecycleUpdateExpirationTask.js 80.51% <66.66%> (-0.57%) ⬇️
...s/lifecycle/tasks/LifecycleUpdateTransitionTask.js 91.17% <66.66%> (-0.75%) ⬇️
...ecycle/bucketProcessor/LifecycleBucketProcessor.js 79.41% <72.72%> (-0.47%) ⬇️
extensions/lifecycle/LifecycleMetrics.js 89.24% <75.67%> (-8.97%) ⬇️

... and 5 files with indirect coverage changes

Components Coverage Δ
Bucket Notification 80.37% <ø> (ø)
Core Library 80.57% <ø> (-0.13%) ⬇️
Ingestion 71.14% <ø> (ø)
Lifecycle 78.63% <82.79%> (+0.01%) ⬆️
Oplog Populator 85.83% <ø> (ø)
Replication 59.61% <ø> (-0.04%) ⬇️
Bucket Scanner 85.76% <ø> (ø)
@@                 Coverage Diff                 @@
##           development/9.3    #2723      +/-   ##
===================================================
- Coverage            74.48%   74.45%   -0.04%     
===================================================
  Files                  200      200              
  Lines                13603    13680      +77     
===================================================
+ Hits                 10132    10185      +53     
- Misses                3461     3485      +24     
  Partials                10       10              
Flag Coverage Δ
api:retry 9.09% <0.00%> (-0.06%) ⬇️
api:routes 8.91% <0.00%> (-0.06%) ⬇️
bucket-scanner 85.76% <ø> (ø)
ft_test:queuepopulator 9.05% <7.52%> (-0.98%) ⬇️
ingestion 12.48% <0.00%> (-0.08%) ⬇️
lib 7.58% <0.00%> (-0.04%) ⬇️
lifecycle 19.04% <58.06%> (+0.19%) ⬆️
notification 1.02% <0.00%> (-0.01%) ⬇️
oplogPopulator 0.14% <0.00%> (-0.01%) ⬇️
replication 18.44% <7.52%> (-0.06%) ⬇️
unit 51.16% <77.41%> (+0.14%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch 5 times, most recently from 8316f88 to 408c96c Compare March 11, 2026 16:03
@benzekrimaha benzekrimaha marked this pull request as ready for review March 11, 2026 16:35
@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from 408c96c to e1c5b13 Compare March 11, 2026 16:48
@francoisferrand francoisferrand requested a review from delthas March 18, 2026 09:19
@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from e1c5b13 to aefb677 Compare March 18, 2026 10:04
@benzekrimaha benzekrimaha changed the title Improvement/bb 740 monitor lifecycle conductor Improvement/BB-740 monitor lifecycle conductor Mar 18, 2026
@francoisferrand francoisferrand changed the title Improvement/BB-740 monitor lifecycle conductor monitor lifecycle conductor Mar 18, 2026
const log = this.logger.newRequestLogger();
const start = new Date();
const start = Date.now();
this._scanId = uuid();

Hmm, we're storing the scan ID as a "global" field variable, but it sounds like it is really relevant/used only inside this function (through indirect calls). Could we drop the global field and instead pass it through to whatever uses it? Maybe in _createBucketTaskMessages?

@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from 725c3df to 11a94ea Compare March 19, 2026 09:36
@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from a2128cf to a464b39 Compare March 19, 2026 09:43
@bert-e
Contributor

bert-e commented Mar 26, 2026

Incorrect fix version

The Fix Version/s in issue BB-740 contains:

  • None

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

  • 9.3.2

Please check the Fix Version/s of BB-740, or the target
branch of this pull request.

@claude

claude bot commented Mar 26, 2026

  • Scan-end debounce timer not cleared on shutdown: _scanEndTimeout in LifecycleBucketProcessor is never cleared when the process stops. If the bucket processor shuts down while the 30s debounce is pending, the callback fires on a partially torn-down instance. Clear it in the stop/close path.
  • Scan start timestamp inaccuracy: onBucketProcessorScanStart records Date.now() at the time the bucket processor sees the first Kafka message, not when the conductor actually started the scan. The Bucket Processor Scan Progress dashboard panel will undercount scan duration by the Kafka delivery lag. Consider passing the conductor start timestamp via contextInfo.
  • bucketProcessorBucketsCount.reset() clears all label combinations: Calling .reset() on a gauge with labels zeros out all label combos, not just the current bucket_source. Fine if each deployment has one source, but worth documenting.
  • lifecycle_bucket metric emits 0 for bucketd: The conductorScanCount metric with type=lifecycle_bucket will always be 0 for bucketd since isLifecycled is never set in listBucketdBuckets. The dashboard handles this correctly by summing bucket processor counts for bucketd, but the raw metric could be misleading without context.

    Review by Claude Code
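The first point above can be sketched with a minimal stand-alone model; this is not the actual Backbeat class, and the method names and close() callback convention are assumptions:

```javascript
// Minimal model of the suggested fix: track the scan-end debounce
// timer and clear it in close(), so the callback cannot fire on a
// partially torn-down processor. All names are hypothetical.
class BucketProcessorModel {
    constructor() {
        this._scanEndTimeout = null;
    }

    _scheduleScanEnd(delayMs, onScanEnd) {
        // debounce: reset the timer every time a message is seen
        clearTimeout(this._scanEndTimeout);
        this._scanEndTimeout = setTimeout(() => {
            this._scanEndTimeout = null;
            onScanEnd();
        }, delayMs);
    }

    close(cb) {
        // the fix: clear the pending timer before tearing down
        if (this._scanEndTimeout) {
            clearTimeout(this._scanEndTimeout);
            this._scanEndTimeout = null;
        }
        process.nextTick(cb);
    }
}
```

With this shape, a shutdown while the debounce is pending simply drops the callback instead of running it against freed resources.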

- Clear _scanEndTimeout in close() to prevent callback firing
  on a partially torn-down bucket processor instance
- Propagate conductor's start timestamp (Date.now() at scan
  start) via contextInfo.scanStartTimestamp instead of using
  Date.now() in the bucket processor, so the scan progress
  dashboard accurately reflects the full scan duration
- Document that bucketProcessorBucketsCount.reset() clears all
  label combinations (acceptable: single bucket source per
  deployment)
- Update tests and BackbeatTestConsumer normalizer for the new
  scanStartTimestamp field
@claude

claude bot commented Mar 26, 2026

LGTM

Review by Claude Code

@benzekrimaha benzekrimaha requested a review from delthas March 30, 2026 16:52
});
LifecycleMetrics.onBucketProcessorScanEnd(this._log);
this._processorScanMetricsActive = false;
}, SCAN_END_DEBOUNCE_DELAY_MS);

"Debouncing" is a flaky signal: a new scan may start very quickly, yet still be a different scan.
The information is computed through conductorScanId, so that is what we should rely on.

Additionally, a plain gauge may be a bit hard to use; there may be a better way, e.g. a gauge whose value is the number of scans (incremented each time), or one where the scanId is a label (so we get a different series each time).


Also, there may be a longer delay between messages, especially on a loaded system where the circuit breaker kicks in and delays processing.
→ cannot use a timeout.

const conductorScanId = result.contextInfo && result.contextInfo.conductorScanId;
const scanStartTimestamp = (result.contextInfo
&& result.contextInfo.scanStartTimestamp) || Date.now();
if (conductorScanId && conductorScanId !== this._metricsScanId) {

is this ordering guaranteed?

since there are multiple (asynchronous) consumers, each producing "continuation" messages which are pushed to any partition (round-robin): are we sure a new "start of scan" can only happen after the current scan has completed?

there is a mechanism which is supposed to delay starting a new scan until the previous one has completed, but it relies on an asynchronous mechanism (delayed publishing of the offset in zookeeper): so not sure how reliable it is in corner cases (e.g. next attempt at exactly the wrong time)?

--
metrics should be designed to work all the time, and especially when things go wrong, e.g. if 2 scans run in parallel. How do these metrics degrade when things go wrong? Can we keep observability in that case?

(For example, the scan starts in the conductor: so the conductor could push one metric with the scanId as a label, and all the other pods publish a metric with the same label: then we could just look at the time series, see where each scan starts and ends, and still be able to observe if multiple scans happen in parallel)
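The idea of keeping observability even when two scans overlap could look roughly like this; it is a plain-JS stand-in for a Prometheus gauge with the scan id as a label, and all names are hypothetical:

```javascript
// One series per conductorScanId: overlapping scans produce two
// distinct series instead of clobbering a single debounced value.
class PerScanSeries {
    constructor() {
        this.series = new Map(); // scanId -> { start, end, bucketsDone }
    }

    onScanStart(scanId, now) {
        if (!this.series.has(scanId)) {
            this.series.set(scanId, { start: now, end: null, bucketsDone: 0 });
        }
    }

    onBucketDone(scanId) {
        const s = this.series.get(scanId);
        if (s) { s.bucketsDone += 1; }
    }

    onScanEnd(scanId, now) {
        const s = this.series.get(scanId);
        if (s) { s.end = now; }
    }

    // scans that have started but not ended: parallel scans stay visible
    activeScans() {
        return [...this.series.entries()]
            .filter(([, s]) => s.end === null)
            .map(([id]) => id);
    }
}
```

Because each scan id is its own series, a second scan starting before the first one ends shows up as two active series rather than corrupting a single value.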

// This ensures we count actual processing completion, not just dispatch
// Only count successful completions, not failures
if (!err) {
const bucketSource = result.contextInfo && result.contextInfo.bucketSource;

Suggested change
const bucketSource = result.contextInfo && result.contextInfo.bucketSource;
const bucketSource = result.contextInfo?.bucketSource;

if (!err) {
const bucketSource = result.contextInfo && result.contextInfo.bucketSource;
LifecycleMetrics.onBucketProcessorBucketDone(
this._log, bucketSource, conductorScanId

the bucket name should not be in the metrics: cardinality is too high (and bucket names have no meaning to the admins who use monitoring)

Comment on lines +318 to +320
conductorScanId: scanId,
scanStartTimestamp,
bucketSource: this._bucketSource,

instead of passing params, should we pass this as part of the task? or as a class variable, like the (pre-existing) this._batchInProgress?

it would greatly simplify passing data through the different layers....

Comment on lines +240 to +243
log, err, 'LifecycleMetrics.onConductorFullScan', {
elapsedMs, bucketCount, workflowCount,
lifecycleBucketCount, bucketSource,
}

all these parameters are not needed by the function, only for a log in case of error, which should never happen anyway (the only causes are wrong labels or invalid values): so best keep it simple, with only the parameters that are needed (and log the error and the actual parameters passed)

}

/**
* Signal that a bucket processor has started working on a scan.

this is not doing any signaling, just updating metrics


const conductorScanCount = ZenkoMetrics.createGauge({
name: 's3_lifecycle_conductor_scan_count',
help: 'Number of buckets and queued workflows for the latest full lifecycle conductor scan',

buckets or workflows?

});

const conductorScanCount = ZenkoMetrics.createGauge({
name: 's3_lifecycle_conductor_scan_count',

  • a "scan count" would be the number of scans, i.e. incremented on every new scan. That is not what this metric is: it is the number of buckets scanned
  • according to best practices for naming metrics, "count" is mostly used for Counters, so we need to be more precise, with an additional segment such as "current"
  • this is not a "gauge" in the usual sense (tracking data in ~real time): it just indicates the total/final number of scanned buckets in the last scan

so something like s3_lifecycle_conductor_last_scan_buckets_count, or better yet s3_lifecycle_latest_batch_size (which is consistent with the existing metrics)

const bucketProcessorBucketsCount = ZenkoMetrics.createGauge({
name: 's3_lifecycle_bucket_processor_buckets_count',
help: 'Number of buckets successfully processed by '
+ 'this bucket processor during the current scan',

similarly, this is not the number of buckets processed during the scan, but the total number at the end of the scan → the name is not good

more fundamentally, this metric is only updated at the end of the scan: until the scan ends, we have no idea what is going on. Better to have dynamic metrics, so we can observe the behavior in every situation.

@bert-e
Contributor

bert-e commented Apr 3, 2026

Incorrect fix version

The Fix Version/s in issue BB-740 contains:

  • None

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

  • 9.3.3

Please check the Fix Version/s of BB-740, or the target
branch of this pull request.

name: 's3_lifecycle_bucket_processor_scan_bucket_done_count',
help: 'Number of buckets completed by this bucket processor for a scan',
labelNames: [LIFECYCLE_LABEL_ORIGIN, LIFECYCLE_LABEL_CONDUCTOR_SCAN_ID],
});

Unbounded label cardinality on conductor_scan_id.

bucketProcessorScanStartTime and bucketProcessorScanBucketDoneCount use conductor_scan_id (a new UUID every scan) as a label. Since old label values are never removed from the prom-client registry, the number of time series grows unboundedly with each scan cycle. This will cause increasing memory usage in the bucket processor over time.

Consider either: (1) calling .remove() on old label sets when a new scan starts, or (2) dropping the conductor_scan_id label entirely and resetting the gauge value on each new scan instead.

— Claude Code
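Option (1), removing old label sets as new scans start, might be sketched like this; the Map is a stand-in for the prom-client registry, and the cap value is illustrative:

```javascript
// Keep at most `max` per-scan series; when a new scan arrives past
// the cap, drop the oldest one. In prom-client this deletion would
// be gauge.remove({ conductor_scan_id: oldestId }).
class BoundedScanGauge {
    constructor(max) {
        this.max = max;
        this.order = [];         // scan ids, oldest first
        this.values = new Map(); // scanId -> gauge value
    }

    set(scanId, value) {
        if (!this.values.has(scanId)) {
            this.order.push(scanId);
            while (this.order.length > this.max) {
                const oldest = this.order.shift();
                this.values.delete(oldest); // evict oldest series
            }
        }
        this.values.set(scanId, value);
    }

    seriesCount() {
        return this.values.size;
    }
}
```

This keeps the time-series count constant regardless of how many scans have run, at the cost of losing history beyond the last `max` scans.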

const parsed = JSON.parse(taskMessage.message);
assert.strictEqual(parsed.contextInfo.conductorScanId, 'scan-id-test');
assert.strictEqual(parsed.contextInfo.scanStartTimestamp, 1706000000000);
});

Test passes extra args that _taskToMessage ignores — assertions will fail.

The test calls conductor._taskToMessage(getTask(true), lifecycleTaskVersions.v2, scan-id-test, 1706000000000, log) with 5 args, but _taskToMessage(task, taskVersion, log) only accepts 3 params and reads this._currentScanId / this._currentScanStartTimestamp from instance state. Since neither is set in beforeEach, both will be null, and the strictEqual assertions will fail.

Fix: set conductor._currentScanId and conductor._currentScanStartTimestamp before calling _taskToMessage, and pass only 3 args (task, taskVersion, log).

— Claude Code

});
sinon.stub(conductor, '_createBucketTaskMessages')
.callsFake((_a, _b, cb) => {
.callsFake((_a, _b, _c, _d, cb) => {

Stub arity mismatch — cb will be undefined.

This stub expects 5 params (_a, _b, _c, _d, cb) but _createBucketTaskMessages(tasks, log, cb) is called with only 3 args. So _c receives the callback, _d and cb are undefined, and cb(null, []) on line 306 will throw a TypeError. This should be .callsFake((_a, _b, cb) => ...).

— Claude Code


BUCKET_PROCESSOR_BUCKETS_COUNT = metrics.Metric(
's3_lifecycle_bucket_processor_buckets_count',
'bucket_source', job='${job_lifecycle_bucket_processor}', namespace='${namespace}',

Metric names in dashboard.py do not match LifecycleMetrics.js.

FULL_SCAN_ELAPSED references s3_lifecycle_conductor_full_scan_elapsed_seconds — this metric does not exist. The actual metrics are s3_lifecycle_latest_batch_start_time and s3_lifecycle_latest_batch_end_time.

SCAN_COUNT references s3_lifecycle_conductor_scan_count — does not exist. The actual metric is s3_lifecycle_latest_batch_size.

BUCKET_PROCESSOR_BUCKETS_COUNT references s3_lifecycle_bucket_processor_buckets_count — does not exist. The actual metric is s3_lifecycle_bucket_processor_scan_bucket_done_count.

The dashboard.json uses the correct names. These mismatches will produce empty Grafana panels when the dashboard is generated from this Python source.

— Claude Code

@claude

claude bot commented Apr 3, 2026

  • Unbounded Prometheus label cardinality: conductor_scan_id (a UUID per scan) is used as a gauge label on bucketProcessorScanStartTime and bucketProcessorScanBucketDoneCount. Old label sets are never removed, so time series count grows unboundedly.
    • Call .remove() on old label sets when a new scan starts, or drop the conductor_scan_id label and reset gauge values instead.
  • Test bug: _taskToMessage test will fail: Test passes 5 args but the method only accepts 3 and reads scan ID from instance state (which is null).
    • Set conductor._currentScanId and conductor._currentScanStartTimestamp before calling _taskToMessage, pass only 3 args.
  • Test bug: _createBucketTaskMessages stub arity mismatch: Stub expects 5 params but the method is called with 3, so cb is undefined and will throw TypeError.
    • Change to .callsFake((_a, _b, cb) => ...).
  • dashboard.py metric names do not match LifecycleMetrics.js: FULL_SCAN_ELAPSED, SCAN_COUNT, and BUCKET_PROCESSOR_BUCKETS_COUNT reference non-existent metric names. dashboard.json has the correct names.
    • Align dashboard.py metric names with the actual metrics defined in LifecycleMetrics.js.

Review by Claude Code

Replace timeout-derived lifecycle scan gauges with scan-id keyed series and align conductor metrics around latest batch start/end and size so dashboards remain reliable during delays and overlapping scans.
Drive bucket processor scan tracking from conductorScanId boundaries and simplify conductor scan context propagation/logging so observability remains correct even with delayed or overlapping message flows.
Use direct logger default-field attachment and include conductorScanId in logged task attributes across lifecycle tasks to keep request traces consistent through the pipeline.
@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from 6471e77 to a9ab181 Compare April 3, 2026 07:38
const bucketProcessorScanStartTime = ZenkoMetrics.createGauge({
name: 's3_lifecycle_bucket_processor_scan_start_time',
help: 'Timestamp (ms) when this bucket processor first saw a scan',
labelNames: [LIFECYCLE_LABEL_ORIGIN, LIFECYCLE_LABEL_CONDUCTOR_SCAN_ID],

Unbounded label cardinality: bucketProcessorScanStartTime and bucketProcessorScanBucketDoneCount use conductor_scan_id (a per-scan UUID) as a Prometheus label. Each scan creates new time series that are never cleaned up, causing unbounded memory growth in both the app and Prometheus. Consider tracking only the current scan (without the scan-ID label) and resetting the gauge on each new scan, or using a fixed-cardinality approach.

— Claude Code

const parsed = JSON.parse(taskMessage.message);
assert.strictEqual(parsed.contextInfo.conductorScanId, 'scan-id-test');
assert.strictEqual(parsed.contextInfo.scanStartTimestamp, 1706000000000);
});

Wrong arguments: _taskToMessage signature is (task, taskVersion, log) but this test calls it with 5 args: (task, taskVersion, scan-id-test, 1706000000000, log). The string scan-id-test is passed as log, so log.getSerializedUids() will throw a TypeError. The method reads conductorScanId from this._currentScanId, not from parameters. Fix: set conductor._currentScanId and conductor._currentScanStartTimestamp before calling _taskToMessage with 3 args.

— Claude Code

's3_lifecycle_bucket_processor_buckets_count',
'bucket_source', job='${job_lifecycle_bucket_processor}', namespace='${namespace}',
)


Metric name mismatches: Several metrics referenced here do not match what is defined in LifecycleMetrics.js:
- FULL_SCAN_ELAPSED references s3_lifecycle_conductor_full_scan_elapsed_seconds — does not exist in the app
- SCAN_COUNT references s3_lifecycle_conductor_scan_count — does not exist in the app
- BUCKET_PROCESSOR_BUCKETS_COUNT references s3_lifecycle_bucket_processor_buckets_count — the actual metric is s3_lifecycle_bucket_processor_scan_bucket_done_count

The dashboard JSON panels use the correct metric names (e.g. s3_lifecycle_latest_batch_end_time, s3_lifecycle_latest_batch_size), so the JSON and Python are out of sync.

— Claude Code

@claude

claude bot commented Apr 3, 2026

  • Unbounded Prometheus label cardinality on conductor_scan_id: bucketProcessorScanStartTime and bucketProcessorScanBucketDoneCount gauges use a per-scan UUID as a label, creating ever-growing time series that are never cleaned up
    • Remove the conductor_scan_id label and reset the gauges on each new scan, or cap cardinality another way
  • Test bug: _taskToMessage called with wrong arguments (tests/unit/lifecycle/LifecycleConductor.spec.js:329): passes 5 args to a 3-param method, causing log to be the string 'scan-id-test' which will throw on getSerializedUids()
    • Set conductor._currentScanId and conductor._currentScanStartTimestamp instead, then call with 3 args
  • Test bug: _createBucketTaskMessages stub has wrong arity (tests/unit/lifecycle/LifecycleConductor.spec.js:303): stub takes 5 params but method is called with 3, so cb is undefined
    • Change stub to .callsFake((_a, _b, cb) => {
  • dashboard.py metric name mismatches: FULL_SCAN_ELAPSED, SCAN_COUNT, and BUCKET_PROCESSOR_BUCKETS_COUNT reference metric names that do not exist in LifecycleMetrics.js, while the dashboard JSON panels use the correct names
    • Align the Python metric names with those defined in the application code

Review by Claude Code

@claude

claude bot commented Apr 3, 2026

LGTM

Label cardinality on conductor_scan_id is well managed with the MAX_TRACKED_BUCKET_PROCESSOR_SCANS=16 cap and .remove() cleanup. Scan ID propagation through the task pipeline is thorough, and the test coverage for the new metrics is solid.

Review by Claude Code

Replace class field syntax with a post-class assignment and update test label assertions to satisfy camelcase linting while keeping conductor_scan_id validation intact.
if (err) {
log.error('lifecycle batch failed', { error: err, unknownCanonicalIds });
log.error('lifecycle batch failed', {
error: err,

unknownCanonicalIds (from this._accountIdCache.getMisses()) was logged here in both the error and success paths before this PR. It has been removed entirely — this loses useful debugging context for production troubleshooting. Consider preserving it alongside the new fields.

— Claude Code
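The suggested restoration might look like this; the logger and cache shapes are stand-ins, not the actual Backbeat interfaces:

```javascript
// Keep unknownCanonicalIds in both the error and success log paths,
// alongside whatever new fields the PR introduces. The function and
// parameter names here are hypothetical.
function logBatchResult(log, err, accountIdCache, newFields) {
    const unknownCanonicalIds = accountIdCache.getMisses();
    const fields = { ...newFields, unknownCanonicalIds };
    if (err) {
        log.error('lifecycle batch failed', { error: err, ...fields });
    } else {
        log.info('lifecycle batch done', fields);
    }
    return fields;
}
```

Computing the misses once and spreading them into both branches makes it hard to drop the field again when new log attributes are added.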

@claude

claude bot commented Apr 3, 2026

  • unknownCanonicalIds logging was removed from the batch completion paths (both error and success) in LifecycleConductor.js. This loses production debugging context — consider restoring it alongside the new fields.

    Rest of the PR looks solid: metrics are well-bounded (MAX_TRACKED_BUCKET_PROCESSOR_SCANS = 16), error handling is consistent, scan ID propagation through the pipeline is thorough, and tests cover the new paths.

    Review by Claude Code
