Conversation
Hello benzekrimaha, my role is to assist you with the merge of this pull request.
Status report is not available.
Force-pushed 380069a to 25ea9d5
Codecov Report ❌ — patch coverage, additional details and impacted files (... and 5 files with indirect coverage changes):
```
@@             Coverage Diff              @@
##        development/9.3    #2723    +/- ##
===================================================
- Coverage         74.48%   74.45%   -0.04%
===================================================
  Files               200      200
  Lines             13603    13680      +77
===================================================
+ Hits              10132    10185      +53
- Misses             3461     3485      +24
  Partials             10       10
```
Flags with carried forward coverage won't be shown.
Force-pushed 8316f88 to 408c96c
Force-pushed 408c96c to e1c5b13
Force-pushed e1c5b13 to aefb677
```diff
  const log = this.logger.newRequestLogger();
- const start = new Date();
+ const start = Date.now();
  this._scanId = uuid();
```
Hmm, we're storing the scan ID as a "global" field variable, but it sounds like it is really relevant/used only inside this function (through indirect calls). Could we drop the global field and instead pass it through to whatever uses it? Maybe in _createBucketTaskMessages?
extensions/lifecycle/bucketProcessor/LifecycleBucketProcessor.js (outdated, resolved)
Force-pushed 725c3df to 11a94ea
extensions/lifecycle/bucketProcessor/LifecycleBucketProcessor.js (outdated, resolved)
extensions/lifecycle/bucketProcessor/LifecycleBucketProcessor.js (outdated, resolved)
Force-pushed a2128cf to a464b39
extensions/lifecycle/bucketProcessor/LifecycleBucketProcessor.js (outdated, resolved)
extensions/lifecycle/bucketProcessor/LifecycleBucketProcessor.js (outdated, resolved)
- Clear `_scanEndTimeout` in `close()` to prevent the callback firing on a partially torn-down bucket processor instance
- Propagate the conductor's start timestamp (`Date.now()` at scan start) via `contextInfo.scanStartTimestamp` instead of using `Date.now()` in the bucket processor, so the scan progress dashboard accurately reflects the full scan duration
- Document that `bucketProcessorBucketsCount.reset()` clears all label combinations (acceptable: single bucket source per deployment)
- Update tests and the `BackbeatTestConsumer` normalizer for the new `scanStartTimestamp` field
LGTM
```js
    });
    LifecycleMetrics.onBucketProcessorScanEnd(this._log);
    this._processorScanMetricsActive = false;
}, SCAN_END_DEBOUNCE_DELAY_MS);
```
"Debouncing" is a flaky indicator: a new scan may start very quickly, but still have a different scan...
The information is already computed through `conductorScanId`, so this is what we should rely on.
Additionally, a plain gauge may be a bit hard to use; there may be a better approach, like a gauge whose value is the "number" of scans (increased on each new scan), or one where the scanId is a label (so we get a different series each time).
There was a problem hiding this comment.
also there may be a longer delay between messages..... especially in case of loaded system, where circuit breaker would kick in and delay processing.
→ cannot use a timeout.
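A minimal sketch of the scan-ID-based alternative the review suggests, with illustrative names (the hook names and `ScanTracker` class are invented for this example): a change in `conductorScanId` marks the scan boundary, regardless of how much time elapses between messages.

```javascript
// Hypothetical sketch: detect scan boundaries from conductorScanId instead
// of a debounce timeout. Names are illustrative, not the Backbeat API.
class ScanTracker {
    constructor() {
        this._metricsScanId = null;
    }

    // Called for every consumed message. A new conductorScanId ends the
    // previous scan and starts a new one, no matter how long the gap was
    // (e.g. when a circuit breaker delayed processing).
    observe(conductorScanId, { onScanStart, onScanEnd }) {
        if (!conductorScanId || conductorScanId === this._metricsScanId) {
            return;
        }
        if (this._metricsScanId !== null) {
            onScanEnd(this._metricsScanId);
        }
        this._metricsScanId = conductorScanId;
        onScanStart(conductorScanId);
    }
}

const events = [];
const tracker = new ScanTracker();
const hooks = {
    onScanStart: id => events.push(`start:${id}`),
    onScanEnd: id => events.push(`end:${id}`),
};
tracker.observe('scan-1', hooks);
tracker.observe('scan-1', hooks); // same scan: no boundary
tracker.observe('scan-2', hooks); // new scan: close scan-1, open scan-2
console.log(events.join(',')); // start:scan-1,end:scan-1,start:scan-2
```

Unlike a timeout, this degrades predictably under load: a slow message simply extends the current scan until a message with a new ID arrives.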
```js
const conductorScanId = result.contextInfo && result.contextInfo.conductorScanId;
const scanStartTimestamp = (result.contextInfo
    && result.contextInfo.scanStartTimestamp) || Date.now();
if (conductorScanId && conductorScanId !== this._metricsScanId) {
```
Is this ordering guaranteed?
Since there are multiple (asynchronous) consumers, each producing "continuation" messages which are pushed to any partition (round-robin): are we sure a new "start of scan" can only happen after the current scan has completed?
There is a mechanism that is supposed to delay starting a new scan until the previous one has completed, but it relies on an asynchronous mechanism (delayed publishing of the offset in ZooKeeper), so I am not sure how reliable it is in corner cases (e.g. the next attempt happening at exactly the wrong time)?
--
Metrics should be designed to work all the time, and especially when things go wrong: e.g. if 2 scans run in parallel. How do these metrics degrade when things go wrong? Can we keep observability in that case?
(For example, the scan starts in the conductor: so the conductor could push one metric with the scanId as a label, and all other pods publish a metric with the same label. Then we could just look at the time series and see where a scan starts and ends, and still be able to observe when multiple scans happen in parallel.)
```js
// This ensures we count actual processing completion, not just dispatch
// Only count successful completions, not failures
if (!err) {
    const bucketSource = result.contextInfo && result.contextInfo.bucketSource;
```
Suggested change:
```diff
- const bucketSource = result.contextInfo && result.contextInfo.bucketSource;
+ const bucketSource = result.contextInfo?.bucketSource;
```
```js
if (!err) {
    const bucketSource = result.contextInfo && result.contextInfo.bucketSource;
    LifecycleMetrics.onBucketProcessorBucketDone(
        this._log, bucketSource, conductorScanId
```
The bucket name should not be in the metrics: the cardinality is too high (and bucket names have no meaning for the admins who use the monitoring).
```js
conductorScanId: scanId,
scanStartTimestamp,
bucketSource: this._bucketSource,
```
Instead of passing these through params, should we pass them as part of the task? Or as a class variable, like the (pre-existing) `this._batchInProgress`?
It would greatly simplify passing data through the different layers.
```js
log, err, 'LifecycleMetrics.onConductorFullScan', {
    elapsedMs, bucketCount, workflowCount,
    lifecycleBucketCount, bucketSource,
}
```
All these parameters are not needed by the function, only for a log in case of error, which should never happen anyway (the only cause would be wrong labels or invalid values being passed): so best keep it simple, with only the parameters the function needs, and log the error together with the actual parameters that were passed.
```js
}

/**
 * Signal that a bucket processor has started working on a scan.
```
this is not doing any signaling, just updating metrics
```js
const conductorScanCount = ZenkoMetrics.createGauge({
    name: 's3_lifecycle_conductor_scan_count',
    help: 'Number of buckets and queued workflows for the latest full lifecycle conductor scan',
```
buckets or workflows?
```js
});

const conductorScanCount = ZenkoMetrics.createGauge({
    name: 's3_lifecycle_conductor_scan_count',
```
- A "scan count" would be the number of scans, i.e. incremented on every new scan. That is not what this metric is: it is the number of buckets scanned.
- According to the best practices for naming metrics, `count` is mostly used for Counters, so the name needs to be more precise, e.g. with an additional "current" segment.
- This is not a "gauge" in the usual sense (tracking data in near real time): it just indicates the total/final number of scanned buckets in the last scan.

So something like `s3_lifecycle_conductor_last_scan_buckets_count`, or better yet `s3_lifecycle_latest_batch_size` (which is consistent with the existing metrics).
```js
const bucketProcessorBucketsCount = ZenkoMetrics.createGauge({
    name: 's3_lifecycle_bucket_processor_buckets_count',
    help: 'Number of buckets successfully processed by '
        + 'this bucket processor during the current scan',
```
Similarly, this is not the number of buckets processed during the scan, but the total number at the end of the scan → the name is not good.
More fundamentally, this metric is only updated at the end of the scan: so until the scan ends, we have no idea what is going on. Better to have dynamic metrics, so we can observe the behavior in every situation.
```js
name: 's3_lifecycle_bucket_processor_scan_bucket_done_count',
help: 'Number of buckets completed by this bucket processor for a scan',
labelNames: [LIFECYCLE_LABEL_ORIGIN, LIFECYCLE_LABEL_CONDUCTOR_SCAN_ID],
});
```
Unbounded label cardinality on conductor_scan_id.
bucketProcessorScanStartTime and bucketProcessorScanBucketDoneCount use conductor_scan_id (a new UUID every scan) as a label. Since old label values are never removed from the prom-client registry, the number of time series grows unboundedly with each scan cycle. This will cause increasing memory usage in the bucket processor over time.
Consider either: (1) calling .remove() on old label sets when a new scan starts, or (2) dropping the conductor_scan_id label entirely and resetting the gauge value on each new scan instead.
— Claude Code
```js
const parsed = JSON.parse(taskMessage.message);
assert.strictEqual(parsed.contextInfo.conductorScanId, 'scan-id-test');
assert.strictEqual(parsed.contextInfo.scanStartTimestamp, 1706000000000);
});
```
Test passes extra args that _taskToMessage ignores — assertions will fail.
The test calls `conductor._taskToMessage(getTask(true), lifecycleTaskVersions.v2, 'scan-id-test', 1706000000000, log)` with 5 args, but `_taskToMessage(task, taskVersion, log)` only accepts 3 params and reads `this._currentScanId` / `this._currentScanStartTimestamp` from instance state. Since neither is set in `beforeEach`, both will be null, and the `strictEqual` assertions will fail.
Fix: set conductor._currentScanId and conductor._currentScanStartTimestamp before calling _taskToMessage, and pass only 3 args (task, taskVersion, log).
— Claude Code
```diff
  });
  sinon.stub(conductor, '_createBucketTaskMessages')
-     .callsFake((_a, _b, cb) => {
+     .callsFake((_a, _b, _c, _d, cb) => {
```
Stub arity mismatch — cb will be undefined.
This stub expects 5 params (_a, _b, _c, _d, cb) but _createBucketTaskMessages(tasks, log, cb) is called with only 3 args. So _c receives the callback, _d and cb are undefined, and cb(null, []) on line 306 will throw a TypeError. This should be .callsFake((_a, _b, cb) => ...).
— Claude Code
monitoring/lifecycle/dashboard.py (outdated)
```python
BUCKET_PROCESSOR_BUCKETS_COUNT = metrics.Metric(
    's3_lifecycle_bucket_processor_buckets_count',
    'bucket_source', job='${job_lifecycle_bucket_processor}', namespace='${namespace}',
```
Metric names in dashboard.py do not match LifecycleMetrics.js.
FULL_SCAN_ELAPSED references s3_lifecycle_conductor_full_scan_elapsed_seconds — this metric does not exist. The actual metrics are s3_lifecycle_latest_batch_start_time and s3_lifecycle_latest_batch_end_time.
SCAN_COUNT references s3_lifecycle_conductor_scan_count — does not exist. The actual metric is s3_lifecycle_latest_batch_size.
BUCKET_PROCESSOR_BUCKETS_COUNT references s3_lifecycle_bucket_processor_buckets_count — does not exist. The actual metric is s3_lifecycle_bucket_processor_scan_bucket_done_count.
The dashboard.json uses the correct names. These mismatches will produce empty Grafana panels when the dashboard is generated from this Python source.
— Claude Code
Review by Claude Code
Replace timeout-derived lifecycle scan gauges with scan-id keyed series and align conductor metrics around latest batch start/end and size so dashboards remain reliable during delays and overlapping scans.
Drive bucket processor scan tracking from conductorScanId boundaries and simplify conductor scan context propagation/logging so observability remains correct even with delayed or overlapping message flows.
Use direct logger default-field attachment and include conductorScanId in logged task attributes across lifecycle tasks to keep request traces consistent through the pipeline.
Force-pushed 6471e77 to a9ab181
```js
const bucketProcessorScanStartTime = ZenkoMetrics.createGauge({
    name: 's3_lifecycle_bucket_processor_scan_start_time',
    help: 'Timestamp (ms) when this bucket processor first saw a scan',
    labelNames: [LIFECYCLE_LABEL_ORIGIN, LIFECYCLE_LABEL_CONDUCTOR_SCAN_ID],
```
Unbounded label cardinality: bucketProcessorScanStartTime and bucketProcessorScanBucketDoneCount use conductor_scan_id (a per-scan UUID) as a Prometheus label. Each scan creates new time series that are never cleaned up, causing unbounded memory growth in both the app and Prometheus. Consider tracking only the current scan (without the scan-ID label) and resetting the gauge on each new scan, or using a fixed-cardinality approach.
— Claude Code
```js
const parsed = JSON.parse(taskMessage.message);
assert.strictEqual(parsed.contextInfo.conductorScanId, 'scan-id-test');
assert.strictEqual(parsed.contextInfo.scanStartTimestamp, 1706000000000);
});
```
Wrong arguments: `_taskToMessage`'s signature is `(task, taskVersion, log)` but this test calls it with 5 args: `(task, taskVersion, 'scan-id-test', 1706000000000, log)`. The string `'scan-id-test'` is passed as `log`, so `log.getSerializedUids()` will throw a TypeError. The method reads `conductorScanId` from `this._currentScanId`, not from parameters. Fix: set `conductor._currentScanId` and `conductor._currentScanStartTimestamp` before calling `_taskToMessage` with 3 args.
— Claude Code
```python
    's3_lifecycle_bucket_processor_buckets_count',
    'bucket_source', job='${job_lifecycle_bucket_processor}', namespace='${namespace}',
)
```
Metric name mismatches: Several metrics referenced here do not match what is defined in LifecycleMetrics.js:
- FULL_SCAN_ELAPSED references s3_lifecycle_conductor_full_scan_elapsed_seconds — does not exist in the app
- SCAN_COUNT references s3_lifecycle_conductor_scan_count — does not exist in the app
- BUCKET_PROCESSOR_BUCKETS_COUNT references s3_lifecycle_bucket_processor_buckets_count — the actual metric is s3_lifecycle_bucket_processor_scan_bucket_done_count
The dashboard JSON panels use the correct metric names (e.g. s3_lifecycle_latest_batch_end_time, s3_lifecycle_latest_batch_size), so the JSON and Python are out of sync.
— Claude Code
Review by Claude Code
LGTM
Replace class field syntax with a post-class assignment and update test label assertions to satisfy camelcase linting while keeping conductor_scan_id validation intact.
```diff
  if (err) {
-     log.error('lifecycle batch failed', { error: err, unknownCanonicalIds });
+     log.error('lifecycle batch failed', {
+         error: err,
```
unknownCanonicalIds (from this._accountIdCache.getMisses()) was logged here in both the error and success paths before this PR. It has been removed entirely — this loses useful debugging context for production troubleshooting. Consider preserving it alongside the new fields.
— Claude Code
Issue: BB-740