feat(CLDSRV-884): Add OpenTelemetry tracing instrumentation#6140
feat(CLDSRV-884): Add OpenTelemetry tracing instrumentation#6140delthas wants to merge 4 commits intodevelopment/9.3from
Conversation
Hello delthas,My role is to assist you with the merge of this Available options
Available commands
Status report is not available. |
Request integration branchesWaiting for integration branch creation to be requested by the user. To request integration branches, please comment on this pull request with the following command: Alternatively, the |
❌ 2 Tests Failed:
View the top 2 failed test(s) by shortest run time
View the full list of 8 ❄️ flaky test(s)
To view more test analytics, go to the Test Analytics Dashboard |
|
9c93a1d to
97e63fc
Compare
| instrumentApi, | ||
| instrumentServices, | ||
| instrumentApiMethod, | ||
| }; |
There was a problem hiding this comment.
instrumentVault, instrumentStorage, instrumentMetadata, instrumentApi, and instrumentServices are exported but never imported anywhere else in the codebase. This is ~300 lines of dead code. If these are intended for future use, consider removing them from this PR and adding them when they are actually wired in.
— Claude Code
There was a problem hiding this comment.
Intentional for now. Must address before merging.
Review by Claude Code |
97e63fc to
f978280
Compare
|
|
||
| const callbackIndex = args.findLastIndex(arg => typeof arg === 'function'); | ||
|
|
||
| if (callbackIndex === -1) { |
There was a problem hiding this comment.
Unlike createInstrumentedProxy (which checks for .then on the return value), this no-callback path ends the span synchronously. If any API handler is migrated from callbacks to async/await and starts returning a promise, the span will end before the async work completes, producing misleading trace data. Add the same promise-detection logic as the proxy version for consistency.
— Claude Code
There was a problem hiding this comment.
Interesting. Need to check that back later.
|
f978280 to
06eea4e
Compare
| }, | ||
| }); | ||
|
|
||
| const callbackIndex = args.findLastIndex(arg => typeof arg === 'function'); |
There was a problem hiding this comment.
The callback-wrapping proxy intercepts the last function argument as the callback via findLastIndex. If the original method throws synchronously after already having invoked the callback, the span will be ended twice: once in the wrapped callback and once in the catch block. This is benign (OTEL ignores double end()) but worth documenting with a comment.
— Claude Code
| @@ -0,0 +1,486 @@ | |||
| 'use strict'; | |||
|
|
|||
| const { trace, context, SpanStatusCode, SpanKind } = require('@opentelemetry/api'); | |||
There was a problem hiding this comment.
@opentelemetry/api is required unconditionally at the top of this file (line 3) even when ENABLE_OTEL is false. While @opentelemetry/api is a lightweight no-op API package, it still adds module loading overhead on every startup. Consider lazy-requiring it inside the enableOtel guard, consistent with how lib/otel.js and lib/server.js handle conditional requires.
— Claude Code
| "@azure/storage-blob": "^12.28.0", | ||
| "@hapi/joi": "^17.1.1", | ||
| "@opentelemetry/api": "^1.9.0", | ||
| "@opentelemetry/auto-instrumentations-node": "^0.50.2", |
There was a problem hiding this comment.
The OTEL dependency versions use caret ranges (^1.9.0, ^0.50.2, ^0.55.0, etc.). OpenTelemetry packages have had breaking changes between minor versions in the 0.x range. Consider pinning the 0.x packages to exact versions or tighter ranges (e.g. ~0.55.0 instead of ^0.55.0) to avoid unexpected breakage on yarn install.
— Claude Code
|
Set up the NodeSDK with OTLP/HTTP trace exporter, gated behind ENABLE_OTEL=true. Uses 1% sampling by default to limit collector load. Auto-instruments HTTP, Express and MongoDB while disabling high-volume fs and Redis instrumentations. Issue: CLDSRV-884
Proxy-based auto-wrapping of vault, storage, metadata, api, and services with low-cardinality span names. Instruments both callback-based and promise-based methods transparently. Issue: CLDSRV-884
When an inbound request carries traceparent/tracestate headers, extract them via propagation.extract() and scope the entire request processing under that context. This connects cloudserver spans to upstream callers (e.g. NGINX, Beyla) as a single distributed trace. Issue: CLDSRV-884
Wrap all 70+ S3 API handler methods with instrumentApiMethod() from the Proxy-based instrumentation module. Each handler gets a dedicated span with low-cardinality name and operation type attributes when ENABLE_OTEL=true; zero overhead when disabled. Issue: CLDSRV-884
06eea4e to
75c3afb
Compare
| instrumentApi, | ||
| instrumentServices, | ||
| instrumentApiMethod, | ||
| }; |
There was a problem hiding this comment.
instrumentVault, instrumentStorage, instrumentMetadata, instrumentApi, and instrumentServices are exported but never imported or used anywhere in this PR. This is dead code — either wire them into the server startup (e.g. wrapping vault/metadata/storage instances) or remove them until they are actually needed.
— Claude Code
| @@ -0,0 +1,486 @@ | |||
| 'use strict'; | |||
|
|
|||
| const { trace, context, SpanStatusCode, SpanKind } = require('@opentelemetry/api'); | |||
There was a problem hiding this comment.
@opentelemetry/api is unconditionally require()d at the top of this file (line 3), even when ENABLE_OTEL is false. This means every cloudserver process loads the OTEL API module regardless of the feature flag. Since the PR description states zero overhead when OTEL is disabled, this import should be gated behind the enableOtel check, similar to how lib/otel.js and lib/server.js conditionally require it.
— Claude Code
| "nan": "v2.22.0", | ||
| "fast-xml-parser": "^5.5.6" | ||
| "fast-xml-parser": "^5.5.6", | ||
| "@types/pg": "8.6.1" |
There was a problem hiding this comment.
The @types/pg resolution is unrelated to the OTEL instrumentation feature. If it fixes a transitive dependency issue, it should be in its own commit with an explanation of why it is needed, or removed from this PR.
— Claude Code
|
|
||
| sdk.start(); | ||
|
|
||
| const shutdown = () => sdk.shutdown().catch(() => {}); |
There was a problem hiding this comment.
The shutdown handler silently swallows errors with .catch(() => {}). At minimum, log the error so failed OTEL shutdowns (e.g. exporter flush failures) are visible in diagnostics.
— Claude Code
| return originalValue; | ||
| } | ||
|
|
||
| return function (...args) { |
There was a problem hiding this comment.
The Proxy get trap returns a new wrapper function on every property access. This means obj.method !== obj.method, which breaks any code that compares function references or uses them as map keys. Consider caching the wrapped functions (e.g. in a WeakMap or a plain object keyed by prop) so the same wrapper is returned for repeated accesses of the same method.
— Claude Code
Review by Claude Code |
Not human-reviewed yet. Not asking for reviews at the moment!
Summary
Add OpenTelemetry tracing instrumentation to cloudserver, gated behind
ENABLE_OTEL=true.lib/otel.js): OTLP/HTTP trace exporter, 1% default sampling ratio, auto-instrumentation for HTTP, Express, and MongoDBlib/instrumentation/simple.js): auto-wrapping of vault, storage, metadata, api, and services components with low-cardinality span nameslib/server.js): extracts inboundtraceparent/tracestateheaders and scopes request processing under the remote contextlib/api/api.js): all 70+ S3 API handlers wrapped withinstrumentApiMethod()for per-operation spansWhen
ENABLE_OTELis not set, there is zero overhead — the OTEL SDK is not loaded,@opentelemetry/apiis not required inserver.js, andinstrumentApiMethod()returns the original function unchanged.Origin of changes
Extracted and cleaned up from William Lardier's
user/test/wlardier/servicemesh-2branch (based ondevelopment/9.0, July–Sep 2025). The original branch mixed OTEL instrumentation with performance optimizations and debug artifacts. This PR contains only the OTEL instrumentation, rebased ontodevelopment/9.3.What was removed vs the source branch:
lib/otelContextPropagation.js(manualglobal.currentTraceHeadershack, never imported)console.log()statements, commented-out span codeMOCK_DOAUTH/lib/api/apiUtils/mock/backendMocks.js(caches first real auth result — dangerous in production, unrelated to OTEL)GC_INTERVAL_MS/ manual GC,monitorLatency(),releaseRequestContexts()pooling, arsenal perf pin — these will go in a separate PR8.3.8(9.3's version) instead of William's perf-pinned commit20a5fdc2What was fixed during review:
@opentelemetry/sdk-trace-baseversion aligned to^1.28.0(compatible withsdk-node ^0.55.0)OTEL_SAMPLING_RATIO=0now correctly disables sampling (was falling back to 1% due to||vs??)ENABLE_OTELcheck (was running unconditionally)context.with()for proper parent-child span linkingSemanticResourceAttributesreplaced with string literalsservice.versionreads frompackage.jsoninstead of hardcoded valueContext
@opentelemetry/instrumentation-httpauto-instrumentation should automatically injecttraceparentinto outgoing hdclient HTTP calls, connecting cloudserver traces to hdcontroller/hyperiod — this has never been tested end-to-end yetVerification
ENABLE_OTEL=trueand an OTEL collector endpoint-tflag) and hyperiod (enable_tracer: true) — verify traces flow end-to-end as a single distributed trace