Skip to content

feat(CLDSRV-884): Add OpenTelemetry tracing instrumentation#6140

Draft
delthas wants to merge 4 commits intodevelopment/9.3from
improvement/CLDSRV-884/otel-instrumentation
Draft

feat(CLDSRV-884): Add OpenTelemetry tracing instrumentation#6140
delthas wants to merge 4 commits intodevelopment/9.3from
improvement/CLDSRV-884/otel-instrumentation

Conversation

@delthas
Copy link
Copy Markdown
Contributor

@delthas delthas commented Apr 2, 2026

Not human-reviewed yet. Not asking for reviews at the moment!

Summary

Add OpenTelemetry tracing instrumentation to cloudserver, gated behind ENABLE_OTEL=true.

  • NodeSDK setup (lib/otel.js): OTLP/HTTP trace exporter, 1% default sampling ratio, auto-instrumentation for HTTP, Express, and MongoDB
  • Proxy-based instrumentation (lib/instrumentation/simple.js): auto-wrapping of vault, storage, metadata, api, and services components with low-cardinality span names
  • W3C trace context propagation (lib/server.js): extracts inbound traceparent/tracestate headers and scopes request processing under the remote context
  • API handler instrumentation (lib/api/api.js): all 70+ S3 API handlers wrapped with instrumentApiMethod() for per-operation spans

When ENABLE_OTEL is not set, there is zero overhead — the OTEL SDK is not loaded, @opentelemetry/api is not required in server.js, and instrumentApiMethod() returns the original function unchanged.

Origin of changes

Extracted and cleaned up from William Lardier's user/test/wlardier/servicemesh-2 branch (based on development/9.0, July–Sep 2025). The original branch mixed OTEL instrumentation with performance optimizations and debug artifacts. This PR contains only the OTEL instrumentation, rebased onto development/9.3.

What was removed vs the source branch:

  • Dead code: lib/otelContextPropagation.js (manual global.currentTraceHeaders hack, never imported)
  • Debug artifacts: ~15 console.log() statements, commented-out span code
  • Mock feature: MOCK_DOAUTH / lib/api/apiUtils/mock/backendMocks.js (caches first real auth result — dangerous in production, unrelated to OTEL)
  • Performance optimizations: GC_INTERVAL_MS / manual GC, monitorLatency(), releaseRequestContexts() pooling, arsenal perf pin — these will go in a separate PR
  • Arsenal dependency reverted to standard 8.3.8 (9.3's version) instead of William's perf-pinned commit 20a5fdc2

What was fixed during review:

  • @opentelemetry/sdk-trace-base version aligned to ^1.28.0 (compatible with sdk-node ^0.55.0)
  • OTEL_SAMPLING_RATIO=0 now correctly disables sampling (was falling back to 1% due to || vs ??)
  • Trace context extraction gated behind ENABLE_OTEL check (was running unconditionally)
  • No-callback code paths now wrapped in context.with() for proper parent-child span linking
  • Deprecated SemanticResourceAttributes replaced with string literals
  • service.version reads from package.json instead of hardcoded value
  • Redis-4 auto-instrumentation explicitly disabled

Context

  • Jira: CLDSRV-884
  • Parent investigation: OS-1072
  • Scality ADR mandates OpenTelemetry across all products
  • The storage layer (hdcontroller 1.12+ / hyperiod) already has full OTEL
  • The @opentelemetry/instrumentation-http auto-instrumentation should automatically inject traceparent into outgoing hdclient HTTP calls, connecting cloudserver traces to hdcontroller/hyperiod — this has never been tested end-to-end yet

Verification

  1. Deploy with ENABLE_OTEL=true and an OTEL collector endpoint
  2. Send S3 requests → check traces appear in Jaeger/Tempo
  3. Critical: deploy alongside hdcontroller 1.12+ (-t flag) and hyperiod (enable_tracer: true) — verify traces flow end-to-end as a single distributed trace

@bert-e
Copy link
Copy Markdown
Contributor

bert-e commented Apr 2, 2026

Hello delthas,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options
name description privileged authored
/after_pull_request Wait for the given pull request id to be merged before continuing with the current one.
/bypass_author_approval Bypass the pull request author's approval
/bypass_build_status Bypass the build and test status
/bypass_commit_size Bypass the check on the size of the changeset TBA
/bypass_incompatible_branch Bypass the check on the source branch prefix
/bypass_jira_check Bypass the Jira issue check
/bypass_peer_approval Bypass the pull request peers' approval
/bypass_leader_approval Bypass the pull request leaders' approval
/approve Instruct Bert-E that the author has approved the pull request. ✍️
/create_pull_requests Allow the creation of integration pull requests.
/create_integration_branches Allow the creation of integration branches.
/no_octopus Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead
/unanimity Change review acceptance criteria from one reviewer at least to all reviewers
/wait Instruct Bert-E not to run until further notice.
Available commands
name description privileged
/help Print Bert-E's manual in the pull request.
/status Print Bert-E's current status in the pull request TBA
/clear Remove all comments from Bert-E from the history TBA
/retry Re-start a fresh build TBA
/build Re-start a fresh build TBA
/force_reset Delete integration branches & pull requests, and restart merge process from the beginning.
/reset Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

@bert-e
Copy link
Copy Markdown
Contributor

bert-e commented Apr 2, 2026

Request integration branches

Waiting for integration branch creation to be requested by the user.

To request integration branches, please comment on this pull request with the following command:

/create_integration_branches

Alternatively, the /approve and /create_pull_requests commands will automatically
create the integration branches.

@codecov
Copy link
Copy Markdown

codecov bot commented Apr 2, 2026

❌ 2 Tests Failed:

Tests completed Failed Passed Skipped
8259 2 8257 0
View the top 2 failed test(s) by shortest run time
"after all" hook for "should list all versions"::aws-node-sdk test bucket versioning listing "after all" hook for "should list all versions"
Stack Traces | 0.005s run time
The bucket you tried to delete is not empty.
"after all" hook for "should create a bunch of objects and their versions"::aws-node-sdk test bucket versioning "after all" hook for "should create a bunch of objects and their versions"
Stack Traces | 0.078s run time
The bucket you tried to delete is not empty.
View the full list of 8 ❄️ flaky test(s)
should create a bunch of objects and their versions::aws-node-sdk test bucket versioning should create a bunch of objects and their versions

Flake rate in main: 13.81% (Passed 668 times, Failed 107 times)

Stack Traces | 0.175s run time
invalid versionId
should create a new version for an object::aws-node-sdk test bucket versioning should create a new version for an object

Flake rate in main: 13.81% (Passed 668 times, Failed 107 times)

Stack Traces | 0.015s run time
Indicates that the version ID specified in the request does not match an existing version.
should create delete marker and keep the null version::aws-node-sdk test bucket versioning should create delete marker and keep the null version

Flake rate in main: 13.81% (Passed 668 times, Failed 107 times)

Stack Traces | 0.003s run time
Indicates that the version ID specified in the request does not match an existing version.
should create new versions but still keep nullVersionId::aws-node-sdk test bucket versioning should create new versions but still keep nullVersionId

Flake rate in main: 13.81% (Passed 668 times, Failed 107 times)

Stack Traces | 0.012s run time
Indicates that the version ID specified in the request does not match an existing version.
should delete latest version and get the next version::aws-node-sdk test bucket versioning should delete latest version and get the next version

Flake rate in main: 13.81% (Passed 668 times, Failed 107 times)

Stack Traces | 600s run time
Timeout of 600000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (.../test/versioning/versioningGeneral2.js)
should enable versioning and preserve the null version::aws-node-sdk test bucket versioning should enable versioning and preserve the null version

Flake rate in main: 13.81% (Passed 668 times, Failed 107 times)

Stack Traces | 0.003s run time
Indicates that the version ID specified in the request does not match an existing version.
should retrieve an empty versioning configuration::aws-node-sdk test bucket versioning should retrieve an empty versioning configuration

Flake rate in main: 16.67% (Passed 25 times, Failed 5 times)

Stack Traces | 0.005s run time
Expected values to be strictly deep-equal:
+ actual - expected

+ {
+   Status: 'Enabled'
+ }
- {}
should update null version in versioning suspended bucket::aws-node-sdk test bucket versioning should update null version in versioning suspended bucket

Flake rate in main: 13.81% (Passed 668 times, Failed 107 times)

Stack Traces | 0.003s run time
Indicates that the version ID specified in the request does not match an existing version.

To view more test analytics, go to the Test Analytics Dashboard
📋 Got 3 mins? Take this short survey to help us improve Test Analytics.

@claude
Copy link
Copy Markdown

claude bot commented Apr 2, 2026

  • parseFloat(OTEL_SAMPLING_RATIO) can produce NaN for non-numeric input, which may cause TraceIdRatioBasedSampler to misbehave
    - Add a Number.isNaN guard with fallback to 0.01
    - args.findIndex picks the first function argument as the callback, but callbacks are always the last argument in this codebase
    - Use findLastIndex in both createInstrumentedProxy and instrumentApiMethod
    - In the callback-wrapping branch of both createInstrumentedProxy and instrumentApiMethod, a synchronous throw from the wrapped function leaks the span (never ended)
    - Wrap the apply call in try/catch to end the span
    - instrumentVault, instrumentStorage, instrumentMetadata, instrumentApi, instrumentServices are exported but never used anywhere in this PR
    - Remove them or wire them in

    Review by Claude Code

@delthas delthas force-pushed the improvement/CLDSRV-884/otel-instrumentation branch from 9c93a1d to 97e63fc Compare April 2, 2026 16:32
instrumentApi,
instrumentServices,
instrumentApiMethod,
};
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instrumentVault, instrumentStorage, instrumentMetadata, instrumentApi, and instrumentServices are exported but never imported anywhere else in the codebase. This is ~300 lines of dead code. If these are intended for future use, consider removing them from this PR and adding them when they are actually wired in.

— Claude Code

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentional for now. Must address before merging.

@claude
Copy link
Copy Markdown

claude bot commented Apr 2, 2026

  • No graceful OTEL SDK shutdown — buffered traces lost on process exit (lib/otel.js:53)
    • Register SIGTERM/SIGINT handlers calling sdk.shutdown()
  • OTEL_SAMPLING_RATIO parsed with parseFloat but NaN not guarded (lib/otel.js:35)
    • Add Number.isFinite() check with fallback to 0.01
  • ~300 lines of dead code: instrumentVault, instrumentStorage, instrumentMetadata, instrumentApi, instrumentServices exported but never imported (lib/instrumentation/simple.js)
    • Remove until wired in, or add consumers in this PR
  • Version mismatch: auto-instrumentations-node@^0.50.2 bundles older versions of instrumentation-http/express/mongodb than the direct deps — risk of duplicate instrumentation (package.json)
    • Either bump auto-instrumentations-node or remove the individual direct deps

Review by Claude Code

@delthas delthas force-pushed the improvement/CLDSRV-884/otel-instrumentation branch from 97e63fc to f978280 Compare April 2, 2026 16:35

const callbackIndex = args.findLastIndex(arg => typeof arg === 'function');

if (callbackIndex === -1) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlike createInstrumentedProxy (which checks for .then on the return value), this no-callback path ends the span synchronously. If any API handler is migrated from callbacks to async/await and starts returning a promise, the span will end before the async work completes, producing misleading trace data. Add the same promise-detection logic as the proxy version for consistency.

— Claude Code

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Need to check that back later.

@claude
Copy link
Copy Markdown

claude bot commented Apr 2, 2026

  • lib/otel.js:51 — No graceful shutdown hook for the OTEL SDK; pending traces will be lost on SIGTERM/SIGINT
    - Add process.on('SIGTERM'/'SIGINT', () => sdk.shutdown())
    - lib/instrumentation/simple.js:396 — instrumentApiMethod no-callback path ends spans synchronously, unlike createInstrumentedProxy which handles promises
    - Add the same .then() check for promise-returning methods
    - lib/instrumentation/simple.js:23 — Proxy get trap does not guard against Symbol-typed prop, which would crash getSpanName()
    - Add: if (typeof prop === 'symbol') return originalValue
    - lib/instrumentation/simple.js:470 — instrumentVault, instrumentStorage, instrumentMetadata, instrumentApi, instrumentServices are exported but never used (~300 lines of dead code)
    - Remove and add when actually wired in

    Review by Claude Code

@delthas delthas force-pushed the improvement/CLDSRV-884/otel-instrumentation branch from f978280 to 06eea4e Compare April 2, 2026 16:49
},
});

const callbackIndex = args.findLastIndex(arg => typeof arg === 'function');
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The callback-wrapping proxy intercepts the last function argument as the callback via findLastIndex. If the original method throws synchronously after already having invoked the callback, the span will be ended twice: once in the wrapped callback and once in the catch block. This is benign (OTEL ignores double end()) but worth documenting with a comment.

— Claude Code

@@ -0,0 +1,486 @@
'use strict';

const { trace, context, SpanStatusCode, SpanKind } = require('@opentelemetry/api');
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@opentelemetry/api is required unconditionally at the top of this file (line 3) even when ENABLE_OTEL is false. While @opentelemetry/api is a lightweight no-op API package, it still adds module loading overhead on every startup. Consider lazy-requiring it inside the enableOtel guard, consistent with how lib/otel.js and lib/server.js handle conditional requires.

— Claude Code

"@azure/storage-blob": "^12.28.0",
"@hapi/joi": "^17.1.1",
"@opentelemetry/api": "^1.9.0",
"@opentelemetry/auto-instrumentations-node": "^0.50.2",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The OTEL dependency versions use caret ranges (^1.9.0, ^0.50.2, ^0.55.0, etc.). OpenTelemetry packages have had breaking changes between minor versions in the 0.x range. Consider pinning the 0.x packages to exact versions or tighter ranges (e.g. ~0.55.0 instead of ^0.55.0) to avoid unexpected breakage on yarn install.

— Claude Code

@claude
Copy link
Copy Markdown

claude bot commented Apr 2, 2026

  • Dead code: instrumentVault, instrumentStorage, instrumentMetadata, instrumentApi, and instrumentServices are exported from lib/instrumentation/simple.js but never imported anywhere. Only instrumentApiMethod is used. Remove them or wire them up.
    • Remove unused exports and the ~300 lines of proxy-wrapping code for vault/storage/metadata/api/services, or add the call sites that use them
  • Unconditional require: @opentelemetry/api is required at the top of lib/instrumentation/simple.js even when ENABLE_OTEL=false. This adds module loading overhead when OTEL is disabled.
    • Lazy-require it inside the enableOtel guard, consistent with lib/otel.js and lib/server.js
  • Loose version ranges on 0.x OTEL packages: ^0.50.2 and ^0.55.0 allow semver-minor bumps which can be breaking in 0.x.
    • Pin to exact versions or use tilde ranges (~0.55.0)
  • Double span.end() possible in callback path: If a method throws synchronously after having already called its callback, both the wrapped callback and the catch block will call span.end(). Benign but worth a comment.


Review by Claude Code

delthas added 4 commits April 2, 2026 19:02
Set up the NodeSDK with OTLP/HTTP trace exporter, gated behind
ENABLE_OTEL=true. Uses 1% sampling by default to limit collector
load. Auto-instruments HTTP, Express and MongoDB while disabling
high-volume fs and Redis instrumentations.

Issue: CLDSRV-884
Proxy-based auto-wrapping of vault, storage, metadata, api, and
services with low-cardinality span names. Instruments both
callback-based and promise-based methods transparently.

Issue: CLDSRV-884
When an inbound request carries traceparent/tracestate headers,
extract them via propagation.extract() and scope the entire
request processing under that context. This connects cloudserver
spans to upstream callers (e.g. NGINX, Beyla) as a single
distributed trace.

Issue: CLDSRV-884
Wrap all 70+ S3 API handler methods with instrumentApiMethod()
from the Proxy-based instrumentation module. Each handler gets
a dedicated span with low-cardinality name and operation type
attributes when ENABLE_OTEL=true; zero overhead when disabled.

Issue: CLDSRV-884
@delthas delthas force-pushed the improvement/CLDSRV-884/otel-instrumentation branch from 06eea4e to 75c3afb Compare April 2, 2026 17:02
instrumentApi,
instrumentServices,
instrumentApiMethod,
};
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instrumentVault, instrumentStorage, instrumentMetadata, instrumentApi, and instrumentServices are exported but never imported or used anywhere in this PR. This is dead code — either wire them into the server startup (e.g. wrapping vault/metadata/storage instances) or remove them until they are actually needed.

— Claude Code

@@ -0,0 +1,486 @@
'use strict';

const { trace, context, SpanStatusCode, SpanKind } = require('@opentelemetry/api');
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@opentelemetry/api is unconditionally require()d at the top of this file (line 3), even when ENABLE_OTEL is false. This means every cloudserver process loads the OTEL API module regardless of the feature flag. Since the PR description states zero overhead when OTEL is disabled, this import should be gated behind the enableOtel check, similar to how lib/otel.js and lib/server.js conditionally require it.

— Claude Code

"nan": "v2.22.0",
"fast-xml-parser": "^5.5.6"
"fast-xml-parser": "^5.5.6",
"@types/pg": "8.6.1"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The @types/pg resolution is unrelated to the OTEL instrumentation feature. If it fixes a transitive dependency issue, it should be in its own commit with an explanation of why it is needed, or removed from this PR.

— Claude Code


sdk.start();

const shutdown = () => sdk.shutdown().catch(() => {});
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The shutdown handler silently swallows errors with .catch(() => {}). At minimum, log the error so failed OTEL shutdowns (e.g. exporter flush failures) are visible in diagnostics.

— Claude Code

return originalValue;
}

return function (...args) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Proxy get trap returns a new wrapper function on every property access. This means obj.method !== obj.method, which breaks any code that compares function references or uses them as map keys. Consider caching the wrapped functions (e.g. in a WeakMap or a plain object keyed by prop) so the same wrapper is returned for repeated accesses of the same method.

— Claude Code

@claude
Copy link
Copy Markdown

claude bot commented Apr 2, 2026

  • @opentelemetry/api is unconditionally required in lib/instrumentation/simple.js even when ENABLE_OTEL=false, contradicting the zero-overhead claim
    • Gate the require behind the enableOtel check, like lib/otel.js and lib/server.js do
  • instrumentVault, instrumentStorage, instrumentMetadata, instrumentApi, instrumentServices are dead code (exported but never imported)
    • Either wire them into server startup or remove them until needed
  • Proxy get trap creates a new wrapper function on every property access, breaking function identity (obj.method !== obj.method)
    • Cache wrapped functions by property name
  • OTEL shutdown handler silently swallows errors with .catch(() => {})
    • Log the error to stderr at minimum
  • @types/pg resolution in package.json is unrelated to this feature
    • Move to a separate commit or remove from this PR

Review by Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants