Native OpenTelemetry tracing for swamp CLI internals #677

@bixu

Problem

Operators using swamp have no visibility into what the CLI is doing internally during model method execution, workflow runs, or datastore operations. When something is slow or fails, there's no structured observability data to debug cause-and-effect chains. Users currently have to rely on log output and manual timing to understand performance or diagnose issues.

Proposed Solution

Add native OpenTelemetry instrumentation to swamp's core execution paths, emitting traces that capture the full lifecycle of operations. This would give operators the ability to:

  • See cause-and-effect chains: A single trace spanning workflow → job → step → method execution → resource writes, showing how operations compose
  • Debug performance issues: Identify slow vault resolutions, CEL evaluations, datastore syncs, lock acquisitions, or method executions with precise timing
  • Correlate with extension traces: User-defined extension models (like @bixu/github/repo) already emit their own OTel spans — native swamp tracing would let these appear as children of the CLI's orchestration spans, producing a complete end-to-end trace

Key instrumentation points

  • CLI command dispatch (root span per invocation)
  • Repository initialization (datastore sync, lock acquisition)
  • Model method execution (CEL evaluation, argument resolution, vault lookups, method execute() call)
  • Workflow orchestration (workflow run → job → step, with data chaining resolution)
  • Data lifecycle (resource writes, garbage collection)
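
To illustrate how these instrumentation points would nest into one trace, here is a minimal sketch in Python using only the standard library. It is not swamp's implementation and not the OpenTelemetry SDK: the `span` helper, the `swamp.*` span names, and the in-memory `FINISHED` list are hypothetical stand-ins for a real tracer and exporter.

```python
import contextvars
import secrets
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    start_ns: int = 0
    end_ns: int = 0
    attributes: dict = field(default_factory=dict)


# Tracks the span that is active in the current execution context.
_current = contextvars.ContextVar("current_span", default=None)

# Hypothetical stand-in for an exporter: finished spans are collected here.
FINISHED = []


@contextmanager
def span(name, **attributes):
    """Open a span that becomes a child of whichever span is active."""
    parent = _current.get()
    s = Span(
        name=name,
        # Children inherit the trace ID; a root span generates a new one.
        trace_id=parent.trace_id if parent else secrets.token_hex(16),
        span_id=secrets.token_hex(8),
        parent_id=parent.span_id if parent else None,
        start_ns=time.monotonic_ns(),
        attributes=dict(attributes),
    )
    token = _current.set(s)
    try:
        yield s
    finally:
        s.end_ns = time.monotonic_ns()
        _current.reset(token)  # restore the parent as the active span
        FINISHED.append(s)


def run_example():
    """Emit one trace shaped like the hierarchy described above."""
    FINISHED.clear()
    with span("swamp.cli", command="workflow run"):  # root span per invocation
        with span("swamp.repo.init"):
            pass
        with span("swamp.workflow.run"):
            with span("swamp.job"):
                with span("swamp.step"):
                    with span("swamp.method.execute"):
                        pass
    return FINISHED
```

Every span in the result shares the root's trace ID, and each `parent_id` points at the enclosing operation, which is the property that makes the cause-and-effect chain visible in a trace viewer.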

Transport and configuration

  • Default to OTLP/HTTP (/v1/traces endpoint) for maximum compatibility in heterogeneous network environments (proxies, load balancers, firewalls that may not support gRPC)
  • Configuration via environment variables following OTel conventions:
    • OTEL_EXPORTER_OTLP_ENDPOINT — collector endpoint
    • OTEL_EXPORTER_OTLP_HEADERS — auth headers (e.g. x-honeycomb-team=<key>)
    • OTEL_SERVICE_NAME — defaults to swamp
    • OTEL_TRACES_EXPORTER — defaults to otlp (set to none to disable)
  • Tracing should be off by default (zero overhead when not configured) and should activate only when OTEL_EXPORTER_OTLP_ENDPOINT is set and OTEL_TRACES_EXPORTER is not none
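
The activation rule above could be resolved along these lines. This is a sketch in Python under stated assumptions, not a proposed implementation: `TracingConfig`, `parse_headers`, and `resolve_tracing_config` are hypothetical names. It assumes the usual OTel conventions that the generic endpoint variable holds a base URL to which `/v1/traces` is appended for OTLP/HTTP, and that headers are comma-separated `key=value` pairs with URL-encoded values.

```python
import os
from dataclasses import dataclass
from urllib.parse import unquote


@dataclass
class TracingConfig:
    endpoint: str      # full OTLP/HTTP traces URL
    service_name: str
    headers: dict


def parse_headers(raw):
    """Parse 'k1=v1,k2=v2' into a dict, skipping malformed entries."""
    headers = {}
    for pair in raw.split(","):
        if "=" not in pair:
            continue
        key, value = pair.split("=", 1)
        headers[key.strip()] = unquote(value.strip())
    return headers


def resolve_tracing_config(env=os.environ):
    """Return a TracingConfig, or None when tracing should stay off."""
    base = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "").strip()
    exporter = env.get("OTEL_TRACES_EXPORTER", "otlp").strip().lower()
    # Off by default: no endpoint, or an explicit opt-out, means no tracing.
    if not base or exporter == "none":
        return None
    return TracingConfig(
        endpoint=base.rstrip("/") + "/v1/traces",
        service_name=env.get("OTEL_SERVICE_NAME", "swamp"),
        headers=parse_headers(env.get("OTEL_EXPORTER_OTLP_HEADERS", "")),
    )
```

Returning `None` before any exporter or tracer is constructed is one way to keep the unconfigured path free of tracing work; callers would check the result once at startup.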

Signal priority

If only one OTel signal can be enabled in the first iteration, it should be traces. Traces provide the most immediate value for understanding swamp's execution model, which is inherently hierarchical (workflow → job → step → method → API call). Metrics and logs can follow later.

Alternatives Considered

  • Structured logging only: Provides some observability but lacks the hierarchical parent-child relationships that make traces valuable for understanding swamp's execution model
  • Extension-only tracing (current state): Extensions like @bixu/github/repo can emit their own spans, but without native swamp instrumentation these spans are orphaned — there's no parent context from the CLI's orchestration layer to connect them into a complete trace
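
Connecting extension spans to a parent requires handing the CLI's span context to the extension. The W3C Trace Context `traceparent` format is the standard carrier for this. The Python sketch below only shows how such a value is built and read; `format_traceparent` and `parse_traceparent` are hypothetical helper names, and how swamp would actually pass the value to an extension model is left open by this issue.

```python
import re

# version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def format_traceparent(trace_id, span_id, sampled=True):
    """Build a traceparent value from the CLI's active span."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(value):
    """Return the parent context as a dict, or None if the value is invalid."""
    match = _TRACEPARENT.match(value.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid according to the Trace Context specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 1),
    }
```

An extension that receives a valid value would start its spans with the parsed `trace_id` and `parent_id`, so they attach to the orchestration span instead of starting a new, orphaned trace.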

Additional Context

We've built a @bixu/opentelemetry extension model and added tracing to @bixu/github/repo as a proof-of-concept. The extension-level tracing works well but highlights the gap: we can see the GitHub API calls but not the swamp orchestration around them (vault resolution, CEL evaluation, workflow scheduling). Native instrumentation would close that gap.

Labels: enhancement, external
