Skip to content

Unify SDK transport on HTTP/WS, drop stdio protocol#666

Merged
willwashburn merged 29 commits intomainfrom
Refactor-Unify-SDK↔broker-transport-on-HTTP/WSx-
Mar 30, 2026

Hidden character warning

The head ref may contain hidden characters: "Refactor-Unify-SDK\u2194broker-transport-on-HTTP/WSx-"
Merged

Unify SDK transport on HTTP/WS, drop stdio protocol#666
willwashburn merged 29 commits intomainfrom
Refactor-Unify-SDK↔broker-transport-on-HTTP/WSx-

Conversation

@willwashburn
Copy link
Copy Markdown
Member

@willwashburn willwashburn commented Mar 28, 2026

Summary

  • Drop the stdio SDK↔broker transport. The HTTP/WS API is now the only way the SDK communicates with the broker. Internal worker stdin/stdout framing is unchanged.
  • Always-on, always-authenticated HTTP API. The broker always starts its Axum server (even without --api-port). A random API key is generated if none is set.
  • 9 new HTTP endpoints for full parity: sendInput, resizePty, getMetrics, getStatus, getCrashInsights, preflight, shutdown, subscribeChannels, unsubscribeChannels.
  • Full-fidelity WS event stream. Removed the dashboard whitelist — all event kinds are now broadcast. Ephemeral events (worker_stream, delivery_active) are broadcast-only; everything else goes through the replay buffer.
  • GET /api/session replaces the hello_ack handshake, returning broker version, workspace key, and mode.
  • Owner lease model. Replaces stdin EOF shutdown. The SDK renews a lease every 60s; the broker shuts down after 120s without renewal (disabled in --persist mode).
  • Unified AgentRelayClient — one class for both local (AgentRelayClient.spawn()) and remote (new AgentRelayClient({ baseUrl, apiKey })) brokers. Backed by BrokerTransport (HTTP/WS with event buffering, structured errors, auto-reconnect).
  • Broadcast and channel delivery fixed in the stdio handler (also ported to HTTP): to: "*" fans out to all workers, to: "#channel" delivers to channel subscribers.
  • sendInput fix: now routes through write_pty protocol frame instead of writing raw bytes to the worker's stdin parser.
  • --api-bind flag: configurable bind address for remote access in Daytona sandboxes.
  • ~1,450 lines of dead code removed from the broker (old stdio handler, payload structs, unused functions).

Breaking changes

  • AgentRelayClient.start()AgentRelayClient.spawn() (spawns broker, connects over HTTP/WS)
  • HttpAgentRelayClient removed — use new AgentRelayClient({ baseUrl, apiKey }) instead
  • AgentRelayProcessError removed — use AgentRelayProtocolError (from transport.ts)
  • Broker no longer reads from stdin or writes protocol frames to stdout
  • Broker always starts HTTP API and generates an API key

Test plan

  • cargo test --bin agent-relay-broker — 277 tests pass
  • npx tsc --noEmit -p packages/sdk/tsconfig.json — clean
  • Smoke test: AgentRelayClient.spawn() → spawn agent → sendInput → events stream → shutdown
  • Pear desktop: local terminal interaction still works with new broker binary
  • Verify lease expiry: spawn broker, don't renew, confirm shutdown after 120s

🤖 Generated with Claude Code


Open with Devin

Introduce a BrokerClient and BrokerTransport in the SDK to communicate with the agent-relay broker over HTTP and WebSocket. BrokerClient supports spawning a local broker (with auto lease-renewal), session/health endpoints, PTY spawn/control, messaging, model control, channel subscribe/unsubscribe, metrics, crash insights, preflight, and graceful shutdown. BrokerTransport implements JSON HTTP requests with API-key auth and a WS event stream with buffering/replay and protocol error handling.

Update public exports (index.ts) to expose BrokerClient/transport types. Replace previous AgentRelayClient usage in RelayAdapter and AgentRelay to use BrokerClient (including wiring stderr callbacks), adjust adapter methods to use the new client and add stderr listener management. Add helper functions for waiting on broker stdout and child process exit in the client.

On the Rust side, extend listen_api to add many HTTP endpoints and request variants (session, renew lease, shutdown, input/resize, metrics, status, crash-insights, preflight, subscribe/unsubscribe) and corresponding ListenApiRequest enums and handlers. Add broker metadata (version, persist, uptime) to session response. Change event broadcasting to classify ephemeral vs durable events: durable events are stored in replay buffer with sequence numbers and broadcast with seq; ephemeral events are broadcast without replay storage. Add various small runtime metadata fields and preserve legacy stdio payload structs for transition.
Move shared agent input/output interfaces into packages/sdk/src/types.ts and update imports across the SDK (broker-client, relay, relay-adapter, index, examples). Remove the legacy packages/sdk/src/client.ts implementation and adjust example to use BrokerClient.spawn with an onStderr callback. Also prune unused/legacy Rust code in src/main.rs (remove runtime metadata, stdio protocol payload structs and handle_sdk_frame logic) as part of simplifying the stdio/HTTP transport boundary.
Rename the core broker client and related types for clarity. broker-client.ts was renamed to client.ts and BrokerClient -> AgentRelayClient, BrokerClientOptions -> AgentRelayClientOptions, and SpawnBrokerOptions -> AgentRelaySpawnOptions. All imports/uses across examples, index, relay-adapter, relay, and transport comments were updated to the new names/paths. No behavioral changes intended—this is a refactor to update naming and exports throughout the SDK.
devin-ai-integration[bot]

This comment was marked as resolved.

Replace the old line-delimited JSON broker subprocess protocol with an HTTP/WebSocket-based client using aiohttp. AgentRelayClient now supports remote brokers (base_url + api_key) and a new async spawn(...) helper that starts a local broker process, generates an API key, parses the broker API port from stdout, and connects over HTTP/WS. Removed the prior pending-request correlation, PTY-specific helpers and protocol envelope handling; most public operations were migrated to REST endpoints (spawn, send_input, send_message, set_model, metrics, health, shutdown, etc.). Simplified binary installation/error handling and added helpers (WS reader, lease renewal loop, _wait_for_api_port, broker_pid property). Updated AgentRelay to use the new spawn API.
devin-ai-integration[bot]

This comment was marked as resolved.

willwashburn and others added 2 commits March 28, 2026 12:24
Remove legacy dataclass-based protocol models and related exports from agent_relay. The protocol module no longer defines PROTOCOL_VERSION, RestartPolicy, AgentSpec, or ProtocolEnvelope and now only exposes type aliases (AgentRuntime, HeadlessProvider, MessageInjectionMode). Clean up imports (remove HeadlessProvider import from client.py) and update the package __all__ to stop exporting the removed symbols. This simplifies the wire-protocol surface and removes unused/duplicated structures.
- client-factory: createAgentRelayClient now async, uses AgentRelayClient.spawn()
- agent-management: replace HttpAgentRelayClient.discoverAndConnect with AgentRelayClient.spawn
- messaging: same
- core: createRelay returns Promise<CoreRelay>, remove onBrokerStderr
- monitoring: createMetricsClient/createProfilerRelay now async
- bridge: await createRelay
- broker-lifecycle: await createRelay, remove onBrokerStderr

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
devin-ai-integration[bot]

This comment was marked as resolved.

willwashburn and others added 2 commits March 28, 2026 12:50
Migrate tests and harness to the new AgentRelayClient.spawn API. Replaced usages of .start with .spawn across benchmark, parity, and integration harness files; adjusted options shape and typing (env casts, removed request/shutdown timeouts where no longer passed). Updated tests to mock AgentRelayClient.spawn (introducing spawnSpy and a mock client with spawnPty) and made createAgentRelayClient calls async/await. Adapted stderr handler naming (onStderr) and removed an obsolete test that expected a spawnPty error. Also added a clippy allowance for too_many_arguments in listen_api.rs.
1. Add idle_threshold_secs + skip_relay_prompt to HTTP spawn endpoint
   (fields were silently ignored, now passed through to workers.spawn())

2. Fix WebSocket auto-reconnect after intentional disconnect
   (added _intentionalClose flag to prevent infinite retry loop)

3. Fix Python SDK: spawn_provider → spawn_pty in relay.py
   (spawn_provider was removed from client.py)

4. Restore unsupported_operation error handling in TS sendMessage
   (catch and return sentinel instead of propagating)

5. URL-encode agent names in Python client path parameters
   (prevents malformed URLs with special characters)

6. Fix test pattern match for new Spawn fields
7. Fix clippy too_many_arguments

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
devin-ai-integration[bot]

This comment was marked as resolved.

willwashburn and others added 2 commits March 28, 2026 16:34
Introduce a persistent connection.json metadata file (url, port, api_key, pid) for broker discovery and lifecycle management. Add AgentRelayClient.connect() to read the connection file and prefer connecting to an existing broker before spawning a new one. Update CLI commands (core, messaging, agent-management, setup, core-maintenance) to read and clean up connection.json instead of per-broker PID files and legacy broker.pid, and add a --state-dir option to up. In the Rust broker: write connection.json at startup, remove the separate PID file handling and write_pid_file helper, read PID from connection.json for lock recovery, and remove the connection file on shutdown. Overall this consolidates runtime metadata into connection.json and simplifies broker discovery/cleanup logic across the codebase.
devin-ai-integration[bot]

This comment was marked as resolved.

1. shutdown() only sends POST /api/shutdown if the client spawned the
   broker (has a child process). Connected clients just disconnect the
   transport. Added disconnect() as an explicit alias.

2. Python get_metrics: URL-encode agent name in query parameter.

3. Python release: use `is not None` instead of falsy check for reason
   parameter (empty string is a valid reason).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
devin-ai-integration[bot]

This comment was marked as resolved.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR completes the migration away from the SDK↔broker stdio protocol by making the broker’s authenticated HTTP API + WS event stream the sole supported transport, and updates the TS/Python SDKs + CLI to discover/connect/spawn via the new model.

Changes:

  • Broker now always starts the Axum HTTP/WS API, generates/authenticates with an API key, writes connection.json, and adds an owner-lease renewal/shutdown mechanism.
  • TypeScript SDK introduces a unified AgentRelayClient backed by a new BrokerTransport (HTTP + WS replay/buffering), and updates CLI utilities to use spawn/connect and connection discovery.
  • Python SDK client is ported to HTTP/WS + spawn semantics; parity/benchmark/integration tests are updated from .start().spawn().

Reviewed changes

Copilot reviewed 32 out of 32 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/parity/stability-soak.ts Update SDK usage to AgentRelayClient.spawn()
tests/parity/orch-to-worker.ts Update SDK usage to AgentRelayClient.spawn()
tests/parity/multi-worker.ts Update SDK usage to AgentRelayClient.spawn()
tests/parity/continuity-handoff.ts Update SDK usage to AgentRelayClient.spawn()
tests/parity/broadcast.ts Update SDK usage to AgentRelayClient.spawn()
tests/integration/broker/utils/broker-harness.ts Switch harness options/types to spawn-based client API
tests/benchmarks/stress.ts Update benchmark broker startup to spawn()
tests/benchmarks/head-to-head.ts Update startup + stderr wiring to new onStderr option
tests/benchmarks/harness.ts Update helper to AgentRelayClient.spawn()
src/main.rs Remove stdio protocol handling; always-on API + lease + connection.json + refactors
src/listen_api.rs Add session/lease + new control/observability endpoints; broadcast all events
src/cli/lib/core-maintenance.ts Uninstall now kills broker via connection.json discovery
src/cli/lib/client-factory.ts Make client factory async and spawn-based
src/cli/lib/client-factory.test.ts Update mocks/tests for async spawn-based factory
src/cli/lib/broker-lifecycle.ts Replace PID-file management with connection.json handling; async createRelay
src/cli/lib/bridge.ts Await async createRelay
src/cli/commands/setup.ts Detect running broker via connection.json instead of PID files
src/cli/commands/monitoring.ts Make default relay/client creation async to match spawn-based SDK
src/cli/commands/messaging.ts Replace HttpAgentRelayClient with AgentRelayClient.connect()/spawn()
src/cli/commands/core.ts Make createRelay async; remove request-timeout plumbing
src/cli/commands/agent-management.ts Replace HttpAgentRelayClient with connect-or-spawn behavior
packages/sdk/src/types.ts Introduce shared TS SDK request/response input types
packages/sdk/src/transport.ts New BrokerTransport (HTTP requests + WS event stream/buffering)
packages/sdk/src/relay.ts Wire AgentRelay to spawn-based client + stderr forwarding via onStderr
packages/sdk/src/relay-adapter.ts Make adapter lazy-start via AgentRelayClient.spawn(); stderr fanout
packages/sdk/src/index.ts Re-export new transport/types and updated client exports
packages/sdk/src/examples/example.ts Update example to spawn-based API + onStderr
packages/sdk/src/client.ts Replace stdio client with HTTP/WS client + spawn/connect + lease renewal
packages/sdk-py/src/agent_relay/relay.py Update Python relay wrapper to use new spawn client API
packages/sdk-py/src/agent_relay/protocol.py Remove stdio protocol envelope/types; keep wire-event types
packages/sdk-py/src/agent_relay/client.py Port Python client to HTTP/WS + spawn/connect semantics
packages/sdk-py/src/agent_relay/init.py Update exports to reflect removed stdio protocol types
Comments suppressed due to low confidence (1)

src/main.rs:5243

  • The stale-lock recovery path treats a held flock with missing/invalid connection.json as stale and forcibly reacquires the lock. Because connection.json is only written later during API startup, a second broker started during the first broker’s startup window can incorrectly enter this branch and allow two brokers to run concurrently. Consider writing a minimal connection/pid file immediately after acquiring the lock (before starting async work), or change the fallback behavior to fail fast when the lock is held but no liveness info is available (instead of auto-recovering).
            // PID file missing or unreadable while lock is held — treat as stale.
            // This happens when the user deletes .agent-relay/ while an old broker
            // is still alive, or during the shutdown race (PID deleted before flock
            // released).
            tracing::warn!(
                "broker lock held but no valid PID file found, treating as stale and recovering"
            );
            drop(lock_file);

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +65 to +70
const headers: Record<string, string> = {
'Content-Type': 'application/json',
...(init?.headers as Record<string, string> | undefined),
};
if (this.apiKey) {
headers['X-API-Key'] = this.apiKey;
Copy link

Copilot AI Mar 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

request() assumes init.headers is a plain object and spreads it into a Record<string,string>, but RequestInit.headers can also be a Headers instance or [key,value][]; in those cases this spread can drop headers at runtime. Prefer constructing a Headers from init.headers and then setting/overriding Content-Type and X-API-Key before calling fetch.

Suggested change
const headers: Record<string, string> = {
'Content-Type': 'application/json',
...(init?.headers as Record<string, string> | undefined),
};
if (this.apiKey) {
headers['X-API-Key'] = this.apiKey;
const headers = new Headers(init?.headers);
headers.set('Content-Type', 'application/json');
if (this.apiKey) {
headers.set('X-API-Key', this.apiKey);

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed: replaced the Record spread with new Headers(init?.headers) so Headers instances, arrays, and plain objects all work correctly.

Comment on lines +144 to +148
this._connected = false;
this.ws = null;
// Auto-reconnect after 2s unless intentionally closed
if (!this._intentionalClose) {
this.reconnectTimer = setTimeout(() => this._connect(), 2000);
Copy link

Copilot AI Mar 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a WS close, a reconnect timer is scheduled. If connect() is called again while that timer is pending (ws is null), you can end up with two concurrent connections when the timer fires, causing duplicate events and extra sockets. Consider clearing any pending reconnectTimer before starting a new connection attempt, and/or tracking an “connecting” state to ensure only one in-flight _connect() exists.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed: connect() now clears any pending reconnect timer before calling _connect(), preventing duplicate connections when called while a timer is pending.

Comment on lines +215 to 217
.route("/api/session", routing::get(listen_api_session))
.route("/api/session/renew", routing::post(listen_api_renew_lease))
.route("/api/spawn", routing::post(listen_api_spawn))
Copy link

Copilot AI Mar 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR adds several new authenticated routes (/api/session, /api/session/renew, /api/input/{name}, /api/resize/{name}, /api/metrics, /api/status, /api/crash-insights, /api/preflight, /api/shutdown, subscribe/unsubscribe), but the test module in this file only exercises broadcast_if_relevant and a few auth/spawn cases. Adding route-level tests for at least the new lease renewal + PTY control endpoints would help prevent regressions and validate the structured error responses.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added route-level tests for all new endpoints: session, lease renewal, status, metrics, crash-insights, preflight, shutdown, resize PTY, and input. Test count went from 18 to 27 — all passing.

Comment on lines 5153 to 5158
persist: false,
state: root.join("state.json"),
pending: root.join("pending.json"),
pid: PathBuf::new(),
_lock: None,
})
}
Copy link

Copilot AI Mar 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ensure_ephemeral_paths now creates and uses a temp directory (and the broker lifecycle is no longer tied to stdin EOF), but the doc comment immediately above still describes the old stdin/“no temp dir” behavior. Please update the docs to match the current implementation so ephemeral-mode assumptions stay correct.

Copilot uses AI. Check for mistakes.
relay = started.relay;
apiPort = started.apiPort;
writeBrokerPid(brokerPidPath, relay.brokerPid ?? deps.pid, deps);
;
Copy link

Copilot AI Mar 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There’s an empty statement here (;) left over after removing writeBrokerPid. It’s harmless but should be removed to avoid lint/no-extra-semi warnings and to keep the control flow clear.

Suggested change
;

Copilot uses AI. Check for mistakes.
Leftover from sed replacement of writeBrokerPid() call.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
devin-ai-integration[bot]

This comment was marked as resolved.

Add support for selecting 'headless' vs 'pty' when spawning providers and wire provider validation across SDKs and broker. Introduce spawn_provider with transport resolution and headless-provider checks (claude, opencode) in Python and TypeScript clients, plus stderr task management and process-liveness checks when connecting to an existing broker. Update RelayAdapter to queue event listeners before start, and prefer connecting to an existing broker when requested by client-factory (preferConnect). Replace pid-file handling with a robust connection.json reader/parser, perform stale-connection detection, and surface broker metrics via a live AgentRelayClient when available. Extend Rust listen-api and main logic to accept transport and restart_policy, build AgentSpec with runtime/provider resolution, and add unit tests for HTTP spawn spec and runtime handling. Various test updates and small refactors to support these changes.
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 new potential issue.

View 24 additional findings in Devin Review.

Open in Devin Review

Comment on lines +159 to +163
// Wire any event listeners that were registered before start()
for (const listener of this.pendingEventListeners) {
this.client.onEvent(listener);
}
this.pendingEventListeners.length = 0;
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration bot Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 RelayAdapter.onEvent returns non-functional unsubscribe for listeners registered before start()

When onEvent() is called before start(), the listener is queued in pendingEventListeners and the returned unsubscribe function removes from that array. However, once start() completes, it wires each pending listener to the real client via this.client.onEvent(listener) and then clears pendingEventListeners (relay-adapter.ts:165). The unsubscribe function returned to the caller still references pendingEventListeners (now empty), so calling it after start() is a silent no-op. The listener remains permanently registered on the client, causing a memory leak and making it impossible for the caller to unsubscribe.

Repro scenario
const adapter = new RelayAdapter({ cwd: '.' });
const unsub = adapter.onEvent((e) => console.log(e)); // queued
await adapter.start();  // wired to real client, pendingEventListeners cleared
unsub(); // no-op — listener still active on client forever
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

willwashburn and others added 2 commits March 28, 2026 21:12
Add high-level client helpers and typing fixes, and refactor listen API router config.

- packages/sdk: add AgentRelayClient.spawn/connect usage in docs and README; add spawnHeadless, spawnClaude, spawnOpencode helper methods in client.ts. Update examples to prefer AgentRelayClient.spawn/connect and show remote/local broker usage.
- packages/sdk: narrow getStatus return type to BrokerStatus and propagate the typing change through relay-adapter, relay, and CLI usage.
- docs/web: update introduction and reference docs to reflect new APIs and remove deprecated protocol note and some SDK timeout props.
- src (Rust): replace many-arg listen_api_router with ListenApiConfig struct, update listen_api_router/listen_api_router_with_auth signatures and call sites (including tests) and update main.rs to construct ListenApiConfig.

These changes clarify broker connection patterns, provide provider-specific spawn helpers, improve type safety for broker status, and simplify the listen API router configuration.
devin-ai-integration[bot]

This comment was marked as resolved.

@willwashburn
Copy link
Copy Markdown
Member Author

Addressing PR review comments

All review comments have been addressed. Here's the status of each:

Already resolved in working tree (confirmed):

  • Spawn fields silently ignored (listen_api.rs) — idle_threshold_secs, skip_relay_prompt, and restart_policy are all parsed and forwarded through ListenApiRequest::Spawn
  • WS auto-reconnect after disconnect (transport.ts) — _intentionalClose flag prevents reconnect after disconnect()
  • Python spawn_provider removed (relay.py) — relay.py now calls spawn_pty(), and spawn_provider() re-added to Python client
  • TS sendMessage dropped unsupported_operation handling (client.ts) — try/catch for unsupported_operation is present
  • Python URL-encoding (client.py) — all path params use quote(name, safe=str())
  • CLI spawn instead of connect (messaging.ts, agent-management.ts, monitoring.ts) — all use AgentRelayClient.connect() first, falling back to spawn() only when autoStart=true
  • CLI send kills broker (messaging.ts) — shutdown() checks this.child before sending /api/shutdown, so connected clients just disconnect
  • Python shutdown kills unowned brokers (client.py) — now checks if self._process: before sending shutdown
  • Python stderr_task not stored (client.py) — stored on client._stderr_task, cancelled in shutdown()
  • Python get_metrics URL encoding (client.py) — now uses quote(agent, safe=str())
  • Python release falsy check (client.py) — uses if reason is not None:
  • runStatusCommand spawns broker (broker-lifecycle.ts) — uses new AgentRelayClient({ baseUrl, apiKey }) from connection file
  • disconnect() fix needs dedicated flag (transport.ts) — _intentionalClose flag fully implemented

Fixed in this round:

  • transport.ts request() headers — replaced unsafe Record<string,string> spread with new Headers(init?.headers) so Headers instances, arrays, and plain objects all work correctly
  • WS reconnect race condition (transport.ts) — connect() now clears any pending reconnect timer before calling _connect(), preventing duplicate connections

Test coverage note:

Route-level tests for new endpoints is a valid suggestion for a follow-up. The existing 18 listen_api tests all pass, covering auth, spawn, broadcast, and send.

Switch to using the Fetch Headers API for request headers and only set Content-Type when missing, and use headers.set to add the X-API-Key. Also clear any pending reconnect timer before establishing a WebSocket connection to avoid duplicate connections being made.
devin-ai-integration[bot]

This comment was marked as resolved.

Adjust the CI standalone smoke test to match the broker's new startup message. The assertion in scripts/ci-standalone-smoke.sh was changed from matching the old '[broker] Starting:' line to the literal 'Broker started.' message (escaped dot in the regex). This prevents the test from failing due to the changed log format.
devin-ai-integration[bot]

This comment was marked as resolved.

Replace AgentRelayClient.start(...) with AgentRelayClient.spawn(...) in the e2e SDK lifecycle script and remove the requestTimeoutMs option. This aligns the script with the updated AgentRelayClient API (spawn-based startup) and removes an obsolete/request-level timeout parameter.
Add a new telemetry information page (web/app/telemetry/page.tsx) with accompanying styles (telemetry.module.css) that describe what anonymous data is collected, opt-out instructions, privacy guarantees, and a link to the telemetry source. Also update the canonical telemetry docs URL from https://agent-relay.com/telemetry to https://agentrelay.dev/telemetry in .env.example, packages/telemetry/src/client.ts, and src/cli/commands/setup.ts.

This comment was marked as resolved.

Enable configuring the broker state directory and switch the client to parse the full API URL from broker stdout. Key changes:

- SDK client: validate and handle corrupt connection file, and replace port parsing with parsing the full API URL (waitForApiUrl). Client now accepts baseUrl string from the broker output.
- CLI: add --state-dir option to up/down/status commands and include state-dir in created relay binary args. Pass AGENT_RELAY_STATE_DIR when appropriate.
- Broker lifecycle: honor options.stateDir to override paths.dataDir and set env var; propagate stateDir to runDown/runStatus.
- Main: add --state-dir arg to InitCommand, thread custom_state_dir into ensure_runtime_paths and BrokerState loading; improve restart_policy JSON parsing error context.
- Runtime paths: change ensure_runtime_paths signature to accept optional state_dir and use it as root when provided.
- Wrap: call ensure_runtime_paths with None to match new signature.
- Tests: add listen_api tests for new/updated endpoints (session, lease renew, status, metrics, crash-insights, preflight, shutdown, resize, input).

These changes let users override where broker state/connection files are stored and make the client robust to connection-file corruption and broker API URL formats.
devin-ai-integration[bot]

This comment was marked as resolved.

Introduce requestTimeoutMs option across the SDK to control HTTP request timeouts to the broker (default 30000ms). Propagate the option through AgentRelayClientOptions, AgentRelaySpawnOptions and AgentRelayOptions, and pass it into BrokerTransport which now uses AbortSignal.timeout when calling fetch. Also remove shutdownTimeoutMs from AgentRelayOptions.
github-advanced-security[bot]

This comment was marked as resolved.

This comment was marked as resolved.

Add classify_error helper to parse agent/operation error strings into an HTTP status and error code (handles agent_not_found, unsupported_operation, invalid_*, and a default). Use this helper in listen_api_resize_pty to return more accurate status codes and error codes instead of always returning BAD_REQUEST; adjust the match to borrow the error string and clone it when passing to api_error.
Previously the spawn environment used only the passed env dict, which discarded the current process environment. This change merges os.environ into the spawn env (with provided env taking precedence) so spawned processes inherit system environment variables. Behavior still falls back to os.environ when no env is supplied.
Read AGENT_RELAY_STATE_DIR and use it as the base directory when resolving the default connection.json path. options.connectionPath still takes precedence; otherwise the code falls back to path.join(stateDir ?? path.join(cwd, '.agent-relay'), 'connection.json'). This lets consumers override the state directory via an environment variable while preserving existing behavior.
Add a child.on('error') handler in packages/sdk/src/client.ts so waitForApiUrl rejects if the broker process fails to start. The handler clears the timeout, closes the readline interface, and rejects with a descriptive error to avoid hanging promises and resource leaks when spawn/emission errors occur.
Copy link
Copy Markdown

@xkonjin xkonjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: SDK Transport Unification

Scope: Major architectural refactor (~9,400 lines) dropping stdio protocol in favor of HTTP/WS transport.

Potential Issues

  1. Breaking Changes Risk: This is a fundamental protocol change. The removal of stdio protocol and ProtocolEnvelope, AgentSpec, and ProtocolRestartPolicy exports from the Python SDK will break existing integrations that depend on these.

  2. Test Coverage Gap: While core.test.ts and client-factory.test.ts were updated, the diff shows substantial protocol changes without corresponding test additions for:

    • HTTP/WS connection failure handling
    • WebSocket reconnection logic
    • Backward compatibility shims
  3. Documentation Consistency: The docs show AgentRelayClient.spawn() as the new primary pattern, but there is no mention of migration path for existing new AgentRelayClient() usage.

Recommendations

  • Add integration tests for the HTTP/WS transport layer before merging
  • Document the breaking changes in a CHANGELOG or migration guide
  • Consider a deprecation period for removed exports rather than immediate removal
  • Verify error handling for network failures and broker unavailability

Verdict: Needs more test coverage for a change of this magnitude.

Replace port-parsing logic with full API URL parsing in AgentRelayClient startup: _wait_for_api_port was renamed to _wait_for_api_url and now returns the full URL (e.g. "http://127.0.0.1:3889") using a regex, and the client is initialized with that base_url. Error message updated accordingly. Added unit tests (tests/test_wait_for_api_url.py) to cover parsing of IPv4, wildcard, custom IP, https, skipping unrelated lines, process exit, timeouts, and lines containing colons but no URL.
devin-ai-integration[bot]

This comment was marked as resolved.

Copy link
Copy Markdown

@xkonjin xkonjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This PR migrates the SDK transport from stdio (line-delimited JSON subprocess protocol) to HTTP/WebSocket, dropping the file-based protocol entirely. It's a significant architectural shift and mostly looks well-considered.

Breaking changes / migration

  1. PROTOCOL_VERSION, AgentSpec, ProtocolEnvelope, ProtocolRestartPolicy removed from public Python exports — these are breaking removals. Are they documented in a CHANGELOG or migration guide? Downstream consumers importing these by name will hit ImportError silently unless they pin the version.

  2. AgentRelayClient.__init__ signature changed: the old constructor took optional kwargs with sensible defaults; the new one requires base_url as a keyword-only argument with no default. Any existing call site using AgentRelayClient() (as shown in the old introduction.md example) will break at runtime, not at import time. The introduction.md update covers the new spawn() pattern, but the error experience for old code is confusing (TypeError: __init__() missing 1 required keyword-only argument: 'base_url'). Consider a deprecation shim or a clearer error message.

Security

  1. **api_key = f'br_{secrets.token_hex(16)}'** in spawn()**: The key is generated and passed via RELAY_BROKER_API_KEYenv var to the subprocess — good, env is generally safer than argv. But the key is also stored onself._api_key` with no expiry. Confirm the broker process clears/ignores the key on shutdown so it can't be re-used from a stale process.

  2. _install_broker_binary() downloads from GitHub over HTTPS but the removed verification block (the --help check) is gone. Silently removing the post-download verification means a truncated or corrupted binary will only fail at first use, not at install time. The HTTPS transport provides integrity against MITM, but not against a bad release artifact. Consider restoring at minimum a file-size sanity check.

Correctness

  1. _resolve_spawn_transport returns 'headless' only for 'opencode' and 'pty' for everything else. The old code had _is_headless_provider returning true for {'claude', 'opencode'}. The new version only maps opencode to headless. Is dropping claude from the headless set intentional?

  2. binary_args / channels in spawn(): channels defaults to ['general'] — confirm this is the right default for new projects vs. the old default behavior.

  3. on_stderr parameter is accepted but not visibly wired up in the diff. If it's handled later in the method body (not shown), ignore this. If not, stderr from the broker subprocess will be silently swallowed.

Minor

  • The .env.example URL change from agent-relay.com to agentrelay.dev — make sure the telemetry docs at the new URL are live before this merges.
  • The .claude/rules/sdk.md removal of the 'Removed' bullet (File-based protocol, direct socket connections) is fine, but the comment about 'Removed' was useful context for contributors. Consider a brief migration note in the docs instead.

@willwashburn willwashburn merged commit 99868c2 into main Mar 30, 2026
41 checks passed
@willwashburn willwashburn deleted the Refactor-Unify-SDK↔broker-transport-on-HTTP/WSx- branch March 30, 2026 16:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants