Skip to content

feat(environments): per-agent dev environment isolation with tooling#44

Open
dennisonbertram wants to merge 2 commits intomainfrom
dennisonbertram/dev-environments
Open

feat(environments): per-agent dev environment isolation with tooling#44
dennisonbertram wants to merge 2 commits intomainfrom
dennisonbertram/dev-environments

Conversation

@dennisonbertram
Copy link
Owner

Summary

  • Adds sandboxed dev environments as a new resource type alongside services and databases
  • Each environment is a gVisor-isolated container with persistent /workspace volume, WebSocket shell exec, and file transfer
  • Auto-stops after configurable idle timeout (default 30min) to save resources
  • Environments share the tenant's internal Docker network, so they can reach tenant services/databases
  • Allowed base images: node:20, python:3.12, golang:1.25, ubuntu:24.04

New API surface

Method Path Description
POST /v1/environments Create dev environment
GET /v1/environments List environments (paginated)
GET /v1/environments/{id} Get environment details
DELETE /v1/environments/{id} Delete environment
POST /v1/environments/{id}/start Start stopped environment
POST /v1/environments/{id}/stop Stop running environment
GET /v1/environments/{id}/exec WebSocket bidirectional exec
POST /v1/environments/{id}/files Upload file to /workspace
GET /v1/environments/{id}/files/* Download file from /workspace

Files changed

  • New: internal/environments/ — manager, idle detector, tests (16 passing)
  • New: internal/api/environments.go — HTTP/WebSocket handlers
  • New: internal/db/migrations/state_010_environments.sql — schema + quota
  • Modified: Docker client (6 new methods), mock, reconciler, GC, server, main

Test plan

  • All 16 new environment tests pass
  • All existing tests still pass
  • go build ./... clean
  • Manual test: create environment, exec ls, upload/download file
  • Verify idle auto-stop works with short timeout
  • Verify environment can reach tenant's services on internal network

Closes #41

🤖 Generated with Claude Code

Add sandboxed dev environments as a new resource type alongside services
and databases. Each environment is a gVisor-isolated container with a
persistent /workspace volume, shell access via WebSocket exec, and file
upload/download support.

API endpoints:
- POST/GET /v1/environments — create and list environments
- GET/DELETE /v1/environments/{id} — get and delete
- POST /v1/environments/{id}/start|stop — lifecycle control
- GET /v1/environments/{id}/exec — WebSocket bidirectional exec
- POST/GET /v1/environments/{id}/files — upload/download files

Key design decisions:
- WebSocket exec (not SSH) for simpler proxy, no key management
- Writable filesystem (unlike services) for package installation
- Persistent /workspace volume survives container restarts
- Auto-stop after configurable idle timeout (default 30min)
- Same per-tenant network so environments reach tenant's services/DBs
- Allowed base images: node:20, python:3.12, golang:1.25, ubuntu:24.04

Includes: migration (state_010), environment manager, idle detector,
Docker client extensions, reconciler/GC updates, 16 passing tests.

Closes #41

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2687abb96f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +378 to +380
_, err = m.db.ExecContext(ctx,
`UPDATE environments SET status = 'running', last_activity_at = ?, updated_at = ? WHERE id = ?`,
now, now, envID)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Reject start when stopped environment has no container

After reconciliation, missing environment containers are explicitly cleared (container_id = ''), but Start still updates the record to status='running' even when there is no container to start. In this state the API reports success yet /exec and /files calls fail because GetContainerID returns “environment has no container”. Start should fail (or recreate the container) instead of transitioning to running with an empty container_id.

Useful? React with 👍 / 👎.

Comment on lines +164 to +166
if err := tx.Commit(); err != nil {
return nil, fmt.Errorf("commit quota check: %w", err)
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep quota check and insert in one transaction

The environment quota count is committed before the new row is inserted, so concurrent creates can both pass the count check and then both insert, exceeding max_environments. This is a classic check-then-act race; quota enforcement needs to include both the count check and insert in the same serialized transaction (or an equivalent atomic SQL pattern).

Useful? React with 👍 / 👎.

Comment on lines +261 to +263
r.Body = http.MaxBytesReader(w, r.Body, 50<<20)

if err := r.ParseMultipartForm(50 << 20); err != nil {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Fix effective upload limit for environment file API

This handler advertises a 50MB multipart limit, but requests are already wrapped by the global 1MB maxBodySize middleware in setupRoutes, so uploads over 1MB are rejected before this limit is applied. As written, the file-upload endpoint cannot support the documented/expected 50MB payloads.

Useful? React with 👍 / 👎.

…eview

CRITICAL fixes:
- WebSocket CheckOrigin now rejects browser requests with Origin header
  to prevent cross-site WebSocket hijacking
- ExecAttach returns proper close function to prevent FD/connection leaks
  on every exec session
- Reorganize maxBodySize middleware per-group so 50MB file uploads work
  (global 1MB cap was blocking them)
- Safe string truncation in container naming to prevent panics on short IDs
- Path traversal hardening: strict /workspace/ prefix check, tar header
  filename sanitization against ../../ attacks

HIGH fixes:
- Atomic quota check + INSERT in single transaction for serialization
- WebSocket DoS controls: 1MB read limit, 60s deadlines, ping/pong
- File download validates exact filename match (not first file in tar)
- Global rate limiter added to WebSocket route group
- Fix RunDatabase comment to match gVisor runtime code

MEDIUM fixes:
- TouchActivity now tenant-scoped (defense in depth)
- Start/Stop SQL includes tenant_id in WHERE clause
- IdleDetector: remove double rows.Close(), check rows.Err()

LOW fixes:
- Circuit breaker backoff uses actual circuit_open_count
- Content-Disposition uses %q for proper filename quoting

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

spike: define the smallest viable dev-environment resource

1 participant