feat: format-based image registry rewrite (replaces scion-* heuristic) #8
Open
zeroasterisk wants to merge 237 commits into
…rm#293)

* fix(scion-chat-app): set channel="gchat" on ask_user dialog responses

handleDialogSubmit was using the simple SendMessage API which doesn't support structured message fields, so inbound ask_user responses arrived at the hub with no channel set (defaulting to "web"). Switch to SendStructuredMessage with Channel="gchat" to match the pattern already used by cmdMessage.

* fix: channel filtering and thread-id routing for chat channel replies

Two bugs in the chat channel routing feature:

1. Channel filtering: broker plugins now check msg.Channel and skip messages targeted at a different channel. The hub injects plugin_name into broker credentials so each plugin knows its own channel identity. This prevents cross-channel delivery (e.g., Telegram replies leaking to Google Chat).
2. Thread-id routing: the Telegram plugin now passes msg.ThreadID as message_thread_id to the Telegram Bot API when sending outbound messages. Previously, thread-id was captured on inbound messages but never forwarded on outbound, causing replies to land in the wrong forum topic.

Added SendOption variadic parameter to SendMessage, SendMessageWithKeyboard, and SendQueue.Send for backward-compatible thread-id support.

* feat(scion-chat-app): add Google Chat thread context support

Propagate thread IDs end-to-end so agents can participate in Google Chat threads:

- Inbound: auto-set ThreadID on StructuredMessage from the Google Chat event's thread context when no explicit --thread flag is used
- Inbound: propagate ThreadID on dialog submit (ask_user responses)
- Outbound: pass ThreadID from StructuredMessage to SendMessageRequest so agent replies land in the correct Google Chat thread

* fix: route outbound messages to chat-app via ChannelID

The FanOutEventBus matched msg.Channel against the bus Name, but the chat-app plugin is registered as "chat-app" while its messages use channel="gchat". Add a ChannelID field to NamedEventBus and PluginInfo so plugins can declare the channel they handle independently of their registered name. The chat-app now reports ChannelID="gchat" via GetInfo(), and the hub reads it at startup to wire routing correctly.

* design: per-topic /default agent scoping for Telegram forums

Explores how to let /default set a different default agent per forum topic (message_thread_id) rather than per-chat. Conclusion: ~85 lines of changes across store, commands, callbacks, and routing.

* feat(scion-telegram): per-topic /default agent scoping for forum groups

Add support for setting a different default agent per Telegram forum topic/thread, with the chat-wide default as fallback.

- New topic_defaults table keyed on (chat_id, thread_id)
- /default in a topic sets/shows the topic-level override
- Callback data extended: dflt:<slug>:<threadID> for topic scope
- Routing resolves topic default before chat default for both @bot-mention and unaddressed message fallback paths

* fix: address PR GoogleCloudPlatform#293 review feedback

- Add !no_sqlite build tag to resource_import_handler_test.go to fix CI vet failure (mockRoundTripper undefined when template_bootstrap_test.go is excluded)
- Guard debug log in broker.go Publish against nil msg to prevent panic
- Add fitCallback to preserve threadID suffix in Telegram callback_data when the 64-byte limit is exceeded, truncating agentSlug instead
- Add slog warning to truncateCallback when truncation occurs

* fix: address second round of PR GoogleCloudPlatform#293 review feedback

- Remove redundant channel filters from chat-app and Telegram Publish() methods: the FanOutEventBus already routes by ChannelID, and comparing against the plugin's registered name would silently drop messages
- Log errors from GetTopicDefault instead of silently ignoring them
- Return distinct error messages in chat-app when ResolveOrAutoRegister fails with a real error vs a nil mapping

* fix: address third round of PR GoogleCloudPlatform#293 review feedback

- Add early return for nil msg at top of Publish() to prevent panics in downstream handlers that dereference msg fields
- Add thread-safe ChannelName() getter on BrokerServer
- Use dynamic ChannelName() in GetInfo() instead of hardcoded "gchat"
- Use dynamic ChannelName() in both commands.go call sites

* fix: use callback_lookups for long callback data instead of truncation

Replace fitCallback() which corrupted agent slugs by truncating them to fit Telegram's 64-byte limit. Long callback payloads are now stored in the callback_lookups table with a short cblu:<id> reference. HandleCallback resolves lookup IDs before routing.

Also add defensive check for empty HubUserEmail in chat-app to prevent constructing invalid "user:" sender strings.

* fix: address fifth round of PR GoogleCloudPlatform#293 review feedback

- Use local interface instead of concrete *BrokerRPCClient type assertion in pluginChannelID() and isObserverBroker() so in-process brokers and mocks are handled correctly.
- Add nil guard for msg in fanout channel routing check.

---------

Co-authored-by: Scion <agent@scion.dev>
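
The callback_lookups approach described above (store long payloads, send a short cblu:<id> reference, resolve it before routing) could look roughly like the sketch below. It is an illustration only: the in-memory `lookupStore`, its counter-based IDs, and the `encode`/`resolve` names are assumptions standing in for the real table and HandleCallback logic, which this page does not show.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"sync"
)

// maxCallbackBytes is Telegram's limit on callback_data, as stated in the commit message.
const maxCallbackBytes = 64

// lookupPrefix marks a payload that must be resolved through the lookup store.
const lookupPrefix = "cblu:"

// lookupStore is a hypothetical in-memory stand-in for the callback_lookups table.
type lookupStore struct {
	mu     sync.Mutex
	nextID int
	rows   map[string]string
}

func newLookupStore() *lookupStore {
	return &lookupStore{rows: make(map[string]string)}
}

// encode returns data unchanged when it fits within the limit; otherwise it
// stores the full payload and returns a short cblu:<id> reference.
func (s *lookupStore) encode(data string) string {
	if len(data) <= maxCallbackBytes {
		return data
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	s.nextID++
	id := strconv.Itoa(s.nextID)
	s.rows[id] = data
	return lookupPrefix + id
}

// resolve expands a cblu:<id> reference; any other payload passes through untouched.
func (s *lookupStore) resolve(data string) (string, error) {
	id, ok := strings.CutPrefix(data, lookupPrefix)
	if !ok {
		return data, nil
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	full, found := s.rows[id]
	if !found {
		return "", fmt.Errorf("unknown callback lookup id %q", id)
	}
	return full, nil
}

func main() {
	store := newLookupStore()
	long := "dflt:" + strings.Repeat("a", 80) + ":42" // exceeds 64 bytes
	ref := store.encode(long)
	full, err := store.resolve(ref)
	fmt.Println(ref, full == long, err)
}
```

Unlike truncation, the slug and thread ID survive intact because the payload that travels through Telegram is only a key.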
…eCloudPlatform#296)

* Fix test suite leaking Hub credentials, corrupting agent state (GoogleCloudPlatform#123)

Tests that spawn sciontool (e.g., TestInitCommand_Integration) inherited live Hub env vars from the agent container, causing the subprocess to talk to the real Hub and reset the agent phase to "starting."

- Add scrubHubEnv(t) helpers that use t.Setenv to clear Hub env vars (SCION_HUB_ENDPOINT, SCION_HUB_URL, SCION_AUTH_TOKEN, SCION_AGENT_ID, SCION_AGENT_MODE) with automatic restore on test cleanup
- Filter Hub env vars from subprocess Cmd.Env in TestInitCommand_Integration as belt-and-suspenders protection
- Convert os.Setenv/os.Unsetenv to t.Setenv throughout hub_test.go and client_test.go for crash-safe env var isolation

* Add project log entry for issue GoogleCloudPlatform#123 fix

* Address PR GoogleCloudPlatform#296 review feedback in init_test.go

Replace hardcoded /tmp/sciontool-test path with t.TempDir() to avoid permission conflicts and test races. Replace map allocation in filterHubEnv with slices.Contains on the static hubEnvVars slice.
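
A filter along the lines of the filterHubEnv change might look like the following. This is a sketch assuming the function takes and returns "KEY=value" entries suitable for Cmd.Env; only the variable names and the use of slices.Contains come from the commit messages, and the actual signature may differ.

```go
package main

import (
	"fmt"
	"slices"
	"strings"
)

// hubEnvVars lists the Hub variables named in the commit message.
var hubEnvVars = []string{
	"SCION_HUB_ENDPOINT",
	"SCION_HUB_URL",
	"SCION_AUTH_TOKEN",
	"SCION_AGENT_ID",
	"SCION_AGENT_MODE",
}

// filterHubEnv returns a copy of env ("KEY=value" entries) with every Hub
// variable removed, so a spawned subprocess cannot reach the live Hub.
// The signature is an assumption made for this sketch.
func filterHubEnv(env []string) []string {
	out := make([]string, 0, len(env))
	for _, kv := range env {
		key, _, _ := strings.Cut(kv, "=")
		if slices.Contains(hubEnvVars, key) {
			continue
		}
		out = append(out, kv)
	}
	return out
}

func main() {
	env := []string{
		"PATH=/usr/bin",
		"SCION_AUTH_TOKEN=secret",
		"HOME=/root",
		"SCION_HUB_URL=https://hub.example",
	}
	fmt.Println(filterHubEnv(env)) // prints [PATH=/usr/bin HOME=/root]
}
```

In a test, the result would be assigned to the subprocess's Cmd.Env, for example `cmd.Env = filterHubEnv(os.Environ())`.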
…oogleCloudPlatform#299)

Three new documentation pages:

- External Channels: covers Telegram (bidirectional group chat), Discord (outbound webhooks), and A2A protocol bridge in one page. Summarizes concepts and links to detailed READMEs in extras/.
- Hub Setup on GCE: step-by-step walkthrough of deploying a hub using the starter-hub scripts. Covers provisioning, repo setup, TLS, and post-setup next steps.
- Multi-Broker Setup: how to connect multiple machines to a single hub for distributed agent execution. Covers architecture, broker registration, selection, and cross-broker considerations.

Sidebar updated to include all three pages.
* Add sort and filter capabilities to agent list view (GoogleCloudPlatform#71)

CLI: add --phase, --activity, --template filter flags and --sort, --reverse sort flags to 'scion list'. Validates flag values against known phases/activities. Passes phase filter server-side in hub mode for efficiency.

Web UI: add phase filter chips (All/Running/Stopped/Suspended/Error), sortable table headers (Name, Status, Updated), and sort dropdown for grid view. Filter and sort state persists to localStorage.

Closes GoogleCloudPlatform#71

* Address review feedback: input canonicalization and validation

- CLI: canonicalize --phase/--activity/--sort to lowercase in validateListFlags, remove redundant empty check on filterActivity
- Web UI: validate localStorage phase filter against known values instead of raw cast
- Web UI: validate localStorage sort config field/dir values before applying
- Web UI: handle invalid date strings in formatRelativeTime with isNaN guard
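
The CLI canonicalization step could be sketched as below. The allowed-value lists are hypothetical (borrowed from the Web UI chips and table headers mentioned above, not from the CLI source), and `canonicalize` is an illustrative helper rather than the actual validateListFlags.

```go
package main

import (
	"fmt"
	"slices"
	"strings"
)

// Hypothetical allowed values; the real lists live in the scion CLI.
var (
	knownPhases = []string{"running", "stopped", "suspended", "error"}
	knownSorts  = []string{"name", "status", "updated"}
)

// canonicalize trims and lowercases value, then checks it against allowed.
// An empty value means the flag was not set and is accepted as-is.
func canonicalize(flag, value string, allowed []string) (string, error) {
	v := strings.ToLower(strings.TrimSpace(value))
	if v == "" {
		return "", nil
	}
	if !slices.Contains(allowed, v) {
		return "", fmt.Errorf("invalid --%s %q (valid: %s)", flag, value, strings.Join(allowed, ", "))
	}
	return v, nil
}

func main() {
	phase, err := canonicalize("phase", "Running", knownPhases)
	fmt.Println(phase, err)

	_, err = canonicalize("sort", "size", knownSorts)
	fmt.Println(err)
}
```

Canonicalizing before validating means `--phase Running` and `--phase running` behave identically, and the lowercase form is what gets passed to the hub as a server-side filter.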
…rm#295)

* Add prominent disconnected overlay to web terminal

When the WebSocket connection drops, a full-terminal overlay now appears with 50% black opacity and large red "DISCONNECTED" text centered on it. The overlay appears immediately on disconnect and disappears when the connection is re-established. The small status indicator in the toolbar remains as a secondary signal.

Fixes GoogleCloudPlatform#77

* Move disconnected overlay to be a sibling of xterm container

The overlay was a child of .terminal-container, whose DOM is managed by xterm.js. Lit re-rendering the overlay on connect/disconnect state changes conflicts with xterm's DOM management.

Fix: introduce .terminal-wrapper as the relative-positioning context, make .terminal-container absolutely positioned inside it, and render the overlay as a sibling, outside xterm's managed subtree.

* Use wasConnected flag instead of terminal ref for overlay reactivity

Replace the non-reactive `this.terminal` reference in the overlay condition with a new `@state() wasConnected` flag. This fixes two issues:

1. Lit reactivity: `this.terminal` lacked `@state()` so changes to it didn't trigger re-renders. The new `wasConnected` is properly decorated as reactive state.
2. Initial connection: using `this.terminal` would flash the overlay during the brief window between terminal init and WebSocket open. `wasConnected` is only set true after the first successful connect, so the overlay only appears after a genuine disconnection.
…tore port, LISTEN/NOTIFY (GoogleCloudPlatform#304) * P0-1: switch Postgres driver from lib/pq to pgx/v5 stdlib - Add github.com/jackc/pgx/v5/stdlib (registers as "pgx") - driver_postgres.go: blank import pgx stdlib instead of lib/pq - OpenPostgres: open via sql.Open("pgx", dsn) + entsql.OpenDB - Introduce PoolConfig (applied to *sql.DB); thread through OpenSQLite/OpenPostgres and update all callers - go mod tidy drops lib/pq * P0-2: add connection pool config to DatabaseConfig - DatabaseConfig gains MaxOpenConns / MaxIdleConns / ConnMaxLifetime plus ConnMaxLifetimeDuration() helper - DefaultGlobalConfig sets sqlite pool defaults (MaxOpenConns=1, load-bearing for write serialization) - applyDatabasePoolDefaults fills postgres defaults (20/5/30m) and forces sqlite MaxOpenConns=1; called in both load paths - Mirror fields in V1DatabaseConfig + both conversion directions - Wire pool settings into entc.OpenSQLite in initStore * P0-3/P0-4: CRUD-parity test harness + spec-driven fixture generator P0-3: pkg/store/storetest/ — backend-agnostic, table-driven CRUD oracle. A Factory(t) -> store.Store is injected; generic Domain[T] descriptors drive Create/Read/Update/Delete (+optional soft-delete)/List-paginate/List-filter. Ships group + policy domains and runs green against today's CompositeStore (SQLite base + Ent DB). Ready to accept a postgresFactory for P3-2. P0-4: internal/fixturegen/ — Go-defined spec seeding >=1 row per table across all 30 domain tables, with edge cases (NULL optionals, max-length strings, nested/unicode JSON, soft-deleted agent, BLOB). Deterministic. 'go run ./internal/fixturegen' emits testdata/hub-v46-fixture.db, prints a 30-table coverage report, and caches the blob to the scratchpad mount. CI gate fails if any table has zero rows. 
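
The pool-default rules from P0-2 (Postgres defaults of 20/5/30m, SQLite always forced to one open connection) can be sketched as follows. The driver-name strings, the `applyPoolDefaults` signature, and the `Apply` method are assumptions for illustration; only the field names and numeric values come from the commit message.

```go
package main

import (
	"database/sql"
	"fmt"
	"time"
)

// PoolConfig mirrors the three pool settings named in the commit message.
type PoolConfig struct {
	MaxOpenConns    int
	MaxIdleConns    int
	ConnMaxLifetime time.Duration
}

// applyPoolDefaults fills unset Postgres values and forces SQLite to a single
// open connection, which the commit calls load-bearing for write serialization.
// The driver names "sqlite" and "postgres" are assumed here.
func applyPoolDefaults(driver string, cfg PoolConfig) PoolConfig {
	switch driver {
	case "sqlite":
		cfg.MaxOpenConns = 1 // forced, regardless of what was configured
	case "postgres":
		if cfg.MaxOpenConns == 0 {
			cfg.MaxOpenConns = 20
		}
		if cfg.MaxIdleConns == 0 {
			cfg.MaxIdleConns = 5
		}
		if cfg.ConnMaxLifetime == 0 {
			cfg.ConnMaxLifetime = 30 * time.Minute
		}
	}
	return cfg
}

// Apply pushes the settings onto a standard library *sql.DB.
func (c PoolConfig) Apply(db *sql.DB) {
	db.SetMaxOpenConns(c.MaxOpenConns)
	db.SetMaxIdleConns(c.MaxIdleConns)
	db.SetConnMaxLifetime(c.ConnMaxLifetime)
}

func main() {
	fmt.Printf("%+v\n", applyPoolDefaults("postgres", PoolConfig{}))
	fmt.Printf("%+v\n", applyPoolDefaults("sqlite", PoolConfig{MaxOpenConns: 8}))
}
```

Forcing the SQLite value rather than defaulting it means a misconfigured file cannot silently reintroduce concurrent writers.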
* feat(ent): add 23 new Ent schemas for full table parity (P1-2 + P1-3) * feat(observability): add Cloud Monitoring scaffolding for LISTEN/NOTIFY metrics (P0-5) * P2: port notification + gcp/github/token domains to Ent entadapter Add Ent-backed implementations of the notification, GCP service account, GitHub App installation, and user access token store sub-interfaces: - notification_store.go: NotificationStore (subscriptions, notifications, templates). Dispatch uses an atomic conditional update as the multi-replica claim primitive, and an optional NotificationPublisher designs in the LISTEN/NOTIFY fan-out for created/dispatched events. - external_store.go: GCPServiceAccountStore + GitHubInstallationStore + UserAccessTokenStore. GitHub create is idempotent (INSERT OR IGNORE semantics), repositories/scopes are JSON, default_scopes is CSV, and tokens support key-hash lookup. Legacy api_keys is intentionally not surfaced. - storetest: add GCPServiceAccount, SubscriptionTemplate, and NotificationSubscription CRUD-parity domains. Does not modify composite.go. * P2: port schedule, maintenance, message domains to Ent entadapter - schedule_store.go: ScheduleStore + ScheduledEventStore sub-interfaces with dialect-aware SELECT FOR UPDATE SKIP LOCKED claim helper for the ListDueSchedules / ListPendingScheduledEvents job-claim paths (plain SELECT on SQLite, SKIP LOCKED on Postgres). - maintenance_store.go: run-state RMW, AbortRunningMaintenanceOps, Go-side seed (uuid.New) replacing SQLite randomblob() UUID seeds. - message_store.go: CRUD, read flags, PurgeOldMessages, design-in PublishUserMessage hook for Postgres LISTEN/NOTIFY. - pkg/ent/client_driver.go: hand-written Client.Driver() accessor for dialect detection + raw locking queries. * feat(entadapter): port user + allowlist/invite domains to Ent (P2) Implements the Ent-backed store adapters for the user and allowlist/invite domains, plus their CRUD-parity oracle descriptors. 
pkg/store/entadapter/user_store.go (store.UserStore): - CreateUser/GetUser/GetUserByEmail/UpdateUser/UpdateUserLastSeen/ DeleteUser/ListUsers. - Case-insensitive email: emails are normalized to lower case on write (so the plain unique index enforces case-insensitive uniqueness, equivalent to the legacy UNIQUE COLLATE NOCASE) and matched with EmailEqualFold (lower(email)=lower($1)) on read. ent codegen + AutoMigrate cannot emit a real lower(email) functional index across both SQLite (tests) and Postgres, so the invariant is enforced at the port layer. - Offset-based pagination matching the legacy SQLite store. pkg/store/entadapter/allowlist_store.go (store.AllowListStore + store.InviteCodeStore): - Full allow-list + invite-code CRUD. - BulkAddAllowListEntries uses CreateBulk + OnConflictColumns(email). Ignore() for race-safe INSERT-OR-IGNORE; added/skipped counts mirror the legacy per-row semantics (existing + within-batch dups skipped). - IncrementInviteUseCount is a single atomic conditional UPDATE (revoked=false AND not expired AND (max_uses=0 OR use_count<max_uses)), which is race-free on both backends without SELECT...FOR UPDATE. The sql/lock feature is enabled and ForUpdate is available for genuine multi-statement RMW paths. - ListAllowListEntriesWithInvites batch-joins invite codes (invite_id is a plain column, not an Ent edge). Schema: - pkg/ent/schema/user.go: add nillable last_seen field (+ index) needed by UpdateUserLastSeen / lastSeen sort; document the case-insensitive email strategy. - pkg/ent/generate.go: enable --feature sql/upsert,sql/lock (required for OnConflict and ForUpdate). Tests (all passing): - pkg/store/storetest/domains_user.go: UserDomain, AllowListDomain, InviteCodeDomain oracle descriptors (kept in a separate file to avoid contending on domains.go). 
- entadapter oracle test runs the shared CRUD-parity suite directly against the new adapters; behavior tests cover case-insensitivity, bulk idempotency, conditional increment, stats, and the invite join. NOTE: Generated Ent code under pkg/ent/** is intentionally NOT included. This is a shared worktree where sibling port agents concurrently modify schemas and the same feature flags; the generated code must be regenerated at wave integration via: go generate ./pkg/ent/... Verified locally that regeneration + full build + tests pass. Per P2 scope: composite.go wiring and ensureEntUser shadow removal are deferred to P2-collapse. * P2: port secret/env_var + template/harness_config domains to Ent Add Ent-backed store implementations for the secret/env and template/harness domains, mirroring the legacy SQLite semantics: - entadapter/secret_store.go: SecretStore implementing store.SecretStore + store.EnvVarStore. Polymorphic (scope, scope_id) addressing, COALESCE target->key projection, version bump on update, get-then-update upsert, and transitive ListProgenySecrets via a created_by IN-list over the ancestor set (user scope + allow_progeny only; encrypted value withheld). - entadapter/template_store.go: TemplateStore implementing store.TemplateStore + store.HarnessConfigStore. base_template hierarchy, scope/project_id backwards-compat lookups, content_hash, JSON config/files columns, DeleteByScope. Subscription templates are owned by NotificationStore. - Direct Ent unit tests incl. a progeny-inheritance parity test. - storetest: Template/HarnessConfig/Secret/EnvVar domain descriptors wired into RunStoreSuite for cross-backend CRUD parity. * P2: port project/broker + brokersecret domains to Ent Port the project/broker domain (projects, runtime_brokers, project_contributors, project_sync_state) and the broker-auth domain (broker_secrets, broker_join_tokens) from raw SQL to Ent adapters. 
- pkg/store/entadapter/project_store.go: implements ProjectStore, RuntimeBrokerStore, ProjectProviderStore and ProjectSyncStateStore. * provider + sync-state upserts use Ent OnConflict().UpdateNewValues() (sql/upsert) keyed on the (project_id, broker_id) unique index. * runtime broker heartbeat/update use an optimistic version-CAS loop on a new internal lock_version token, serializing concurrent writers portably across SQLite (tests) and Postgres without SELECT ... FOR UPDATE. * slug lookups support case-insensitive matching (EqualFold). * project computed fields (AgentCount, ActiveBrokerCount, ProjectType) are derived via Ent queries, matching the legacy SQLite store. - pkg/store/entadapter/brokersecret_store.go: implements BrokerSecretStore (per-broker HMAC secrets + short-lived join tokens, expiry cleanup). - Project Ent schema: add operational fields for full parity (default_runtime_broker_id, shared_dirs, github_*, git_identity). - RuntimeBroker Ent schema: relax vestigial type column to Optional, add internal lock_version concurrency token. - Regenerate Ent with sql/upsert,sql/lock features. - storetest: add Project, RuntimeBroker, BrokerSecret and BrokerJoinToken CRUD-parity domains. - Unit tests for both adapters. Per the integration plan, composite.go wiring and ensureEntProject shadow removal are deferred to P2-collapse. * P2: port agent domain to Ent entadapter (XL) * chore(ent): regenerate Ent code for all 30 entity schemas Regenerated with --feature sql/upsert,sql/lock to support OnConflict upserts and ForUpdate/SKIP LOCKED job claims. * P2-collapse: collapse dual-DB into single Ent store Wire all Ent-backed sub-stores into CompositeStore via embedding, removing the raw-SQL base store and the User/Agent/Project shadow-sync machinery (ensureEntUser/ensureEntAgent/ensureEntProject). CompositeStore now serves every domain from a single Ent client and implements Close/Ping/Migrate directly. 
Collapse initStore() to open one Ent SQLite DB (no _ent shadow DSN, no MigrateGroveToProjectData, no raw sqlite.New). Register the User, AllowList, and InviteCode domains in the storetest CRUD-parity suite. Update entadapter tests for the single-DB NewCompositeStore(client) signature. go build ./... green; go test ./pkg/store/entadapter/... ./pkg/store/storetest/... green. * P2-delete: remove raw-SQL store implementation Delete the ~6k-LOC raw-SQL store (sqlite.go) and its per-domain sibling files (brokersecret, gcp_service_account, github_installation, maintenance, messages, notification, project_sync_state, schedule, scheduled_event) plus their tests, including the inline schema-migration scaffold. Keep driver.go, which registers the pure-Go SQLite driver used by Ent's SQLite backend. Repoint the two non-test consumers to the Ent-backed store: - cmd/hub_secret_migrate.go now opens an Ent client + CompositeStore. - internal/fixturegen opens via entc and seeds the Ent schema's *sql.DB. go build ./... green; no remaining production references to the raw store. * test: compile-migrate downstream suites to Ent store + fix signing-key PK Replace the removed raw-SQL store in downstream tests with an Ent-backed newTestStore helper (pkg/hub, pkg/secret) and update cmd/server_test.go and internal/fixturegen tests. Port the 8 raw-SQL DB() access sites in hub tests via a new CompositeStore.DB() escape-hatch accessor. Fix a production bug surfaced by the collapse: hub/server.go signingKeySecretID generated a non-UUID secret primary key, which the Ent secret store rejects; it now derives a deterministic UUIDv5. go build ./... green; entadapter and storetest suites green. NOTE: hub/secret/fixturegen suites now COMPILE but many tests still fail because their fixtures seed non-UUID string IDs that the UUID-PK Ent schema rejects; addressed in follow-up commits (tid() helper). 
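
For readers unfamiliar with deterministic UUIDv5 derivation, which the signing-key fix and the tid() helper both rely on, here is a standard-library sketch. The namespace value and this `tid` body are assumptions; the PR's actual helper most likely calls a UUID library and may use a different namespace.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// testNamespace is a hypothetical namespace; the bytes are the RFC 4122 DNS
// namespace (6ba7b810-9dad-11d1-80b4-00c04fd430c8), chosen only for illustration.
var testNamespace = [16]byte{
	0x6b, 0xa7, 0xb8, 0x10, 0x9d, 0xad, 0x11, 0xd1,
	0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8,
}

// tid maps a human-readable identifier to a UUIDv5 string: SHA-1 over
// namespace||name, truncated to 16 bytes, with version and variant bits set.
// The same name always yields the same UUID.
func tid(name string) string {
	h := sha1.New()
	h.Write(testNamespace[:])
	h.Write([]byte(name))

	var u [16]byte
	copy(u[:], h.Sum(nil)) // keep the first 16 of the 20 hash bytes
	u[6] = (u[6] & 0x0f) | 0x50 // version 5
	u[8] = (u[8] & 0x3f) | 0x80 // RFC 4122 variant

	s := hex.EncodeToString(u[:])
	return s[0:8] + "-" + s[8:12] + "-" + s[12:16] + "-" + s[16:20] + "-" + s[20:32]
}

func main() {
	fmt.Println(tid("agent-1"))
	fmt.Println(tid("agent-1") == tid("agent-1")) // same input, same UUID
	fmt.Println(tid("agent-1") == tid("agent-2"))
}
```

Because the mapping is a pure function of the name, fixtures that cross-reference each other by readable ID keep matching after wrapping, and assertions can compare against `tid("agent-1")` instead of a stored value.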
* test(hub): map non-UUID fixture IDs to UUIDs via tid() helper Wrap human-readable test identifiers in tid() (deterministic UUIDv5) so the UUID-PK Ent store accepts them while preserving cross-reference consistency and ID-equality assertions. Reduces pkg/hub failures from 611 to 79; remaining failures are behavioral, not ID-format, and are addressed separately. # Conflicts: # pkg/hub/handlers_project_test.go # pkg/hub/httpdispatcher_test.go * fix(store): seed maintenance ops in Migrate; initStore uses Migrate Restore raw-SQL parity: CompositeStore.Migrate now runs AutoMigrate and seeds built-in maintenance operations (the raw store seeded these in its migrations). initStore and hub test helpers call s.Migrate() so production and tests seed consistently. Fixes the maintenance-operation hub tests (404 'Operation not found'). pkg/hub failures 79 -> 71. * test(hub): satisfy Ent NotEmpty validators in fixtures Add slugs/broker names to test fixtures that previously relied on the raw store's lenient (no-validator) inserts: project/agent slugs in the logs test helper, broker slugs in embedded/profile/authz fixtures, and BrokerName on envgather ProjectProvider literals. pkg/hub failures 71 -> 57. * test(secret): map non-UUID fixture IDs to UUIDs via tid() Apply the tid() helper to pkg/secret fixtures (including a dynamically built secret ID) so the UUID-PK Ent store accepts them. pkg/secret now fully green. * test(cmd): map non-UUID fixture IDs to UUIDs via tid(); add broker slug/name Wrap broker/grove/agent IDs passed to registerGlobalProjectAndBroker and the dispatcher tests in tid(), and supply RuntimeBroker.slug / ProjectContributor broker_name to satisfy Ent validators. cmd now green except TestDeleteStopped_RequiresGroveContext, which requires the 'docker' binary (absent in this sandbox) and is unrelated to the store migration. 
# Conflicts: # cmd/server_dispatcher_test.go * test(hub): wrap remaining latent non-UUID fixture IDs Catch IDs that surfaced behind earlier failures (stale-agent-*, agent-visible-authz, agent-profile-hb, env-owner-1). No more UUID-parse errors in pkg/hub; the remaining ~56 failures are behavioral (URL paths built from old raw IDs, assertion mismatches), addressed next. * fix(entadapter): Get-by-id returns ErrNotFound for non-UUID identifiers Restore raw-SQL store parity: a malformed identifier cannot match any UUID primary key, so get-by-id lookups now report store.ErrNotFound instead of store.ErrInvalidInput. This matches the raw store (a lookup with a bad id simply returned no row) and is what callers depend on — e.g. resolveTemplate passes a template *name* to GetTemplate and relies on ErrNotFound to fall back to slug-based resolution. New parseGetID helper applied across all 17 get-by-id methods. pkg/hub failures 56 -> 40; entadapter/storetest stay green. * test(hub): fix store-less id wraps and project-route URL paths - controlchannel_client_test: revert tid() wraps (store-less path-builder test; IDs must match the expected literal paths). - github/envgather: project-scoped route handlers resolve the project by UUID id, so build paths with tid(rawID) via fmt.Sprintf instead of the old raw-id literal. pkg/hub failures 40 -> 32. * test(hub): unwrap projectIDFromServiceAccountEmail expectation The tid() sweep over-wrapped a non-ID expected value in a pure-function test; restore the literal GCP project id. * fix(ent): GCPServiceAccount.project_id is a string, not a UUID The GCP service account project_id holds the GCP *cloud project* identifier (e.g. 'my-project-123'), a free-form string — not a UUID. The schema declared it field.UUID, so entadapter CreateGCPServiceAccount/Update did parseUUID(sa.ProjectID) and rejected real GCP project ids, breaking SA mint/create with a 400 in production (storetest masked it by passing a UUID). 
Change the schema field to field.String, regenerate Ent, and store/read project_id as a string in external_store.go. Fixes ~7 hub GCP tests; pkg/hub 31 -> 23. * test(hub): fix GCP SA project-id assertion and project-settings id Unwrap the over-wrapped 'my-project' expectation now that project_id is a string, and wrap the dynamic project-settings project ID with tid(). * test(hub): fix bootstrap sync-to-finalize agent paths and storage keys Build the finalize request path from the agent's tid() UUID and seed mock storage under WorkspaceStoragePath(projectID, agent.ID) — the handler derives the workspace key from the agent's real id, not the old raw name. pkg/hub 23 -> 19. * test(hub): revert tid() over-wraps in store-less events_test events_test exercises the in-memory ChannelEventPublisher directly; its ProjectID/IDs are subject-string components, not stored UUIDs. The tid() sweep wrongly rewrote them so published subjects no longer matched the subscriptions (timeouts). Restore the literal values. pkg/hub 19 -> 12. * test(hub): fix maintenance-run path and notifications agentId queries Use tid() UUIDs in the maintenance run-detail path and the notifications agentId query params; guard list indexing with require.Len so a mismatch fails cleanly instead of panicking (panics truncate the package run). * test(hub): wrap remaining fixture IDs revealed after panic-cascade cleared Panics ([0] on empty lists) had been truncating the package run, hiding many failures and starving the tid() sweep. With those guarded, sweep the newly reached tests: wrap dynamic rune-suffix IDs and the setupProjectWithBroker / seedCreatedAgentForHarnessTest helper IDs, and convert raw query-param project IDs to tid(). No UUID-parse errors remain in pkg/hub. 
* test(hub): unwrap tid() in scheduler_test (mock store, raw ids) scheduler_test uses an in-memory mockScheduledEventStore, not the Ent store, so its ids need no UUIDs; the erroneous tid() wraps broke raw getEvent lookups and caused a nil-pointer panic that truncated the package run. * fix(ent): Template.harness may be empty (raw-store parity) A template imported from a directory that declares no harness type has an empty harness; the raw-SQL store stored it, but the Ent NotEmpty validator made BootstrapTemplatesFromDir silently skip such templates. Drop NotEmpty and regenerate. Removing the [0]-on-empty panics this caused un-truncates the hub package run (true failure count now visible). * test(hub): wrap dynamic fixture IDs in wake/workspace/signing-key tests Wrap tid() around the wake_test, setupWorkspaceProject, and empty-value signing-key secret IDs now reachable after panic removal. No panics in the hub package run. * test(hub): convert raw-id URL path segments to tid() Build GET/PUT/DELETE paths for agents/projects/brokers/templates/harness-configs and workspace sync routes from tid(rawID) so the by-id handlers resolve the entity (raw ids no longer match the UUID PKs). pkg/hub 93 -> 80. * fix(entadapter)+test(hub): FK error mapping + permissions FK fixtures mapError now distinguishes foreign-key violations (-> ErrInvalidInput, a bad reference) from unique-constraint violations (-> ErrAlreadyExists); previously both surfaced as a misleading 'already exists'/409. Seed the users/agents that group memberships and policy bindings reference (the Ent store enforces user/agent FK edges the raw store lacked), wrap remaining raw fixture/URL ids in tid(), and give the AddAgent fixtures slugs. All pkg/hub permissions tests pass. 
* fix(hub): seed creator users for agent-created agents; cascade-delete subscriptions on hard agent delete * test(hub): seed broker slug/name in dispatcher and project_cache fixtures (Ent validators) * test(hub): use tid() in principal/agent URL paths; broker slug in template_bootstrap * fix(entadapter): cascade-delete agents on project delete (raw-store parity); test(hub): seed FK users, broker_name, deterministic UUIDs * test(hub): MaxOpenConns=1 for SQLite test store (serialize writes); tid() URLs + FK user seeds in events/stopall * test(hub): unwrap over-wrapped tid() in unit tests (workspace/logfilter/gcp/web); valid-UUID NotFound cases; tid() scheduled-event URLs * fix(ent): allow empty display_name (raw-store NOT NULL parity, email fallback); test(hub): seed FK owner users, UUID policy/broker/agent IDs in authz remediation * feat(migrate): add Migration β tool (Ent-SQLite → Ent-Postgres) Implements 'scion server migrate --from sqlite://... --to postgres://...' per postgres-strategy.md §7.3. - entc.OpenSQLiteReadOnly: opens source with PRAGMA query_only=ON (no WAL write), MaxOpenConns=1 so the source is never mutated. - entc.MigrateData: generic reflection-based, dependency-ordered copy of all 30 Ent entities (FK-ordered core first), idempotent (skips rows whose PK already exists), atomic per entity (txn), chunked CreateBulk, source/dest row-count verification after each entity, plus the Group.child_groups M2M edge. FK columns are plain fields so edges are preserved via setters. - cmd/server migrate: DSN parsing (sqlite://, file:, bare path; postgres URL or keyword form), --keep-source default / --drop-source cutover, progress logging. Verified end-to-end against live CloudSQL Postgres 16 (integration test + real CLI run): full copy, idempotent re-run, FK + M2M + value round-trips, --drop-source removal. 
* feat(concurrency): dialect-aware multi-replica primitives for Postgres (P3-3..6) Add cluster-coordination primitives so N stateless hub processes can share one Postgres, each degrading to a no-op on single-writer SQLite: - store.AdvisoryLocker + entadapter TryAdvisoryLock (pg_try_advisory_lock on a dedicated conn); Scheduler.RegisterRecurringSingleton gates the heartbeat, stalled, purge, schedule-evaluator and github-health sweeps to one replica/tick. - store.ScheduledEventClaimer + ClaimScheduledEvent atomic claim; fireEvent claims one-shot events before side effects (dedup across replica startup recovery). - CompositeStore.RunSerializable: SERIALIZABLE + retry on 40001/40P01 (single run on SQLite) for future multi-row invariants. - dbmetrics.StartPoolSampler feeds DB connection-pool gauges to the P0-5 scaffold; wired into StartBackgroundServices via SetDBMetrics. Verified existing primitives correct (agent StateVersion CAS, FOR UPDATE sweeps, notification atomic dispatch). Found and documented the schedule SKIP LOCKED early-commit gap (lock released before the status transition), closed by the singleton evaluator. Audit + budget docs in scratchpad. Tests: locking_test.go (advisory no-op, serializable, claim exactly-once incl. 8-way concurrent), pool_sampler_test.go. * feat(hub): widen events to EventPublisher interface + Postgres LISTEN/NOTIFY publisher P3-7: Decouple call sites from the concrete *ChannelEventPublisher. - Add Subscribe(patterns...) (<-chan Event, func()) to the EventPublisher interface; implement it on noopEventPublisher (nil channel) — *ChannelEventPublisher already had it. - Factor the Publish* methods into a shared eventBuilder (sink func) so every backend emits identical subjects/payloads; ChannelEventPublisher embeds it. - web.go (field + SetEventPublisher), messagebroker.go and notifications.go (field + constructor) now take EventPublisher; handlers_messages.go gates SSE on "not the no-op publisher" instead of a concrete type assertion. 
P3-8: PostgresEventPublisher over pgx LISTEN/NOTIFY (cross-replica delivery). - Per-grove channels plus a global channel (flat exact-match); event type in the JSON envelope. Grove-scoped subjects publish to both the grove channel and the global channel; subscriptions group their patterns by resolved channel so an event is matched only against patterns that opted into the arriving channel (no double delivery). - 8 KB NOTIFY limit handled by reference-and-refetch via scion_event_payloads (TTL-swept so every replica can refetch). - PublishTx enrolls the NOTIFY in a caller transaction (atomic write+publish; rollback => no deliver). Delivery flows exclusively through the listener. - Listener goroutine reconnects with backoff and re-LISTENs (resubscribe); dynamic LISTEN/UNLISTEN applied on a poll (WaitForNotification timeout does not invalidate the pgconn connection). - Emits pkg/observability/dbmetrics signals (published/delivered/dropped, payload size, publish->deliver latency, reconnects, pool stats). - cmd: newEventPublisher selects the backend by database driver (postgres => PostgresEventPublisher, else ChannelEventPublisher) with safe fallback. Tests: routing/registry/payload-offload/metrics/transactional-executor unit tests run without a DB; cross-replica delivery, oversized round-trip, transactional rollback, and reconnect+resubscribe are gated behind SCION_TEST_POSTGRES_DSN. go build ./... green; full pkg/hub suite green. Note: server.go's equivalent type-assertion cleanup is left in the working tree (co-edited with concurrent P0-5/scheduler work) and is functionally optional — HEAD server.go already compiles against the widened interface. * test(store): parameterize store suites over {sqlite, postgres} (P3-2) Add pkg/store/enttest: a backend-selecting Ent client factory for the store test suites. 
Default is in-memory SQLite; built with -tags integration and SCION_TEST_POSTGRES_URL set, it provisions a per-package ephemeral Postgres database (created/dropped via TestMain) and isolates each test in its own schema (search_path) so tests never observe each other's rows. Falls back to SQLite when the env var is unset. Route all entadapter and storetest helpers through enttest.NewClient so the same CRUD-parity oracle runs unchanged against either backend. Fix two real Postgres bugs surfaced by the new path: - entadapter/dialect.go ancestryContains: emit the bind parameter via Builder.Arg ($n on Postgres) instead of a literal '?' through ExprP, which was not rebound and produced a syntax error; and use jsonb_array_elements_text (the column is jsonb on Postgres, not json). - schedule_store_test ClaimPath: make the concurrent-claim assertion backend-aware. SQLite serializes (MaxOpenConns=1, no SKIP LOCKED) so every caller sees both due rows; Postgres uses FOR UPDATE SKIP LOCKED so concurrent callers may observe a disjoint subset (0..2) and must only never error or exceed 2. Verified: full SQLite suite green; storetest CRUD parity green on CloudSQL Postgres; entadapter green on Postgres (schedule ClaimPath fix confirmed). * fix(hub): start dispatcher/broker for any subscription-capable EventPublisher Wave C integration: newEventPublisher can now return a PostgresEventPublisher (LISTEN/NOTIFY) in addition to ChannelEventPublisher. The dispatcher/broker startup previously hard-asserted *ChannelEventPublisher, which silently skipped starting them under Postgres. Gate on (not noop and not nil) instead, matching the existing pattern in handlers_messages.go. 
* fix(hub): harden Postgres event publish + verify wiring; lower PG pool default Task 1 — LISTEN/NOTIFY publish path: - Add TestPostgresIntegration_HandlerCreateProjectEmitsNotify: drives the real POST /api/v1/projects handler with a PostgresEventPublisher and asserts a pg_notify lands on scion_ev_global via an independent raw LISTEN — the exact capability the multi-replica live test probed. Verified PASSING against live CloudSQL, proving the handler -> s.events -> pg_notify wiring is correct end to end (the four pre-existing SCION_TEST_POSTGRES_DSN integration tests also pass). The multi-hub 'no NOTIFY' symptom was not reproducible against the current tree. - Bound the autocommit publish (Publish* methods) with publishTimeout (5s). These run synchronously on the caller's (request handler) goroutine and acquire from the event pool; on a connection-starved instance that acquire could block indefinitely, stalling CRUD and silently never emitting NOTIFY. The timeout converts that into a logged error + dropped event (publishing is fire-and-forget). PublishTx (transactional path) is unaffected. Task 2 — connection budget: - Lower the default Postgres MaxOpenConns 20 -> 10 so multiple replicas fit a modest connection budget (see CONNECTION-BUDGET.md). CloudSQL instance scion-postgres-test resized db-f1-micro -> db-g1-small and max_connections set to 100 (out of band). * test(store): add Postgres stress/integration suite (contention, isolation, pool, NOTIFY, migration, schema, multi-process) Add pkg/store/integrationtest/: a Postgres-only suite that exercises behavior the SQLite parity suites cannot reach. Gated by //go:build integration and SCION_TEST_POSTGRES_URL; skips cleanly otherwise. Coverage: - Contention: state_version CAS race (no lost updates, >=N-1 retries, final version==1+N), SKIP LOCKED / conditional-UPDATE event claim (single winner + disjoint drain), unique-key races (project slug, user email, agent slug). 
- Isolation: SERIALIZABLE conflict + RunSerializable retry recovery, REPEATABLE READ no-phantom snapshot, READ COMMITTED dirty-read prevention. - Pool: exhaustion + queued recovery, saturated pool honoring context deadline, long txn not starving short queries, healing after pg_terminate_backend. - LISTEN/NOTIFY: ordered burst no-drop, 8000B payload limit, listener reconnect/resume, cross-channel isolation. - Migration: 1000+ row counts + bounded-memory listing, idempotent re-migration. - Schema: NULL semantics, unicode/emoji, nested JSON + special chars, large-text non-truncation, TIMESTAMPTZ microsecond precision. - Multi-process: forks the test binary for cross-process advisory-lock exclusivity and cross-process NOTIFY delivery. Configurable concurrency via SCION_TEST_CONCURRENCY (default 10). Extend pkg/store/enttest with Active() and NewSchemaURL() so tests can open custom-pool clients and share a DSN with forked child processes; non-integration stubs keep the package API stable. * fix(db): recycle stale conns + keepalives; skip singleton tick on lock error Stale-connection pool stalls (CloudSQL drops idle conns after ~10m): - Add ConnMaxIdleTime to DatabaseConfig/PoolConfig (default 5m pg, 0 sqlite) and apply SetConnMaxIdleTime on the database/sql pool. - OpenPostgres now parses the DSN with pgx and opens via stdlib.OpenDB with TCP keepalive GUCs (idle 60s / interval 15s / count 4) and a 10s connect timeout, so a silently-dropped peer is detected instead of the first query after idle hanging on a dead socket. - pgx event pool (events_postgres.go): set keepalives + connect timeout on both the pool's ConnConfig and the dedicated listener connection, plus MaxConnIdleTime 5m / MaxConnLifetime 30m. Advisory-lock leader election (scheduler.go): - A lock-acquisition error no longer falls open to running the handler unguarded (which would duplicate singleton work across replicas); the tick is skipped and retried next interval. Added regression tests. 
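The fail-closed behavior described for the advisory-lock leader election can be sketched as follows. This is an illustration under assumptions, not the scheduler's real code: `tryLockFunc`, `runSingletonTick`, and the result constants are hypothetical, and the lock is abstracted to a function so no database is needed.

```go
package main

import "errors"

// errLockUnavailable is a hypothetical error a lock attempt might return,
// for example when the pool cannot serve a connection in time.
var errLockUnavailable = errors.New("advisory lock unavailable")

// tryLockFunc stands in for an advisory-lock attempt. It reports whether
// the lock was acquired and, if so, how to release it.
type tryLockFunc func() (acquired bool, release func(), err error)

type tickResult int

const (
	ranHandler tickResult = iota
	skippedNotLeader
	skippedLockError
)

// runSingletonTick runs handler only when the lock is held. A lock error
// skips the tick instead of running the handler unguarded, so singleton
// work is never duplicated across replicas; the next tick retries.
func runSingletonTick(tryLock tryLockFunc, handler func()) (tickResult, error) {
	acquired, release, err := tryLock()
	if err != nil {
		return skippedLockError, err
	}
	if !acquired {
		return skippedNotLeader, nil
	}
	defer release()
	handler()
	return ranHandler, nil
}
```

The design choice is that a skipped tick costs one interval of latency, while a fall-open tick could run the same sweep on every replica at once.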
Test harness (enttest/integrationtest): - Accept libpq keyword/value DSNs (not just URL form) when deriving the ephemeral db/schema/params; add WithConnParam helper. - Fix migration idempotency test's per-pass row-count expectation. * fix(store): bound advisory-lock conn checkout + unlock with short timeout TryAdvisoryLock checked a connection out of the pool and ran the unlock on the full 55s scheduler-handler context (acquire) and an unbounded context.Background() (release). On a pool that could not promptly serve a healthy connection, db.Conn() blocked for the entire 55s before failing with 'context deadline exceeded' on every tick; with several singleton handlers firing each 60s tick, those long-blocked goroutines and their pending pool connection requests piled up across ticks and kept the pool jammed (checked out client-side, idle server-side). The unbounded unlock was a second leak vector: if the held connection died mid critical-section, ExecContext could hang forever, so conn.Close() never ran and the connection leaked out of the pool permanently. Bind both the acquire (db.Conn + pg_try_advisory_lock) and the release (pg_advisory_unlock) to a 5s timeout so a bad tick fails fast and retries next tick instead of parking a goroutine for ~55s, and so a dead connection can never block release from freeing the conn. Lock semantics are unchanged: cancelling the acquire context tears down only that context, not the checked-out session that holds the lock. * feat(migrate): in-process migration α (legacy raw-SQL hub.db → Ent) Upgrade a legacy raw-SQL Hub database (the ~53-migration, 30-table schema from the removed pkg/store/sqlite store) to the consolidated Ent-backed SQLite schema, in-process on first boot, behind an automatic backup. pkg/ent/entc/migrate_alpha.go: - IsLegacyRawSQLSchema: detect via the schema_migrations sentinel + the legacy-only agents.agent_id column (no-op for an Ent/empty/absent file). 
- MigrateAlphaSQLite: backup (checkpoint WAL + copy to hub.db.bak.<ts>), AutoMigrate a fresh Ent schema, ATTACH the legacy file, copy every table with INSERT…SELECT (foreign_keys OFF), verify per-table row counts, then atomically swap the migrated file into place. - Data-driven column mapping (created_at→created, updated_at→updated, agents.agent_id→slug, policies→access_policies); bespoke SQL for the group_members/policy_bindings polymorphic splits and surrogate ids; groups.parent_id→group_child_groups edge. - Deterministic UUIDv5 remap for legacy non-UUID primary keys (internal signing-key secrets; plugin runtime-broker ids) with consistent rewrite of every foreign-key reference via a TEMP _id_remap table. - Tolerates missing legacy tables (older schema versions). cmd/server_foreground.go: detect + migrate in initStore's sqlite path, with a --no-auto-migrate operator opt-out (cmd/server.go). Validated end-to-end against four production hub.db files (scion-integration, -integration2, -demo, -gteam): exact row-count parity (up to ~19k rows), every entity reads back through the live Ent store, idempotent re-runs, and broker FK references resolve post-remap. Pre-existing dangling agent created_by/owner_id refs are faithfully preserved (loader runs FK-off). * fix(config): apply real Postgres pool size (leaked SQLite default of 1 starved the pool) The struct-level default for Database.MaxOpenConns/MaxIdleConns is 1 — the value SQLite REQUIRES to serialize writes. applyDatabasePoolDefaults only bumped postgres to a real pool when the value was <= 0, but a postgres deployment configured via env/driver override inherits the embedded default of 1, so the guard never fired and the Ent pool ran with a SINGLE connection. 
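The defaulting hazard just described can be reproduced in a few lines. The sketch below is hypothetical (`poolConfig`, `applyDefaultsOld`, and `applyDefaultsNew` are illustrative names, and the Postgres default of 10 is taken from the pool-default change mentioned elsewhere in this PR). It contrasts a guard that only fires for values `<= 0` with one that also treats an inherited 1 as unset.

```go
package main

// poolConfig is a hypothetical, reduced view of the database pool settings.
type poolConfig struct {
	Driver       string
	MaxOpenConns int
}

// postgresDefaultMaxOpen is assumed to be 10 for this sketch.
const postgresDefaultMaxOpen = 10

// applyDefaultsOld mirrors the guard that never fired: a Postgres config
// that inherited the struct-level SQLite default of 1 is left at 1.
func applyDefaultsOld(c *poolConfig) {
	if c.Driver == "postgres" && c.MaxOpenConns <= 0 {
		c.MaxOpenConns = postgresDefaultMaxOpen
	}
	if c.Driver == "sqlite" {
		c.MaxOpenConns = 1
	}
}

// applyDefaultsNew treats the leaked value (<= 1) as unset for Postgres,
// still respects explicit sizing of 2 or more, and pins SQLite to 1.
func applyDefaultsNew(c *poolConfig) {
	if c.Driver == "postgres" && c.MaxOpenConns <= 1 {
		c.MaxOpenConns = postgresDefaultMaxOpen
	}
	if c.Driver == "sqlite" {
		c.MaxOpenConns = 1
	}
}
```

With the old guard, `poolConfig{Driver: "postgres", MaxOpenConns: 1}` stays at a single connection; with the new one it is raised to the pool default.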
Effect in production (both integration hubs): every singleton scheduler tick checks out the lone pool connection to hold its advisory lock, then blocks waiting for a second connection to do its work — a self-deadlock that resolves only at the 55s handler context deadline. All API requests serialize behind the one connection, so GET /api/v1/* served in ~55s across the board. Note env overrides could not paper over this: envKeyToConfigKey splits on every underscore, so SCION_SERVER_DATABASE_MAX_OPEN_CONNS maps to database.max.open.conns, not database.max_open_conns — silently ignored. Treat the leaked SQLite default (<= 1) as 'unset' for postgres so the pool default (10) applies; explicit sizing of 2+ is still respected. SQLite remains pinned to 1. Adds regression tests for all three cases. * docs: add multi-node broker dispatch and NFS workspace designs - broker-dispatch.md: DB-as-state-machine + LISTEN/NOTIFY pattern for cross-replica broker command routing and agent lifecycle dispatch - nfs-workspace.md: NFS workspace coordination for VM (host bind-mount) and K8s/Cloud Run (per-pod mount) runtime models * fix(store): address PR GoogleCloudPlatform#304 review — context leaks and DSN parsing Thread the server's cancellable context into initStore and initWebServer instead of using context.Background(), so that: - DB migrations and the health-check ping cancel on Ctrl+C during startup (medium-priority review comment). - The Postgres LISTEN/NOTIFY event publisher goroutine shuts down cleanly when the server exits, preventing connection leaks (high-priority review comment). Also fix parseSQLiteSourceDSN to handle the file:// prefix before the file: prefix, so that file:///var/lib/hub.db correctly resolves to /var/lib/hub.db instead of ///var/lib/hub.db. Add test cases for file:// and file:/// DSN forms. * docs: add project log for PR GoogleCloudPlatform#304 review fixes * fix(store): context leak in legacy migration & double file: prefix 1. 
Thread the server's cancellable context through maybeMigrateLegacySQLite → MigrateAlphaSQLite so that Ctrl+C during first-boot legacy migration aborts it instead of running with an uncancellable context.Background(). 2. Guard against a double "file:" prefix when constructing the SQLite DSN. If the operator's database.url already starts with "file:", we no longer blindly prepend another "file:" prefix. Also correctly appends cache=shared with "&" when the DSN already contains query parameters. * fix(store): rename ProjectTypeHubNative → ProjectTypeHubManaged (rebase fixup) Upstream renamed hub-native to hub-managed while the PR was in flight. Update the two remaining references that the rebase conflict resolution missed. --------- Co-authored-by: Scion <agent@scion.dev>
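The two DSN fixes above (the double `file:` prefix and the `file://` ordering) can be illustrated with a short sketch. The helper names `sqliteDSN` and `sqlitePath` are hypothetical and the real functions may handle more cases, such as an existing `cache` parameter.

```go
package main

import "strings"

// sqliteDSN builds a SQLite DSN from an operator-supplied URL or path.
// It prepends "file:" only when missing and joins cache=shared with "&"
// when the DSN already carries query parameters.
func sqliteDSN(url string) string {
	dsn := url
	if !strings.HasPrefix(dsn, "file:") {
		dsn = "file:" + dsn
	}
	sep := "?"
	if strings.Contains(dsn, "?") {
		sep = "&"
	}
	return dsn + sep + "cache=shared"
}

// sqlitePath resolves a source DSN to a filesystem path. The "file://"
// prefix is checked before "file:" so that file:///var/lib/hub.db yields
// /var/lib/hub.db rather than ///var/lib/hub.db.
func sqlitePath(dsn string) string {
	switch {
	case strings.HasPrefix(dsn, "file://"):
		return strings.TrimPrefix(dsn, "file://")
	case strings.HasPrefix(dsn, "file:"):
		return strings.TrimPrefix(dsn, "file:")
	}
	return dsn
}
```

For example, `sqliteDSN("file:hub.db?mode=rwc")` returns `file:hub.db?mode=rwc&cache=shared` under these assumptions.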
…t token

TestClient_StartTokenRefresh exercised RefreshToken -> WriteTokenFile without isolating the token home, so running the suite inside a live agent container overwrote the real ~/.scion/scion-token with the test stub "refreshed-token". Every subsequent Hub call then 401'd with "compact JWS format must have three parts" / "unrecognized token format".

- Add SetTokenHome(t.TempDir()) to the test, matching its siblings.
- Guard WriteTokenFile: panic under `go test` unless SetTokenHome was called, so a forgotten isolation can never corrupt live state again. Reads remain unguarded (harmless; return empty when absent).
…ecycle + message routing (GoogleCloudPlatform#305) * Add canonical engineering glossary (GLOSSARY.md) (#102) * Add engineering glossary (GLOSSARY.md) with canonical terms and cleanup tracker Add a root-level GLOSSARY.md capturing canonical Scion terminology in the ubiquitous-language format (preferred term + synonyms to avoid), grouped by domain cluster, plus an Exceptions & Future Cleanup section tracking known naming-convergence work. Link it from agents.md as the canonical engineering glossary. * Revise glossary: broker reframe, Event Bus, Hub-managed, and term refinements Refine entries from review: redefine Message Broker as the pluggable messaging-integration system (add Broker plugin, Built-in broker); add Event Bus for the NATS real-time/event capability; collapse hub-native/Hub Workspace into Hub-managed project/workspace; tighten Template (harness-agnostic, optional default harness-config), Skill (template-only, Agent Skills link), Profile (named runtime-broker settings bundle), Harness/Harness-config; reframe Hub as the control plane in both modes; add Group and Message Group. Expand Exceptions & Future Cleanup to nine tracked items. * Glossary: restructure headings, add cross-refs, modes table, and new terms - Retitle to "Scion Glossary"; drop the "Language" wrapper and promote the thematic categories to top-level sections - Add an Operations section (Attach, Dispatch) and move Profile next to Runtime Broker - Add a Local/Workstation/Hosted comparison table and "See also" cross-refs across the main confusable term clusters - Reframe the intro around the three-way broker collision (incl. 
Event Bus) and defer to the disambiguation rule; sentence-case "Shared directory" - Add canonical entries for Secret, Notification, and Schedule - Add a "Potential Future Additions" section cataloguing candidate terms * Glossary: remove Exceptions & Future Cleanup tracker The cleanup items are now tracked by dedicated agents that open GitHub issues and implementation PRs, so the staged tracker no longer lives in the glossary. Reword the two intro/disambiguation references that pointed at the removed section to point at GitHub issues instead. --------- Co-authored-by: Preston Holmes <ptone@google.com> * P0-1: switch Postgres driver from lib/pq to pgx/v5 stdlib - Add github.com/jackc/pgx/v5/stdlib (registers as "pgx") - driver_postgres.go: blank import pgx stdlib instead of lib/pq - OpenPostgres: open via sql.Open("pgx", dsn) + entsql.OpenDB - Introduce PoolConfig (applied to *sql.DB); thread through OpenSQLite/OpenPostgres and update all callers - go mod tidy drops lib/pq * P0-2: add connection pool config to DatabaseConfig - DatabaseConfig gains MaxOpenConns / MaxIdleConns / ConnMaxLifetime plus ConnMaxLifetimeDuration() helper - DefaultGlobalConfig sets sqlite pool defaults (MaxOpenConns=1, load-bearing for write serialization) - applyDatabasePoolDefaults fills postgres defaults (20/5/30m) and forces sqlite MaxOpenConns=1; called in both load paths - Mirror fields in V1DatabaseConfig + both conversion directions - Wire pool settings into entc.OpenSQLite in initStore * P0-3/P0-4: CRUD-parity test harness + spec-driven fixture generator P0-3: pkg/store/storetest/ — backend-agnostic, table-driven CRUD oracle. A Factory(t) -> store.Store is injected; generic Domain[T] descriptors drive Create/Read/Update/Delete (+optional soft-delete)/List-paginate/List-filter. Ships group + policy domains and runs green against today's CompositeStore (SQLite base + Ent DB). Ready to accept a postgresFactory for P3-2. 
P0-4: internal/fixturegen/ — Go-defined spec seeding >=1 row per table across all 30 domain tables, with edge cases (NULL optionals, max-length strings, nested/unicode JSON, soft-deleted agent, BLOB). Deterministic. 'go run ./internal/fixturegen' emits testdata/hub-v46-fixture.db, prints a 30-table coverage report, and caches the blob to the scratchpad mount. CI gate fails if any table has zero rows. * feat(ent): add 23 new Ent schemas for full table parity (P1-2 + P1-3) * P2: port notification + gcp/github/token domains to Ent entadapter Add Ent-backed implementations of the notification, GCP service account, GitHub App installation, and user access token store sub-interfaces: - notification_store.go: NotificationStore (subscriptions, notifications, templates). Dispatch uses an atomic conditional update as the multi-replica claim primitive, and an optional NotificationPublisher designs in the LISTEN/NOTIFY fan-out for created/dispatched events. - external_store.go: GCPServiceAccountStore + GitHubInstallationStore + UserAccessTokenStore. GitHub create is idempotent (INSERT OR IGNORE semantics), repositories/scopes are JSON, default_scopes is CSV, and tokens support key-hash lookup. Legacy api_keys is intentionally not surfaced. - storetest: add GCPServiceAccount, SubscriptionTemplate, and NotificationSubscription CRUD-parity domains. Does not modify composite.go. * P2: port schedule, maintenance, message domains to Ent entadapter - schedule_store.go: ScheduleStore + ScheduledEventStore sub-interfaces with dialect-aware SELECT FOR UPDATE SKIP LOCKED claim helper for the ListDueSchedules / ListPendingScheduledEvents job-claim paths (plain SELECT on SQLite, SKIP LOCKED on Postgres). - maintenance_store.go: run-state RMW, AbortRunningMaintenanceOps, Go-side seed (uuid.New) replacing SQLite randomblob() UUID seeds. - message_store.go: CRUD, read flags, PurgeOldMessages, design-in PublishUserMessage hook for Postgres LISTEN/NOTIFY. 
- pkg/ent/client_driver.go: hand-written Client.Driver() accessor for dialect detection + raw locking queries. * feat(entadapter): port user + allowlist/invite domains to Ent (P2) Implements the Ent-backed store adapters for the user and allowlist/invite domains, plus their CRUD-parity oracle descriptors. pkg/store/entadapter/user_store.go (store.UserStore): - CreateUser/GetUser/GetUserByEmail/UpdateUser/UpdateUserLastSeen/ DeleteUser/ListUsers. - Case-insensitive email: emails are normalized to lower case on write (so the plain unique index enforces case-insensitive uniqueness, equivalent to the legacy UNIQUE COLLATE NOCASE) and matched with EmailEqualFold (lower(email)=lower($1)) on read. ent codegen + AutoMigrate cannot emit a real lower(email) functional index across both SQLite (tests) and Postgres, so the invariant is enforced at the port layer. - Offset-based pagination matching the legacy SQLite store. pkg/store/entadapter/allowlist_store.go (store.AllowListStore + store.InviteCodeStore): - Full allow-list + invite-code CRUD. - BulkAddAllowListEntries uses CreateBulk + OnConflictColumns(email). Ignore() for race-safe INSERT-OR-IGNORE; added/skipped counts mirror the legacy per-row semantics (existing + within-batch dups skipped). - IncrementInviteUseCount is a single atomic conditional UPDATE (revoked=false AND not expired AND (max_uses=0 OR use_count<max_uses)), which is race-free on both backends without SELECT...FOR UPDATE. The sql/lock feature is enabled and ForUpdate is available for genuine multi-statement RMW paths. - ListAllowListEntriesWithInvites batch-joins invite codes (invite_id is a plain column, not an Ent edge). Schema: - pkg/ent/schema/user.go: add nillable last_seen field (+ index) needed by UpdateUserLastSeen / lastSeen sort; document the case-insensitive email strategy. - pkg/ent/generate.go: enable --feature sql/upsert,sql/lock (required for OnConflict and ForUpdate). 
Tests (all passing): - pkg/store/storetest/domains_user.go: UserDomain, AllowListDomain, InviteCodeDomain oracle descriptors (kept in a separate file to avoid contending on domains.go). - entadapter oracle test runs the shared CRUD-parity suite directly against the new adapters; behavior tests cover case-insensitivity, bulk idempotency, conditional increment, stats, and the invite join. NOTE: Generated Ent code under pkg/ent/** is intentionally NOT included. This is a shared worktree where sibling port agents concurrently modify schemas and the same feature flags; the generated code must be regenerated at wave integration via: go generate ./pkg/ent/... Verified locally that regeneration + full build + tests pass. Per P2 scope: composite.go wiring and ensureEntUser shadow removal are deferred to P2-collapse. * P2: port secret/env_var + template/harness_config domains to Ent Add Ent-backed store implementations for the secret/env and template/harness domains, mirroring the legacy SQLite semantics: - entadapter/secret_store.go: SecretStore implementing store.SecretStore + store.EnvVarStore. Polymorphic (scope, scope_id) addressing, COALESCE target->key projection, version bump on update, get-then-update upsert, and transitive ListProgenySecrets via a created_by IN-list over the ancestor set (user scope + allow_progeny only; encrypted value withheld). - entadapter/template_store.go: TemplateStore implementing store.TemplateStore + store.HarnessConfigStore. base_template hierarchy, scope/project_id backwards-compat lookups, content_hash, JSON config/files columns, DeleteByScope. Subscription templates are owned by NotificationStore. - Direct Ent unit tests incl. a progeny-inheritance parity test. - storetest: Template/HarnessConfig/Secret/EnvVar domain descriptors wired into RunStoreSuite for cross-backend CRUD parity. 
* P2: port project/broker + brokersecret domains to Ent Port the project/broker domain (projects, runtime_brokers, project_contributors, project_sync_state) and the broker-auth domain (broker_secrets, broker_join_tokens) from raw SQL to Ent adapters. - pkg/store/entadapter/project_store.go: implements ProjectStore, RuntimeBrokerStore, ProjectProviderStore and ProjectSyncStateStore. * provider + sync-state upserts use Ent OnConflict().UpdateNewValues() (sql/upsert) keyed on the (project_id, broker_id) unique index. * runtime broker heartbeat/update use an optimistic version-CAS loop on a new internal lock_version token, serializing concurrent writers portably across SQLite (tests) and Postgres without SELECT ... FOR UPDATE. * slug lookups support case-insensitive matching (EqualFold). * project computed fields (AgentCount, ActiveBrokerCount, ProjectType) are derived via Ent queries, matching the legacy SQLite store. - pkg/store/entadapter/brokersecret_store.go: implements BrokerSecretStore (per-broker HMAC secrets + short-lived join tokens, expiry cleanup). - Project Ent schema: add operational fields for full parity (default_runtime_broker_id, shared_dirs, github_*, git_identity). - RuntimeBroker Ent schema: relax vestigial type column to Optional, add internal lock_version concurrency token. - Regenerate Ent with sql/upsert,sql/lock features. - storetest: add Project, RuntimeBroker, BrokerSecret and BrokerJoinToken CRUD-parity domains. - Unit tests for both adapters. Per the integration plan, composite.go wiring and ensureEntProject shadow removal are deferred to P2-collapse. * P2: port agent domain to Ent entadapter (XL) * chore(ent): regenerate Ent code for all 30 entity schemas Regenerated with --feature sql/upsert,sql/lock to support OnConflict upserts and ForUpdate/SKIP LOCKED job claims. 
* P2-collapse: collapse dual-DB into single Ent store Wire all Ent-backed sub-stores into CompositeStore via embedding, removing the raw-SQL base store and the User/Agent/Project shadow-sync machinery (ensureEntUser/ensureEntAgent/ensureEntProject). CompositeStore now serves every domain from a single Ent client and implements Close/Ping/Migrate directly. Collapse initStore() to open one Ent SQLite DB (no _ent shadow DSN, no MigrateGroveToProjectData, no raw sqlite.New). Register the User, AllowList, and InviteCode domains in the storetest CRUD-parity suite. Update entadapter tests for the single-DB NewCompositeStore(client) signature. go build ./... green; go test ./pkg/store/entadapter/... ./pkg/store/storetest/... green. * P2-delete: remove raw-SQL store implementation Delete the ~6k-LOC raw-SQL store (sqlite.go) and its per-domain sibling files (brokersecret, gcp_service_account, github_installation, maintenance, messages, notification, project_sync_state, schedule, scheduled_event) plus their tests, including the inline schema-migration scaffold. Keep driver.go, which registers the pure-Go SQLite driver used by Ent's SQLite backend. Repoint the two non-test consumers to the Ent-backed store: - cmd/hub_secret_migrate.go now opens an Ent client + CompositeStore. - internal/fixturegen opens via entc and seeds the Ent schema's *sql.DB. go build ./... green; no remaining production references to the raw store. * test: compile-migrate downstream suites to Ent store + fix signing-key PK Replace the removed raw-SQL store in downstream tests with an Ent-backed newTestStore helper (pkg/hub, pkg/secret) and update cmd/server_test.go and internal/fixturegen tests. Port the 8 raw-SQL DB() access sites in hub tests via a new CompositeStore.DB() escape-hatch accessor. Fix a production bug surfaced by the collapse: hub/server.go signingKeySecretID generated a non-UUID secret primary key, which the Ent secret store rejects; it now derives a deterministic UUIDv5. go build ./... 
green; entadapter and storetest suites green. NOTE: hub/secret/fixturegen suites now COMPILE but many tests still fail because their fixtures seed non-UUID string IDs that the UUID-PK Ent schema rejects; addressed in follow-up commits (tid() helper). * test(hub): map non-UUID fixture IDs to UUIDs via tid() helper Wrap human-readable test identifiers in tid() (deterministic UUIDv5) so the UUID-PK Ent store accepts them while preserving cross-reference consistency and ID-equality assertions. Reduces pkg/hub failures from 611 to 79; remaining failures are behavioral, not ID-format, and are addressed separately. * fix(store): seed maintenance ops in Migrate; initStore uses Migrate Restore raw-SQL parity: CompositeStore.Migrate now runs AutoMigrate and seeds built-in maintenance operations (the raw store seeded these in its migrations). initStore and hub test helpers call s.Migrate() so production and tests seed consistently. Fixes the maintenance-operation hub tests (404 'Operation not found'). pkg/hub failures 79 -> 71. * test(hub): satisfy Ent NotEmpty validators in fixtures Add slugs/broker names to test fixtures that previously relied on the raw store's lenient (no-validator) inserts: project/agent slugs in the logs test helper, broker slugs in embedded/profile/authz fixtures, and BrokerName on envgather ProjectProvider literals. pkg/hub failures 71 -> 57. * fix(entadapter): Get-by-id returns ErrNotFound for non-UUID identifiers Restore raw-SQL store parity: a malformed identifier cannot match any UUID primary key, so get-by-id lookups now report store.ErrNotFound instead of store.ErrInvalidInput. This matches the raw store (a lookup with a bad id simply returned no row) and is what callers depend on — e.g. resolveTemplate passes a template *name* to GetTemplate and relies on ErrNotFound to fall back to slug-based resolution. New parseGetID helper applied across all 17 get-by-id methods. pkg/hub failures 56 -> 40; entadapter/storetest stay green. 
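A tid()-style helper that maps readable fixture names to deterministic UUIDv5 values can be written with the standard library alone. The sketch below is not the test suite's actual helper: the choice of the RFC 4122 DNS namespace is an assumption made for illustration, and the real helper may use a different namespace or a UUID library.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
)

// namespace is the RFC 4122 DNS namespace UUID
// (6ba7b810-9dad-11d1-80b4-00c04fd430c8), chosen here only for illustration.
var namespace = [16]byte{
	0x6b, 0xa7, 0xb8, 0x10, 0x9d, 0xad, 0x11, 0xd1,
	0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8,
}

// tid derives a name-based (version 5) UUID: SHA-1 over namespace+name,
// truncated to 16 bytes, with the version and variant bits set.
func tid(name string) string {
	h := sha1.New()
	h.Write(namespace[:])
	h.Write([]byte(name))
	sum := h.Sum(nil)

	var u [16]byte
	copy(u[:], sum[:16])
	u[6] = (u[6] & 0x0f) | 0x50 // version 5
	u[8] = (u[8] & 0x3f) | 0x80 // RFC 4122 variant

	s := hex.EncodeToString(u[:])
	return s[0:8] + "-" + s[8:12] + "-" + s[12:16] + "-" + s[16:20] + "-" + s[20:32]
}
```

Because the mapping is deterministic, `tid("agent-1")` yields the same UUID wherever it is used, which keeps cross-references and ID-equality assertions in fixtures consistent.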
* test(hub): fix store-less id wraps and project-route URL paths - controlchannel_client_test: revert tid() wraps (store-less path-builder test; IDs must match the expected literal paths). - github/envgather: project-scoped route handlers resolve the project by UUID id, so build paths with tid(rawID) via fmt.Sprintf instead of the old raw-id literal. pkg/hub failures 40 -> 32. * test(hub): unwrap projectIDFromServiceAccountEmail expectation The tid() sweep over-wrapped a non-ID expected value in a pure-function test; restore the literal GCP project id. * fix(ent): GCPServiceAccount.project_id is a string, not a UUID The GCP service account project_id holds the GCP *cloud project* identifier (e.g. 'my-project-123'), a free-form string — not a UUID. The schema declared it field.UUID, so entadapter CreateGCPServiceAccount/Update did parseUUID(sa.ProjectID) and rejected real GCP project ids, breaking SA mint/create with a 400 in production (storetest masked it by passing a UUID). Change the schema field to field.String, regenerate Ent, and store/read project_id as a string in external_store.go. Fixes ~7 hub GCP tests; pkg/hub 31 -> 23. * test(hub): fix GCP SA project-id assertion and project-settings id Unwrap the over-wrapped 'my-project' expectation now that project_id is a string, and wrap the dynamic project-settings project ID with tid(). * test(hub): revert tid() over-wraps in store-less events_test events_test exercises the in-memory ChannelEventPublisher directly; its ProjectID/IDs are subject-string components, not stored UUIDs. The tid() sweep wrongly rewrote them so published subjects no longer matched the subscriptions (timeouts). Restore the literal values. pkg/hub 19 -> 12. 
* test(hub): fix maintenance-run path and notifications agentId queries Use tid() UUIDs in the maintenance run-detail path and the notifications agentId query params; guard list indexing with require.Len so a mismatch fails cleanly instead of panicking (panics truncate the package run). * test(hub): wrap remaining fixture IDs revealed after panic-cascade cleared Panics ([0] on empty lists) had been truncating the package run, hiding many failures and starving the tid() sweep. With those guarded, sweep the newly reached tests: wrap dynamic rune-suffix IDs and the setupProjectWithBroker / seedCreatedAgentForHarnessTest helper IDs, and convert raw query-param project IDs to tid(). No UUID-parse errors remain in pkg/hub. * test(hub): unwrap tid() in scheduler_test (mock store, raw ids) scheduler_test uses an in-memory mockScheduledEventStore, not the Ent store, so its ids need no UUIDs; the erroneous tid() wraps broke raw getEvent lookups and caused a nil-pointer panic that truncated the package run. * fix(ent): Template.harness may be empty (raw-store parity) A template imported from a directory that declares no harness type has an empty harness; the raw-SQL store stored it, but the Ent NotEmpty validator made BootstrapTemplatesFromDir silently skip such templates. Drop NotEmpty and regenerate. Removing the [0]-on-empty panics this caused un-truncates the hub package run (true failure count now visible). * test(hub): wrap dynamic fixture IDs in wake/workspace/signing-key tests Wrap tid() around the wake_test, setupWorkspaceProject, and empty-value signing-key secret IDs now reachable after panic removal. No panics in the hub package run. * test(hub): convert raw-id URL path segments to tid() Build GET/PUT/DELETE paths for agents/projects/brokers/templates/harness-configs and workspace sync routes from tid(rawID) so the by-id handlers resolve the entity (raw ids no longer match the UUID PKs). pkg/hub 93 -> 80. 
* fix(hub): seed creator users for agent-created agents; cascade-delete subscriptions on hard agent delete
* test(hub): seed broker slug/name in dispatcher and project_cache fixtures (Ent validators)
* fix(entadapter): cascade-delete agents on project delete (raw-store parity); test(hub): seed FK users, broker_name, deterministic UUIDs
* test(hub): MaxOpenConns=1 for SQLite test store (serialize writes); tid() URLs + FK user seeds in events/stopall
* test(hub): unwrap over-wrapped tid() in unit tests (workspace/logfilter/gcp/web); valid-UUID NotFound cases; tid() scheduled-event URLs
* fix(ent): allow empty display_name (raw-store NOT NULL parity, email fallback); test(hub): seed FK owner users, UUID policy/broker/agent IDs in authz remediation
* feat(migrate): add Migration β tool (Ent-SQLite → Ent-Postgres)

Implements 'scion server migrate --from sqlite://... --to postgres://...' per postgres-strategy.md §7.3.

- entc.OpenSQLiteReadOnly: opens source with PRAGMA query_only=ON (no WAL write), MaxOpenConns=1 so the source is never mutated.
- entc.MigrateData: generic reflection-based, dependency-ordered copy of all 30 Ent entities (FK-ordered core first), idempotent (skips rows whose PK already exists), atomic per entity (txn), chunked CreateBulk, source/dest row-count verification after each entity, plus the Group.child_groups M2M edge. FK columns are plain fields so edges are preserved via setters.
- cmd/server migrate: DSN parsing (sqlite://, file:, bare path; postgres URL or keyword form), --keep-source default / --drop-source cutover, progress logging.

Verified end-to-end against live CloudSQL Postgres 16 (integration test + real CLI run): full copy, idempotent re-run, FK + M2M + value round-trips, --drop-source removal.
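The idempotent, chunked copy described for the data migration can be sketched in memory. This is a simplified illustration, not `entc.MigrateData`: `row` and `copyRows` are hypothetical, a map stands in for the destination table, and each chunk corresponds loosely to one bulk insert.

```go
package main

import "errors"

// row is a hypothetical record keyed by its primary key.
type row struct {
	ID    string
	Value string
}

// copyRows copies src into dst chunk by chunk, skipping rows whose ID
// already exists so that a re-run inserts nothing. It returns the number
// of rows inserted and verifies that every source row is present afterwards.
func copyRows(src []row, dst map[string]row, chunkSize int) (int, error) {
	if chunkSize <= 0 {
		return 0, errors.New("chunkSize must be positive")
	}
	inserted := 0
	for start := 0; start < len(src); start += chunkSize {
		end := start + chunkSize
		if end > len(src) {
			end = len(src)
		}
		for _, r := range src[start:end] {
			if _, exists := dst[r.ID]; exists {
				continue // idempotent: leave existing rows untouched
			}
			dst[r.ID] = r
			inserted++
		}
	}
	// Post-copy verification, standing in for the row-count check.
	for _, r := range src {
		if _, ok := dst[r.ID]; !ok {
			return inserted, errors.New("verification failed: missing row " + r.ID)
		}
	}
	return inserted, nil
}
```

Running the copy twice against the same destination inserts rows only the first time, which is the property that makes an interrupted migration safe to resume.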
* feat(concurrency): dialect-aware multi-replica primitives for Postgres (P3-3..6) Add cluster-coordination primitives so N stateless hub processes can share one Postgres, each degrading to a no-op on single-writer SQLite: - store.AdvisoryLocker + entadapter TryAdvisoryLock (pg_try_advisory_lock on a dedicated conn); Scheduler.RegisterRecurringSingleton gates the heartbeat, stalled, purge, schedule-evaluator and github-health sweeps to one replica/tick. - store.ScheduledEventClaimer + ClaimScheduledEvent atomic claim; fireEvent claims one-shot events before side effects (dedup across replica startup recovery). - CompositeStore.RunSerializable: SERIALIZABLE + retry on 40001/40P01 (single run on SQLite) for future multi-row invariants. - dbmetrics.StartPoolSampler feeds DB connection-pool gauges to the P0-5 scaffold; wired into StartBackgroundServices via SetDBMetrics. Verified existing primitives correct (agent StateVersion CAS, FOR UPDATE sweeps, notification atomic dispatch). Found and documented the schedule SKIP LOCKED early-commit gap (lock released before the status transition), closed by the singleton evaluator. Audit + budget docs in scratchpad. Tests: locking_test.go (advisory no-op, serializable, claim exactly-once incl. 8-way concurrent), pool_sampler_test.go. * feat(hub): widen events to EventPublisher interface + Postgres LISTEN/NOTIFY publisher P3-7: Decouple call sites from the concrete *ChannelEventPublisher. - Add Subscribe(patterns...) (<-chan Event, func()) to the EventPublisher interface; implement it on noopEventPublisher (nil channel) — *ChannelEventPublisher already had it. - Factor the Publish* methods into a shared eventBuilder (sink func) so every backend emits identical subjects/payloads; ChannelEventPublisher embeds it. - web.go (field + SetEventPublisher), messagebroker.go and notifications.go (field + constructor) now take EventPublisher; handlers_messages.go gates SSE on "not the no-op publisher" instead of a concrete type assertion. 
P3-8: PostgresEventPublisher over pgx LISTEN/NOTIFY (cross-replica delivery). - Per-grove channels plus a global channel (flat exact-match); event type in the JSON envelope. Grove-scoped subjects publish to both the grove channel and the global channel; subscriptions group their patterns by resolved channel so an event is matched only against patterns that opted into the arriving channel (no double delivery). - 8 KB NOTIFY limit handled by reference-and-refetch via scion_event_payloads (TTL-swept so every replica can refetch). - PublishTx enrolls the NOTIFY in a caller transaction (atomic write+publish; rollback => no deliver). Delivery flows exclusively through the listener. - Listener goroutine reconnects with backoff and re-LISTENs (resubscribe); dynamic LISTEN/UNLISTEN applied on a poll (WaitForNotification timeout does not invalidate the pgconn connection). - Emits pkg/observability/dbmetrics signals (published/delivered/dropped, payload size, publish->deliver latency, reconnects, pool stats). - cmd: newEventPublisher selects the backend by database driver (postgres => PostgresEventPublisher, else ChannelEventPublisher) with safe fallback. Tests: routing/registry/payload-offload/metrics/transactional-executor unit tests run without a DB; cross-replica delivery, oversized round-trip, transactional rollback, and reconnect+resubscribe are gated behind SCION_TEST_POSTGRES_DSN. go build ./... green; full pkg/hub suite green. Note: server.go's equivalent type-assertion cleanup is left in the working tree (co-edited with concurrent P0-5/scheduler work) and is functionally optional — HEAD server.go already compiles against the widened interface. * test(store): parameterize store suites over {sqlite, postgres} (P3-2) Add pkg/store/enttest: a backend-selecting Ent client factory for the store test suites. 
Default is in-memory SQLite; built with -tags integration and SCION_TEST_POSTGRES_URL set, it provisions a per-package ephemeral Postgres database (created/dropped via TestMain) and isolates each test in its own schema (search_path) so tests never observe each other's rows. Falls back to SQLite when the env var is unset. Route all entadapter and storetest helpers through enttest.NewClient so the same CRUD-parity oracle runs unchanged against either backend. Fix two real Postgres bugs surfaced by the new path: - entadapter/dialect.go ancestryContains: emit the bind parameter via Builder.Arg ($n on Postgres) instead of a literal '?' through ExprP, which was not rebound and produced a syntax error; and use jsonb_array_elements_text (the column is jsonb on Postgres, not json). - schedule_store_test ClaimPath: make the concurrent-claim assertion backend-aware. SQLite serializes (MaxOpenConns=1, no SKIP LOCKED) so every caller sees both due rows; Postgres uses FOR UPDATE SKIP LOCKED so concurrent callers may observe a disjoint subset (0..2) and must only never error or exceed 2. Verified: full SQLite suite green; storetest CRUD parity green on CloudSQL Postgres; entadapter green on Postgres (schedule ClaimPath fix confirmed). * fix(hub): harden Postgres event publish + verify wiring; lower PG pool default Task 1 — LISTEN/NOTIFY publish path: - Add TestPostgresIntegration_HandlerCreateProjectEmitsNotify: drives the real POST /api/v1/projects handler with a PostgresEventPublisher and asserts a pg_notify lands on scion_ev_global via an independent raw LISTEN — the exact capability the multi-replica live test probed. Verified PASSING against live CloudSQL, proving the handler -> s.events -> pg_notify wiring is correct end to end (the four pre-existing SCION_TEST_POSTGRES_DSN integration tests also pass). The multi-hub 'no NOTIFY' symptom was not reproducible against the current tree. - Bound the autocommit publish (Publish* methods) with publishTimeout (5s). 
These run synchronously on the caller's (request handler) goroutine and acquire from the event pool; on a connection-starved instance that acquire could block indefinitely, stalling CRUD and silently never emitting NOTIFY. The timeout converts that into a logged error + dropped event (publishing is fire-and-forget). PublishTx (transactional path) is unaffected. Task 2 — connection budget: - Lower the default Postgres MaxOpenConns 20 -> 10 so multiple replicas fit a modest connection budget (see CONNECTION-BUDGET.md). CloudSQL instance scion-postgres-test resized db-f1-micro -> db-g1-small and max_connections set to 100 (out of band). * test(store): add Postgres stress/integration suite (contention, isolation, pool, NOTIFY, migration, schema, multi-process) Add pkg/store/integrationtest/: a Postgres-only suite that exercises behavior the SQLite parity suites cannot reach. Gated by //go:build integration and SCION_TEST_POSTGRES_URL; skips cleanly otherwise. Coverage: - Contention: state_version CAS race (no lost updates, >=N-1 retries, final version==1+N), SKIP LOCKED / conditional-UPDATE event claim (single winner + disjoint drain), unique-key races (project slug, user email, agent slug). - Isolation: SERIALIZABLE conflict + RunSerializable retry recovery, REPEATABLE READ no-phantom snapshot, READ COMMITTED dirty-read prevention. - Pool: exhaustion + queued recovery, saturated pool honoring context deadline, long txn not starving short queries, healing after pg_terminate_backend. - LISTEN/NOTIFY: ordered burst no-drop, 8000B payload limit, listener reconnect/resume, cross-channel isolation. - Migration: 1000+ row counts + bounded-memory listing, idempotent re-migration. - Schema: NULL semantics, unicode/emoji, nested JSON + special chars, large-text non-truncation, TIMESTAMPTZ microsecond precision. - Multi-process: forks the test binary for cross-process advisory-lock exclusivity and cross-process NOTIFY delivery. 
Configurable concurrency via SCION_TEST_CONCURRENCY (default 10). Extend pkg/store/enttest with Active() and NewSchemaURL() so tests can open custom-pool clients and share a DSN with forked child processes; non-integration stubs keep the package API stable. * fix(db): recycle stale conns + keepalives; skip singleton tick on lock error Stale-connection pool stalls (CloudSQL drops idle conns after ~10m): - Add ConnMaxIdleTime to DatabaseConfig/PoolConfig (default 5m pg, 0 sqlite) and apply SetConnMaxIdleTime on the database/sql pool. - OpenPostgres now parses the DSN with pgx and opens via stdlib.OpenDB with TCP keepalive GUCs (idle 60s / interval 15s / count 4) and a 10s connect timeout, so a silently-dropped peer is detected instead of the first query after idle hanging on a dead socket. - pgx event pool (events_postgres.go): set keepalives + connect timeout on both the pool's ConnConfig and the dedicated listener connection, plus MaxConnIdleTime 5m / MaxConnLifetime 30m. Advisory-lock leader election (scheduler.go): - A lock-acquisition error no longer falls open to running the handler unguarded (which would duplicate singleton work across replicas); the tick is skipped and retried next interval. Added regression tests. Test harness (enttest/integrationtest): - Accept libpq keyword/value DSNs (not just URL form) when deriving the ephemeral db/schema/params; add WithConnParam helper. - Fix migration idempotency test's per-pass row-count expectation. * fix(store): bound advisory-lock conn checkout + unlock with short timeout TryAdvisoryLock checked a connection out of the pool and ran the unlock on the full 55s scheduler-handler context (acquire) and an unbounded context.Background() (release). 
On a pool that could not promptly serve a healthy connection, db.Conn() blocked for the entire 55s before failing with 'context deadline exceeded' on every tick; with several singleton handlers firing each 60s tick, those long-blocked goroutines and their pending pool connection requests piled up across ticks and kept the pool jammed (checked out client-side, idle server-side). The unbounded unlock was a second leak vector: if the held connection died mid critical-section, ExecContext could hang forever, so conn.Close() never ran and the connection leaked out of the pool permanently. Bind both the acquire (db.Conn + pg_try_advisory_lock) and the release (pg_advisory_unlock) to a 5s timeout so a bad tick fails fast and retries next tick instead of parking a goroutine for ~55s, and so a dead connection can never block release from freeing the conn. Lock semantics are unchanged: cancelling the acquire context tears down only that context, not the checked-out session that holds the lock. * feat(migrate): in-process migration α (legacy raw-SQL hub.db → Ent) Upgrade a legacy raw-SQL Hub database (the ~53-migration, 30-table schema from the removed pkg/store/sqlite store) to the consolidated Ent-backed SQLite schema, in-process on first boot, behind an automatic backup. pkg/ent/entc/migrate_alpha.go: - IsLegacyRawSQLSchema: detect via the schema_migrations sentinel + the legacy-only agents.agent_id column (no-op for an Ent/empty/absent file). - MigrateAlphaSQLite: backup (checkpoint WAL + copy to hub.db.bak.<ts>), AutoMigrate a fresh Ent schema, ATTACH the legacy file, copy every table with INSERT…SELECT (foreign_keys OFF), verify per-table row counts, then atomically swap the migrated file into place. - Data-driven column mapping (created_at→created, updated_at→updated, agents.agent_id→slug, policies→access_policies); bespoke SQL for the group_members/policy_bindings polymorphic splits and surrogate ids; groups.parent_id→group_child_groups edge. 
- Deterministic UUIDv5 remap for legacy non-UUID primary keys (internal signing-key secrets; plugin runtime-broker ids) with consistent rewrite of every foreign-key reference via a TEMP _id_remap table. - Tolerates missing legacy tables (older schema versions). cmd/server_foreground.go: detect + migrate in initStore's sqlite path, with a --no-auto-migrate operator opt-out (cmd/server.go). Validated end-to-end against four production hub.db files (scion-integration, -integration2, -demo, -gteam): exact row-count parity (up to ~19k rows), every entity reads back through the live Ent store, idempotent re-runs, and broker FK references resolve post-remap. Pre-existing dangling agent created_by/owner_id refs are faithfully preserved (loader runs FK-off). * fix(config): apply real Postgres pool size (leaked SQLite default of 1 starved the pool) The struct-level default for Database.MaxOpenConns/MaxIdleConns is 1 — the value SQLite REQUIRES to serialize writes. applyDatabasePoolDefaults only bumped postgres to a real pool when the value was <= 0, but a postgres deployment configured via env/driver override inherits the embedded default of 1, so the guard never fired and the Ent pool ran with a SINGLE connection. Effect in production (both integration hubs): every singleton scheduler tick checks out the lone pool connection to hold its advisory lock, then blocks waiting for a second connection to do its work — a self-deadlock that resolves only at the 55s handler context deadline. All API requests serialize behind the one connection, so GET /api/v1/* served in ~55s across the board. Note env overrides could not paper over this: envKeyToConfigKey splits on every underscore, so SCION_SERVER_DATABASE_MAX_OPEN_CONNS maps to database.max.open.conns, not database.max_open_conns — silently ignored. Treat the leaked SQLite default (<= 1) as 'unset' for postgres so the pool default (10) applies; explicit sizing of 2+ is still respected. SQLite remains pinned to 1. 
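The sizing rule described here has three cases, which a small function can make concrete. This is a sketch under stated assumptions: `effectiveMaxOpenConns` is a hypothetical name and the driver strings are illustrative; only the thresholds (`<= 1` treated as unset, default of 10, SQLite pinned to 1) come from the commit message.

```go
package main

import "fmt"

// defaultPostgresMaxOpenConns is the Postgres pool default named in the commit message.
const defaultPostgresMaxOpenConns = 10

// effectiveMaxOpenConns is a hypothetical stand-in for the pool-sizing rule.
// It is not the project's applyDatabasePoolDefaults.
func effectiveMaxOpenConns(driver string, configured int) int {
	switch driver {
	case "sqlite":
		// SQLite needs a single connection to serialize writes.
		return 1
	case "postgres":
		// A value of 1 or less is treated as the leaked SQLite default, i.e. unset.
		if configured <= 1 {
			return defaultPostgresMaxOpenConns
		}
		// Explicit sizing of 2 or more is respected.
		return configured
	default:
		return configured
	}
}

func main() {
	fmt.Println(effectiveMaxOpenConns("postgres", 1))
	fmt.Println(effectiveMaxOpenConns("postgres", 4))
	fmt.Println(effectiveMaxOpenConns("sqlite", 10))
}
```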
Adds regression tests for all three cases. * feat(hub): per-process instanceID on Server (B1-1) Add a unique per-process instanceID to Server, generated at construction via uuid.NewString(). Optionally prefixed with POD_NAME env var for log readability, but uniqueness is always guaranteed by the UUID. This ID serves as the affinity key for broker dispatch (design §4.1) and is intentionally distinct from config.ResolveHubID, which is shareable across replicas. * feat(schema): affinity columns on runtime_brokers (B1-2) Add 3 nullable fields to the runtime_brokers ent schema and store model for tracking which hub instance holds the control-channel socket: - connected_hub_id (TEXT, optional/nullable) - connected_session_id (TEXT, optional/nullable) - connected_at (TIMESTAMPTZ, optional/nullable) Dialect-neutral (no Postgres-only annotations) — AutoMigrate works on both SQLite and CloudSQL Postgres per postgres-strategy.md §6.4. Wire the fields through the ent<->store conversion code in both directions (entBrokerToStore, CreateRuntimeBroker, UpdateRuntimeBroker). Regenerated ent code included. * feat(store): Claim/Release runtime-broker affinity CAS methods (B1-3) Mirrors UpdateRuntimeBrokerHeartbeat's lock_version CAS loop. - ClaimRuntimeBrokerConnection: newest-wins, sets affinity + status=online + heartbeat in one write - ReleaseRuntimeBrokerConnection: compare-and-clear, returns cleared=false (no-op) if affinity moved (disconnect-race fix) Tests cover claim/overwrite/clear/no-op + A->B flap (design 9.4). * fix(hub): thread sessionID through connect + fix onDisconnect clobber race (B1-4, B1-5) B1-4: HandleUpgrade returns sessionID; markBrokerOnline(brokerID, sessionID) now calls ClaimRuntimeBrokerConnection(brokerID, instanceID, sessionID), recording affinity + online + heartbeat in one CAS write. 
B1-5: SetOnDisconnect callback gains sessionID; the handler compare-and-clears via ReleaseRuntimeBrokerConnection and skips the offline stamp when affinity has moved (flap). removeConnection now only removes/fires for the matching session, so an old connection's teardown can't drop a newer live socket. * feat(schema): broker_dispatch intent table + messages dispatch-state (B2-1, B2-2) B2-1: new BrokerDispatch ent entity (table broker_dispatch) — id, broker_id, agent_id(null), agent_slug, project_id(null), op, args(JSON), state, result, claimed_by, attempts, error, created_at/updated_at, deadline_at(null); index (broker_id,state). store.BrokerDispatch model + state constants. B2-2: messages.dispatch_state (default 'pending') + dispatched_at; wired through store.Message + entadapter conversion/create. Dialect-neutral. * feat(hub): PostgresCommandBus LISTEN/NOTIFY signal listener on scion_broker_cmd (B2-4) Introduce a CommandBus interface and PostgresCommandBus implementation that listens on the new global channel scion_broker_cmd for broker dispatch wakeup signals. This is a sibling of PostgresEventPublisher, reusing the same connect/reconnect/keepalive helpers but maintaining its own independent pgx connection and pool (design §5.1). Key components: - PostgresCommandBus: LISTEN loop with backoff-reconnect on its own dedicated connection; filters signals by local broker ownership via an injected ownsLocally func (wired to ControlChannelManager.IsConnected); invokes an injected onSignal reconcile callback (to be wired to the reconcile drain in B2-5). - NotifyBrokerCmd: issues NOTIFY inside the caller's transaction so the signal commits atomically with the durable intent row (mirrors PublishTx). - NoopCommandBus: safe no-op for the SQLite backend (single-process, all brokers are local). - Backend selection in newCommandBus mirrors newEventPublisher: Postgres driver → PostgresCommandBus; otherwise → NoopCommandBus. 
- Server.SetCommandBus/CommandBus() setter/getter; cleanup in both Shutdown and CleanupResources paths. * feat(store): BrokerDispatch store methods + message dispatch CAS (B2-3) BrokerDispatchStore: Insert/Claim(CAS pending->in_progress)/Complete/Fail/ ListPendingDispatch + MarkMessageDispatched(CAS)/ListPendingMessages (via agent runtime_broker_id). Wired into CompositeStore + store.Store. Tests: concurrent claim single-winner (exactly-once), drain pending-only, message CAS dedupe, complete/fail transitions, pending-messages-by-broker-agent. * feat(hub): reconcile-on-connect drain wired to bus + markBrokerOnline (B2-5) Server.reconcileBroker drains pending broker_dispatch rows (CAS-claim -> exec -> done/fail) and pending messages (CAS MarkMessageDispatched -> deliver) for a broker this node owns. Exactly-once via store CAS; idempotent + concurrent-safe. Wired as durability backstop into markBrokerOnline (async on reconnect) and as the command-bus signal handler (SetOnSignal -> ReconcileBroker). Op executors are seams (executeDispatch/deliverMessage) that Phase 3/4 fill with local tunnel ops. * feat(hub): route() decision in HybridBrokerClient (B3-1) routeLocal (IsConnected, unchanged fast path) | routeForward (affinity owner alive) | routeHTTP (broker endpoint set) | routeUndeliverable. Affinity is a hint only (StoreAffinityLookup over connected_hub_id + last_heartbeat freshness), injectable for testing. Not yet wired into dispatch (B3-2 wires message path). Table-driven tests over all branches incl. local-precedence + nil-affinity. * feat(hub): cross-node message dispatch via route()+intent+signal+owner drain (B3-2, B3-3) Route-gate the message send path: HybridBrokerClient.MessageAgent now uses route(brokerID, endpoint) to decide delivery. routeLocal and routeHTTP follow existing paths unchanged. routeForward/routeUndeliverable return ErrMessageDeferred — the message row (already persisted with dispatch_state=pending) is the durable intent. 
All call sites (handleAgentMessage, set[], broadcastDirect, messagebroker, notifications, scheduler) catch the sentinel, emit a best-effort NOTIFY wakeup via SignalBrokerCmd, and return 202 Accepted (or log as deferred). Fill the deliverMessage seam in reconcile.go: resolves the agent from the message's AgentID, obtains the dispatcher, and calls DispatchAgentMessage for local tunnel delivery. reconcileBroker already CAS-marks dispatched before calling this. Wire SetAffinityLookup(StoreAffinityLookup(store, 0)) on the HybridBrokerClient in CreateAuthenticatedDispatcher so route() can return routeForward when another node owns the broker. Add SignalBrokerCmd to the CommandBus interface — a best-effort NOTIFY using the bus's own pool, used by the message path where the durable intent is the message row itself and the NOTIFY is only a wakeup hint. * feat(hub): lifecycle dispatch (rolling-timeout wait + cross-node start/stop/restart) (B4-1, B4-2) B4-1: Rolling-timeout wait helper (dispatch_wait.go) - waitForAgentTransition subscribes to agent.<id>.status events and loops with a rolling window (dispatchRollingTimeout=90s) that resets on ANY AgentStatusEvent (phase/activity/detail change). - Terminal phase → return phase, nil. Window expiry → ErrDispatchFailed. Context cancellation → ctx.Err(). - Caller subscribes BEFORE writing intent, passes the channel + unsub. B4-2: Cross-node start/stop/restart dispatch - Route-gated HybridBrokerClient.StartAgent/StopAgent/RestartAgent exactly like MessageAgent: routeLocal → control-channel tunnel (unchanged fast path), routeHTTP → HTTP fallback, routeForward/routeUndeliverable → ErrLifecycleDeferred. - Dispatch args structs (dispatch_args.go): StartDispatchArgs captures task, resolvedEnv, resolvedSecrets, inlineConfig, sharedDirs, sharedWorkspace, projectPath, projectSlug, harnessConfig. RestartDispatchArgs captures resolvedEnv. StopDispatchArgs is empty. All JSON-serializable for broker_dispatch.args column. 
- Owner-side executeDispatch (reconcile.go): start/stop/restart cases deserialize args, load agent from store, call local DispatchAgentStart/Stop/Restart via the dispatcher. Unknown ops (delete, finalize_env, etc.) still fail cleanly for B4-3/B4-4. Tests: waitForAgentTransition (terminal, error, rolling reset, silence expiry, context cancel, unsub); route-gating of Start/Stop/Restart returns ErrLifecycleDeferred when non-local; executeDispatch lifecycle cases invoke the local dispatcher; args round-trip (serialize→deserialize) is lossless; reconcile end-to-end lifecycle path. * feat(hub): wire originator-side cross-node lifecycle dispatch (B4-2 complete) The originator-side orchestration was missing: ErrLifecycleDeferred was returned by HybridBrokerClient but nothing caught it. Now the full cross-node start/stop/restart flow works transparently to all handler call sites. Originator side (HTTPAgentDispatcher): - DispatchAgentStart/Stop/Restart catch ErrLifecycleDeferred after env/secret resolution and invoke deferredLifecycle: 1. Subscribe("agent.<id>.status") BEFORE writing intent 2. InsertBrokerDispatch{op, agent_id, broker_id, args} 3. Best-effort SignalBrokerCmd (row is durable backstop) 4. waitForAgentTransition with terminal set per op 5. Return nil on success, error on error-phase/timeout - SetCrossNodeDeps(events, commandBus) wired in server.go's getOrCreateDispatcher, so all handler call sites get cross-node for free with synchronous semantics preserved. - Local path (routeLocal) is unchanged at zero added latency — no subscribe, no intent row, no wait. Args decision: owner RE-RESOLVES env/secrets via DispatchAgentStart (all hub instances share the same store + secret backend), so StartDispatchArgs carries only {Task}. RestartDispatchArgs and StopDispatchArgs are empty. This avoids serializing potentially large env/secrets into the DB while remaining correct because all hubs read from the same shared store. 
waitForAgentTransition refactored to a standalone function (no Server receiver) so the dispatcher can call it directly. Tests: - TestDeferredStart_WritesIntentAndWaits: deferred start writes a broker_dispatch row, waits, returns success on "running" event - TestDeferredStart_ReturnsErrorOnErrorPhase: error phase → error - TestLocalStart_SkipsIntentRow: local path calls tunnel directly, no intent row written - All existing tests pass (no regressions) * fix(hub): make web session replica-portable to fix OAuth state_mismatch OAuth login behind the load balancer intermittently failed with state_mismatch: the CSRF state token (and the entire web session) was stored in a gorilla FilesystemStore on the handling replica's local disk, while the browser only carried a session-ID cookie. When the LB routed /auth/login and /auth/callback to different replicas, the callback replica had no matching session file -> empty state -> state_mismatch. It only "worked" when both hops happened to hit the same backend. The same flaw affected the post-login session: sessionToBearerMiddleware reads the Hub access/refresh JWTs from that disk-local store on every API request, so sessions silently dropped whenever a follow-up request landed on a different replica. Replace the FilesystemStore with an encrypted, signed gorilla CookieStore so the whole session lives in the client's cookie and any replica sharing SESSION_SECRET can read it. Keys are derived deterministically from SESSION_SECRET (32-byte HMAC auth key + 32-byte AES-256 encryption key, domain-separated). No DB, no migration; works with N replicas. The original switch to disk was motivated by a "JWT tokens exceed 4096 bytes" concern. Measured against the current compact HS256 tokens the full session (identity + access + refresh) encodes to ~2.6 KB, well under the browser's ~4 KB per-cookie cap, so the securecookie length limit is left in force (oversize would now error+log, not silently drop). 
Tests: replace the obsolete NoMaxLengthLimit test with a cross-replica round-trip regression test (cookie minted by replica A decodes on replica B with the same secret; carries OAuth state + post-login tokens) plus a negative test (a different secret cannot decode the cookie). * feat(hub): cross-node delete + create-time data ops dispatch (B4-3, B4-4) Route-gate HybridBrokerClient.DeleteAgent, CheckAgentPrompt, CreateAgentWithGather, and FinalizeEnv through route() so routeForward/routeUndeliverable return ErrLifecycleDeferred (matching start/stop/restart pattern from B4-2). B4-3 (delete dispatch): - deferredDelete on ErrLifecycleDeferred: subscribe broker.dispatch.<id>.done → InsertBrokerDispatch{op:delete} → SignalBrokerCmd → waitForDispatchDone (reads DB row, authoritative). - Owner executeDispatch case "delete": deserializes DeleteDispatchArgs → local DispatchAgentDelete (idempotent, 404 ok). - DeleteDispatchArgs struct + UnmarshalDeleteArgs for args round-trip. B4-4 (create-time data ops): - deferredDataOp/deferredDataOpResult: common originator flow for ops that return results via the dispatch row (design §6.3). Subscribe to broker.dispatch.<id>.done BEFORE writing intent, insert dispatch, signal, waitForDispatchDone, read result from GetBrokerDispatch. - deferredCheckPrompt: returns bool from CheckPromptResult in row. - deferredFinalizeEnv: fire-and-forget via deferredDataOp. - deferredCreateWithGather: returns envRequirements from row result. - Owner executeDispatch cases: check_prompt, finalize_env, create — run local op, marshal result JSON, return it. - PublishDispatchDone on EventPublisher: slim completion event broker.dispatch.<id>.done emitted by reconcile loop on complete/fail. - waitForDispatchDone: event-driven wait with bounded re-read at rolling timeout (missed event recovery, design §6.3). - GetBrokerDispatch added to BrokerDispatchStore interface + entadapter. Local fast path unchanged (routeLocal → zero added latency). 
* feat(hub): stale-affinity + stuck-dispatch reaper singleton (B5-1) * feat(hub): pending-message sweep + dispatch metrics (B5-2) Add observability for the multi-node broker dispatch pipeline: Sweep: - CountStuckPendingMessages store method (messages pending > threshold) - brokerMessageSweepHandler registered as RecurringSingleton with LockBrokerMessageSweep (0x5C100007), runs every 1m Metrics (pkg/observability/dispatchmetrics): - Counters: dispatch published/claimed/done/failed, message dispatched - Gauge: message stuck (pending beyond 5m threshold) - Histograms: intent-to-done latency, reconcile drain duration - Counter: command bus reconnects Emit sites: - InsertBrokerDispatch → IncPublished (httpdispatcher.go) - ClaimBrokerDispatch → IncClaimed (reconcile.go) - CompleteBrokerDispatch → IncDone + RecordDispatchLatency (reconcile.go) - FailBrokerDispatch → IncFailed (reconcile.go) - MarkMessageDispatched → IncMessageDispatched (reconcile.go) - reconcileBroker → RecordReconcileDrainDuration (reconcile.go) - command bus reconnect → IncCmdBusReconnects (command_bus.go) - sweep handler → ObserveMessageStuck (sweep.go) * fix(hub): derive JWT signing keys from shared SESSION_SECRET to fix cross-replica login loop The cookie-store fix (0515e2a8) made the web session replica-portable, but the Hub JWT *inside* the cookie is still signed with a per-replica key: ensureSigningKey scopes signing keys to (scope=hub, scope_id=hubID) and hubID = sha256(hostname)[:12]. The integration env runs two replicas of one logical hub behind a single LB, sharing one Postgres DB and one SESSION_SECRET but with different hostnames -> different hubIDs -> different HS256 signing keys. 
So a user JWT minted on replica A failed signature verification on replica B (go-jose: error in cryptographic primitive); refresh failed too (refresh token signed with the same foreign key), so sessionToBearerMiddleware declared the session irrecoverably invalid, DELETED the cookie (MaxAge=-1) and returned session_expired. The cookie deletion turns it into a redirect loop: dashboard flashes, then /login?error=session_expired. Fix: extend the 0515e2a8 approach (replica-portable via the shared secret) from the cookie to the keys inside it. Add ServerConfig.SharedSigningSecret; when set, ensureSigningKey derives the agent and user signing keys deterministically from it (domain-separated by key name) and bypasses per-host secret-backend storage. cmd feeds the same --session-secret / SESSION_SECRET value into both the web cookie store and the hub config via a new resolveSessionSecret() helper. Empty secret keeps the existing per-hub behavior (no regression for single-node/local dev). Tests: cross-replica round trip (different hubID + same secret -> identical keys, token minted on A validates on B; different secret cannot) plus pre-configured-key precedence. Note: rollout rotates the signing keys (now derived from SESSION_SECRET), so existing web/CLI tokens are invalidated once and users re-login. 
* docs: project log for B5-3 chaos gate — GB5 PASSED (GA gate for broker dispatch) * fix(hub): align fakeHTTPClient.CleanupProject with interface (3 params, not 4) * fix(hub): address PR #305 review feedback - server_migrate.go: use nil-checked deferred close for src DB, and explicitly close src before dropSQLiteFile to prevent Windows sharing violations - server_migrate.go: handle file:// prefix before file: to correctly parse file:///path/to/db URLs - server_foreground.go: evaluate GetControlChannelManager() inside the ownsLocally closure to avoid capturing a stale nil value - server_migrate_test.go: add test case for file:/// URL format - server_test.go: sanitize t.Name() slashes in newTestStore to prevent SQLite path errors in subtests * docs: add project log for PR #305 review feedback fixes * fix(hub): prevent duplicate message delivery, guard dispatch state transitions C1: Call MarkMessageDispatched after successful local dispatch in messagebroker.go and handlers.go (single-recipient, set[], broadcast). Without this, successfully dispatched messages remained dispatch_state=pending and were re-delivered on every broker reconnect via reconcileBroker. C2: Return immediately in messagebroker.go deliverToAgent when CreateMessage fails — without a durable row, a deferred signal has nothing for the owning node to reconcile. C3: Guard CompleteBrokerDispatch and FailBrokerDispatch with state=in_progress CAS predicate so a done dispatch cannot be flipped to failed or vice versa. Update tests to claim before completing/failing to match the new CAS guard. 
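The state guard from C3 can be illustrated with an in-memory compare-and-set. This is a sketch, not the entadapter code: `dispatchStore` is hypothetical, the state names follow the wording of the commit messages, and a mutex stands in for what would be a conditional UPDATE (for example `... WHERE id = ? AND state = 'in_progress'`) in a SQL store.

```go
package main

import (
	"fmt"
	"sync"
)

// dispatchStore is a hypothetical in-memory stand-in for the dispatch table.
type dispatchStore struct {
	mu    sync.Mutex
	state map[string]string
}

// transition moves id from one state to another only if it is currently in
// the expected state; it reports whether the write happened.
func (s *dispatchStore) transition(id, from, to string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.state[id] != from {
		return false
	}
	s.state[id] = to
	return true
}

func (s *dispatchStore) Claim(id string) bool    { return s.transition(id, "pending", "in_progress") }
func (s *dispatchStore) Complete(id string) bool { return s.transition(id, "in_progress", "done") }
func (s *dispatchStore) Fail(id string) bool     { return s.transition(id, "in_progress", "failed") }

func main() {
	s := &dispatchStore{state: map[string]string{"d1": "pending"}}

	fmt.Println("complete before claim:", s.Complete("d1"))
	fmt.Println("claim:", s.Claim("d1"))
	fmt.Println("complete:", s.Complete("d1"))
	fmt.Println("fail after done:", s.Fail("d1"))
}
```

This also shows why the tests had to claim before completing or failing: without the claim, the guarded transition has nothing to match.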
* fix(hub): reconcile broker→eventbus and hub-native→hub-managed renames after rebase

Post-rebase fixups to align the feature branch with main's refactoring:

- broker package → eventbus package rename (types, imports, methods)
- SetRecipient → GroupRecipient, SetMessageResponse → GroupMessageResponse
- hubNativeProjectPath → hubManagedProjectPath
- ProjectTypeHubNative → ProjectTypeHubManaged
- populateAgentConfig gains ctx parameter
- Add missing handleResourcesImport and handleMessageChannels handlers
- Add ListChannels method to MessageBrokerProxy
- Wire newCommandBus in server_foreground.go
- Restore main's test fixtures for renamed APIs

---------

Co-authored-by: scion-gteam[bot] <271067763+scion-gteam[bot]@users.noreply.github.com>
Co-authored-by: Scion <agent@scion.dev>
…A Docker + Model B GKE) (GoogleCloudPlatform#306)
…GoogleCloudPlatform#303)

* fix: atomic session-guarded broker disconnect to prevent reconnect race (GoogleCloudPlatform#131)

The onDisconnect callback previously used separate ReleaseRuntimeBrokerConnection and UpdateRuntimeBrokerHeartbeat calls. When a broker disconnects and reconnects rapidly, the stale disconnect's offline stamp can clobber the new connection's online status because UpdateRuntimeBrokerHeartbeat has no session guard — it unconditionally overwrites status. Provider statuses are also clobbered and never restored by heartbeats, leaving the broker permanently invisible until hub restart.

Add ReleaseAndMarkBrokerOffline which atomically clears affinity AND stamps status=offline in a single CAS write. If a concurrent reconnect has already claimed the broker with a new session, the compare fails and the callback is a no-op. Also add a re-check guard before updating provider statuses.

* docs: add project log for broker disconnect race fix unification
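The session guard described in the commit above might look something like this. It is a sketch under assumptions: `brokerStore`, `brokerRecord`, `Connect` and `ReleaseAndMarkOffline` are illustrative names, and a mutex-protected map replaces the single CAS write against the real database.

```go
package main

import (
	"fmt"
	"sync"
)

// brokerRecord is a hypothetical, reduced view of a runtime broker row.
type brokerRecord struct {
	SessionID string
	Status    string
}

type brokerStore struct {
	mu      sync.Mutex
	brokers map[string]*brokerRecord
}

func newBrokerStore() *brokerStore {
	return &brokerStore{brokers: make(map[string]*brokerRecord)}
}

// Connect claims the broker for a new session and marks it online.
func (s *brokerStore) Connect(brokerID, sessionID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.brokers[brokerID] = &brokerRecord{SessionID: sessionID, Status: "online"}
}

// ReleaseAndMarkOffline clears the session and stamps status=offline in one
// step, but only if the record still belongs to sessionID. It reports
// whether the write was applied.
func (s *brokerStore) ReleaseAndMarkOffline(brokerID, sessionID string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	rec, ok := s.brokers[brokerID]
	if !ok || rec.SessionID != sessionID {
		return false // a newer session owns the broker: stale disconnect is a no-op
	}
	rec.SessionID = ""
	rec.Status = "offline"
	return true
}

// Status returns the current status, or "" for an unknown broker.
func (s *brokerStore) Status(brokerID string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	if rec, ok := s.brokers[brokerID]; ok {
		return rec.Status
	}
	return ""
}

func main() {
	s := newBrokerStore()
	s.Connect("broker-1", "session-a")
	s.Connect("broker-1", "session-b") // rapid reconnect with a new session
	fmt.Println(s.ReleaseAndMarkOffline("broker-1", "session-a"), s.Status("broker-1"))
}
```

Because the ownership check and the offline stamp happen in the same critical section, a late disconnect for the old session cannot overwrite the status written by the reconnect.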
…rm#301) * docs(design): reduced resource clone/delete design (resolved review) * refactor: remove dead Locked field from Template and HarnessConfig models Remove the Locked bool field, all 16 enforcement sites across 6 handler files, the force query parameter from delete endpoints, 3 locked-template tests, and add a DB migration to drop the column. No production code ever set Locked=true — this simplifies the handlers for the upcoming clone/delete feature. * feat: add harness-config clone endpoint, authz hardening, and slug uniqueness - Add handleHarnessConfigClone mirroring template clone - Add CheckAccess authz to deleteTemplateV2, handleTemplateClone, deleteHarnessConfig, handleHarnessConfigClone - Add DB migration V55: UNIQUE constraint on (slug, scope, scope_id) - Return 409 Conflict on slug collision during clone - Add clone failure cleanup - Add tests for clone, authz, and slug collision * feat(web): add Clone/Delete row actions and clone-from-global to resource list - Add Clone and Delete action menu to shared resource-list component - Add delete confirmation dialog with deleteFiles checkbox (default on) - Add clone dialog with name input and 409 collision handling - Add clone-from-global picker in project settings view - Unify on resource-changed event (migrate resource-imported) - Gate actions on capabilities (canClone, canDelete properties) * fix: address PR review — cleanup orphaned files on DB create failure, remove redundant clone method - Add stor.DeletePrefix cleanup when CreateTemplate/CreateHarnessConfig fails after files were already copied (prevents orphaned storage files) - Remove redundant confirmCloneFromGlobal method — confirmClone already handles cross-scope clone via the component's scope/scopeId properties * fix: adapt Locked removal and slug constraint to Ent-based schema Remove Locked references from entadapter, remove stale sqlite.go (replaced by Ent ORM upstream), add UNIQUE(slug, scope, scope_id) to Ent schema indexes, and 
regenerate Ent code. * fix: adapt tests and entadapter for Ent-based store (UUID IDs, no Locked) - Use api.NewUUID() for all test entity IDs (Ent enforces UUID format) - Remove Locked field from entadapter create/update calls - Remove stale sqlite.go (replaced by Ent ORM upstream) - Add UNIQUE(slug, scope, scope_id) to Ent schema indexes
…form#309) * fix(hub): make web session replica-portable to fix OAuth state_mismatch OAuth login behind the load balancer intermittently failed with state_mismatch: the CSRF state token (and the entire web session) was stored in a gorilla FilesystemStore on the handling replica's local disk, while the browser only carried a session-ID cookie. When the LB routed /auth/login and /auth/callback to different replicas, the callback replica had no matching session file -> empty state -> state_mismatch. It only "worked" when both hops happened to hit the same backend. The same flaw affected the post-login session: sessionToBearerMiddleware reads the Hub access/refresh JWTs from that disk-local store on every API request, so sessions silently dropped whenever a follow-up request landed on a different replica. Replace the FilesystemStore with an encrypted, signed gorilla CookieStore so the whole session lives in the client's cookie and any replica sharing SESSION_SECRET can read it. Keys are derived deterministically from SESSION_SECRET (32-byte HMAC auth key + 32-byte AES-256 encryption key, domain-separated). No DB, no migration; works with N replicas. The original switch to disk was motivated by a "JWT tokens exceed 4096 bytes" concern. Measured against the current compact HS256 tokens the full session (identity + access + refresh) encodes to ~2.6 KB, well under the browser's ~4 KB per-cookie cap, so the securecookie length limit is left in force (oversize would now error+log, not silently drop). Tests: replace the obsolete NoMaxLengthLimit test with a cross-replica round-trip regression test (cookie minted by replica A decodes on replica B with the same secret; carries OAuth state + post-login tokens) plus a negative test (a different secret cannot decode the cookie). 
* fix(hub): derive JWT signing keys from shared SESSION_SECRET to fix cross-replica login loop The cookie-store fix (0515e2a) made the web session replica-portable, but the Hub JWT *inside* the cookie is still signed with a per-replica key: ensureSigningKey scopes signing keys to (scope=hub, scope_id=hubID) and hubID = sha256(hostname)[:12]. The integration env runs two replicas of one logical hub behind a single LB, sharing one Postgres DB and one SESSION_SECRET but with different hostnames -> different hubIDs -> different HS256 signing keys. So a user JWT minted on replica A failed signature verification on replica B (go-jose: error in cryptographic primitive); refresh failed too (refresh token signed with the same foreign key), so sessionToBearerMiddleware declared the session irrecoverably invalid, DELETED the cookie (MaxAge=-1) and returned session_expired. The cookie deletion turns it into a redirect loop: dashboard flashes, then /login?error=session_expired. Fix: extend the 0515e2a approach (replica-portable via the shared secret) from the cookie to the keys inside it. Add ServerConfig.SharedSigningSecret; when set, ensureSigningKey derives the agent and user signing keys deterministically from it (domain-separated by key name) and bypasses per-host secret-backend storage. cmd feeds the same --session-secret / SESSION_SECRET value into both the web cookie store and the hub config via a new resolveSessionSecret() helper. Empty secret keeps the existing per-hub behavior (no regression for single-node/local dev). Tests: cross-replica round trip (different hubID + same secret -> identical keys, token minted on A validates on B; different secret cannot) plus pre-configured-key precedence. Note: rollout rotates the signing keys (now derived from SESSION_SECRET), so existing web/CLI tokens are invalidated once and users re-login. --------- Co-authored-by: Scion <agent@scion.dev>
…events (GoogleCloudPlatform#312)

A rapid session.start → session.end sequence from a spurious sciontool could permanently reset an agent's phase even while the agent works normally. This adds two guards:

1. Phase regression guard: rejects transitions that would move an agent backward in its forward-progress lifecycle (e.g. running → starting) in both the status update handler and broker heartbeat handler.
2. Activity-driven phase auto-correction: when an activity that implies the agent is running (working, thinking, executing, etc.) arrives but the phase is pre-running, auto-promotes the phase to running.

Fixes GoogleCloudPlatform#124
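The two guards above could be sketched as a rank table plus two small functions. The phase list here is an assumption (only starting and running appear in the commit message, and "created" is invented for illustration), as are the names `allowTransition` and `correctPhase`.

```go
package main

import "fmt"

// phaseRank orders the forward-progress phases. "created" is a hypothetical
// pre-running phase added for illustration.
var phaseRank = map[string]int{"created": 0, "starting": 1, "running": 2}

// runningActivities lists activities that imply the agent is running.
var runningActivities = map[string]bool{"working": true, "thinking": true, "executing": true}

// allowTransition rejects moves that go backward in the forward-progress
// lifecycle. Phases outside the table are not guarded by this sketch.
func allowTransition(current, next string) bool {
	cr, okC := phaseRank[current]
	nr, okN := phaseRank[next]
	if !okC || !okN {
		return true
	}
	return nr >= cr
}

// correctPhase promotes a pre-running phase to running when the reported
// activity implies the agent is already doing work.
func correctPhase(phase, activity string) string {
	if !runningActivities[activity] {
		return phase
	}
	if r, ok := phaseRank[phase]; ok && r < phaseRank["running"] {
		return "running"
	}
	return phase
}

func main() {
	fmt.Println(allowTransition("running", "starting")) // regression is rejected
	fmt.Println(allowTransition("starting", "running"))
	fmt.Println(correctPhase("starting", "thinking"))
}
```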
…GoogleCloudPlatform#313) Also unset SCION_PROJECT_ID when clearing hub context env vars, since IsHubContext() checks all four env vars and a leftover SCION_PROJECT_ID causes FindProjectRoot() to return a synthetic path instead of failing.
…tform#311) * Fix agent list task overflow and unify action buttons Task cell in list view used inline span styling that silently ignored max-width/overflow constraints, allowing long task text to push action buttons off-screen. Switch to display:-webkit-box with line-clamp:2 so text wraps to at most two lines with ellipsis. Card view action buttons now render icon-only (matching list view), with sl-tooltip and aria-label for accessibility. Both views share a single renderActionButtons helper, eliminating the duplicated button logic. Color-coded hover effects added to action buttons in both views: red for stop/delete, amber for suspend, green for resume/start. Closes GoogleCloudPlatform#134 Closes GoogleCloudPlatform#135 * Fix agent list task overflow and unify action buttons Task cell in list view used inline span styling that silently ignored max-width/overflow constraints, allowing long task text to push action buttons off-screen. Switch to display:-webkit-box with line-clamp:2 so text wraps to at most two lines with ellipsis. Card view action buttons now render icon-only (matching list view), with sl-tooltip and aria-label for accessibility. Both views share a single renderActionButtons helper, eliminating the duplicated button logic. Color-coded hover effects use translucent rgba backgrounds that work in both light and dark mode: red for stop/delete, amber for suspend, green for resume/start. Closes GoogleCloudPlatform#134 Closes GoogleCloudPlatform#135 * Add before/after screenshots for PR review Screenshots captured from the real running app (Vite dev server + fetch mock for agent data). Shows before/after for both issues in light mode and dark mode. * Fix hover on disabled buttons and tooltip on disabled terminal Add :not([disabled]) to hover CSS selectors so color-coded hover effects don't apply to disabled action buttons. 
Wrap the Terminal button in an inline-flex span inside sl-tooltip so the tooltip remains accessible even when the button has pointer-events:none.
* docs(design): auth proxy mode (Google IAP) architecture Add design for an exclusive proxy human-auth mode that derives the user from a verified Google IAP signed header (X-Goog-IAP-JWT-Assertion), reusing the existing domain/allowlist/admin provisioning controls. Also specifies a hub-minted transport-auth layer (dedicated SA, generalizing PR GoogleCloudPlatform#307) so agents can traverse the IAP / Cloud Run-invoker front door, with a generalized array-based token refresh. * refactor(hub): extract provisionUser, dedupe OAuth find-or-create Extract the duplicated find-or-create-user block from four OAuth handlers (handleAuthLogin, handleAuthToken, handleCLIAuthToken, completeOAuthLogin) into a single provisionUser method on Server. The new method encapsulates: 1. Authorization check (isUserAuthorized) with audit logging 2. GetUserByEmail / CreateUser (find-or-create) 3. Profile backfill (DisplayName, AvatarURL when empty) 4. Admin promotion (when admin list changes) 5. Hub membership enrollment (ensureHubMembership) Introduces ExternalUserInfo struct (decoupled from OAuthUserInfo) and ErrAccessDenied sentinel error for caller-side HTTP response mapping. This is Phase 0 of the auth-proxy-mode feature — pure refactor with no behavior change. The proxy middleware (Phase 1) will call the same provisionUser method. NOTE: No suspended-user check is added. The existing OAuth flow does not check user.Status == "suspended" either; adding it here would change behavior. This gap is documented for Phase 1. * docs(project-log): record provisionUser extraction findings * feat(auth): implement proxy auth mode with IAP JWT verification (Phase 1) Add exclusive proxy auth mode for Google IAP signed-header authentication: - pkg/hub/proxyauth.go (NEW): ProxyAuthenticator interface, IAPAuthenticator with ES256 JWT verification via go-jose/v4, JWKS lazy-fetch cache with periodic refresh + on-miss refresh for unknown kids + transient failure tolerance (last-good keys). 
- pkg/config: auth.mode selector (oauth|proxy|dev), auth.proxy section with provider/iap.audience/overrides in both DevAuthConfig (GlobalConfig) and V1AuthConfig (settings.yaml). Wire conversion in both directions. - pkg/hub/auth.go: Replace IP-only extractProxyUser branch with ProxyAuthenticator path. Add 60s resolution cache (ProxyUserCache) wrapping provisionUser — signature verification runs every request, only the store lookup is cached. Legacy extractProxyUser preserved when no authenticator is configured. - pkg/hub/handlers_auth.go: Add suspended-user gate to provisionUser — rejects Status=="suspended" with ErrUserSuspended. This is an intentional behavior change sanctioned by the design doc, closing the pre-existing OAuth suspended-login gap documented in Phase 0. - pkg/hub/web.go: In proxy mode, handleAuthProviders returns no OAuth providers; handleLogout redirects to IAP's clear_login_cookie endpoint. - cmd/server_foreground.go: Construct IAPAuthenticator when mode==proxy && provider==iap, wire into ServerConfig.ProxyAuth. Security: audience binding is mandatory; only the signed JWT assertion is authoritative (X-Goog-Authenticated-User-* headers ignored); clock skew ±30s; JWKS cache handles key rotation and transient fetch failures. 
* test(auth): add comprehensive IAPAuthenticator unit tests Tests using self-generated ES256 key pair + httptest JWKS server: - Valid assertion -> correct ProxyUserInfo (subject/email stripped, lowercased) - Bad signature -> error - Wrong audience -> error (mandatory binding) - Wrong issuer -> error - Expired token (past 30s skew) -> error - Missing header -> (nil, nil) fall-through - Unknown kid triggers JWKS refresh and succeeds - Custom issuer override for testing - HD (hosted domain) claim extraction - Email lowercasing - JWKS cache transient failure tolerance (serves last-good keys) * style: fix gofmt formatting in proxyauth_test.go and settings_v1.go * docs(project-log): record auth-proxy-mode Phase 1 implementation * config: add auth.transport config for outbound transport auth Add TransportAuthConfig (hub_config.go) and V1TransportConfig (settings_v1.go) for the transport-layer auth that lets agents traverse IAP / Cloud Run invoker front doors. Config supports mode (none|cloudrun_invoker|iap), oidcAudience, and platformAuthSA fields. Wire into V1↔GlobalConfig conversion and env key mapping. Phase 2 item 6 of auth-proxy-mode. * hub: add TransportTokenMinter interface and implementations Introduce the TransportTokenMinter interface for minting Google OIDC ID tokens that let agents traverse platform guards (IAP / Cloud Run invoker). Three implementations: - gcpTransportMinter: production impl using IAM Credentials API (generateIdToken) to impersonate a dedicated platform-auth SA. Uses already-vendored google.golang.org/api/iamcredentials/v1. - noopTransportMinter: returns error when transport auth is disabled. - FakeTransportMinter: exported test double for other packages. Also adds RefreshTokenEntry type for the generalized tokens[] array and parseJWTExpiry for extracting expiry from ID tokens. All tests pass with no live GCP dependency (httptest fakes). Phase 2 item 6 of auth-proxy-mode. 
* hub: wire transport token minter into ServerConfig and dispatch Add TransportMode, TransportAudience, TransportMinter fields to ServerConfig and wire them through to the Server struct and HTTPAgentDispatcher. Transport tokens are injected as env vars (SCION_TRANSPORT_TOKEN, SCION_TRANSPORT_AUDIENCE, SCION_TRANSPORT_TOKEN_EXPIRY) into agent dispatch payloads in all three dispatch paths (Create, Start, Restart). server_foreground.go constructs a gcpTransportMinter from auth.transport config, deriving audience from hubEndpoint for cloudrun_invoker mode. When transport mode is "none" or unset, no minter is created and no transport tokens are injected — zero impact on existing deployments. Phase 2 item 6 of auth-proxy-mode. * hub: extend token refresh response with generalized tokens[] array The agent token refresh handler now returns a tokens[] array alongside the existing token/expires_at fields for backward compatibility. Old clients ignore tokens[]; new clients use it to apply both app-layer and transport-layer tokens. When transport auth is configured (transportMinter != nil), the response includes a google_oidc transport token entry with the configured audience. When disabled, only the app scion_access entry appears. Transport token minting errors are logged but don't fail the refresh — the app token is always returned. Phase 2 item 7 of auth-proxy-mode. * sciontool: add pluggable OIDC transport for agent outbound auth Implement the agent-side transport-layer auth with two pluggable token sources: - injectedTokenSource: uses the hub-provided SCION_TRANSPORT_TOKEN env var (cold start), then refreshed via the tokens[] array on subsequent refresh calls. - metadataTokenSource: fetches OIDC from the GCE metadata server (passthrough/on-GCE mode, the PR GoogleCloudPlatform#307 pattern). Selection logic: SCION_TRANSPORT_TOKEN env → injected mode; else if on GCE → metadata mode; else → no OIDC transport. 
The oidcTransport RoundTripper injects Authorization: Bearer on outbound hub requests. Graceful degradation: if token fetch fails, the request proceeds without the header (the hub can still auth via X-Scion-Agent-Token). Client changes: - Add oidcSource field and configureOIDCTransport() in NewClient() - Update RefreshTokenResponse with tokens[] array (backward compat) - RefreshToken() applies transport tokens via applyRefreshTokens() - Refresh scheduling uses shortest-lived entry (5-min margin for transport tokens vs 2h for scion tokens) 23 new tests covering both sources, transport, configuration, end-to-end dual-header, and refresh token application. Phase 2 item 8 of auth-proxy-mode. * docs(project-log): record auth-proxy-mode Phase 2 implementation * docs: add IAP proxy auth deployment guide (Phase 3) Add comprehensive deployment documentation for the IAP + Cloud Run invoker topology, covering inbound human IAP authentication, outbound agent transport auth (dual-layer OIDC + scion token), security considerations, and an end-to-end GCP setup checklist. All config keys and env vars verified against shipped code. * fix: prevent JWKS cache stampede and add HTTP client timeout - resolveHTTPClient() now returns a client with 10s timeout instead of http.DefaultClient (which has no timeout), preventing hangs on JWKS fetches. Tests that inject their own HTTPClient are unaffected. - JWKS cache refresh now debounces on lastAttempted (set at the start of every attempt, success or failure) instead of lastFetched (success only). This prevents stampedes during persistent JWKS outages where every cache-miss would trigger an unbounded refresh. - Added a refreshing guard to prevent concurrent in-flight refreshes (proactive background refresh + synchronous miss-refresh could race). - Network I/O is now performed outside the write lock to avoid holding the mutex across HTTP requests. 
- Added TestJWKSCache_StampedePreventionDuringOutage to verify that repeated misses during an outage do not cause repeated fetches within the debounce window. * fix: replace custom splitJWT with strings.Split and cache IAM service - Replace the hand-rolled splitJWT function with strings.Split(token, "."). Behavior is identical for well-formed JWTs; the custom function is deleted. - Cache the IAM credentials service client in gcpTransportMinter using sync.Once so it is created once and reused across MintIDToken calls instead of creating a new HTTP client/service on every invocation. Uses context.Background() for the long-lived client construction; per-call ctx continues to be passed to .Context(ctx).Do(). FakeTransportMinter is unaffected.
…oogleCloudPlatform#302)

* fix: resolve workspace file browser to groves/ instead of projects/

The Hub UI file browser was showing the wrong directory contents. The hubManagedProjectPath() function resolved workspace paths to ~/.scion/projects/<slug>/ (project metadata) instead of ~/.scion/groves/<slug>/ (the actual git checkout mounted as /workspace in agents).

Reverse the lookup priority: check groves/ first, fall back to projects/, and default to groves/ when neither has content.

Fixes GoogleCloudPlatform#130

* docs: add project log for issue GoogleCloudPlatform#130 workspace path fix

* fix: guard hubManagedProjectPath against empty slug

Prevent hubManagedProjectPath from resolving to the parent directory when called with an empty slug. Add unit test for this case.
…by/owner_id) The Agent Ent schema modeled created_by/owner_id as foreign keys to the users table. When an agent creates a sub-agent, those columns hold the *creating agent's* ID, which has no users-table row, so Postgres rejected the insert with a foreign-key violation. mapError maps that to ErrInvalidInput, surfacing as a detail-free "validation_error: Invalid input (status: 400)" on every agent-initiated `scion start`. User-created agents were unaffected, masking the regression (introduced when GoogleCloudPlatform#304 ported the agent store onto Ent). created_by/owner_id are polymorphic principal references (user OR agent), like ancestry. Drop the User-typed edges and keep them as plain principal UUID fields; resolve the delegation creator by ID and tolerate "no such user". Atlas AutoMigrate drops the two FK constraints on existing DBs at next boot. Tests: the sole sub-agent creation test only passed because it seeded a fake user row sharing the agent's ID — an impossible production state. Remove that workaround so it exercises the real path, and add store/ent regression tests asserting a non-user principal ID is accepted.
…o agent containers (GoogleCloudPlatform#322)

* Add sciontool doctor and agent auth reset infrastructure

When an agent's hub JWT expires and the refresh loop fails (e.g. hub signing key rotation), the agent becomes a zombie: running locally but invisible to the hub. This adds two features to diagnose and recover:

1. `sciontool doctor` command — runs inside the agent container to check env vars, token validity/expiry, hub connectivity, auth status, and GCP metadata/GitHub token health. Prints actionable remediation.
2. Auth reset mechanism — allows pushing a fresh token into a running agent without restarting. The flow is:
   - Hub generates a new agent JWT via DispatchAgentResetAuth
   - Broker's /reset-auth endpoint writes the token file via exec
   - Broker sends SIGUSR2 to sciontool init (PID 1)
   - Init re-reads the token, updates the hub client, restarts the token refresh loop, and sends an immediate heartbeat

Also adds Client.SetToken() for in-memory token updates.

* Add scion reset-auth CLI command and hub API endpoint

Adds the user-facing `scion reset-auth <agent>` command that triggers an auth reset on a running agent via the Hub. Also adds:

- Hub handler for POST /api/v1/agents/{id}/reset-auth
- hubclient AgentService.ResetAuth() method

---------

Co-authored-by: Scion Agent (eng-manager) <agent@scion.dev>
Adds a "Reset Auth" button in the agent detail header actions area,
visible when the agent is running. Clicking it calls the hub's
POST /api/v1/agents/{id}/reset-auth endpoint, which generates a
fresh JWT and pushes it into the running container without restart.
GoogleCloudPlatform#323)

* Make SIGUSR2 signal best-effort in reset-auth handler

The kill -USR2 step can fail (e.g. PID 1 is not sciontool init, or the process doesn't handle the signal). Since the token file write already succeeded and the refresh loop will pick up the new token without the signal, treat signal failure as a warning rather than returning a 500 error.

* Add admin bulk reset-auth endpoint

POST /api/v1/admin/agents/reset-auth-all lists all running agents and dispatches an auth reset for each, returning a per-agent success/failure summary. Admin role required.

* Add Reset Auth All button to admin maintenance page

Adds a Quick Actions section with a "Reset Auth — All Running Agents" button that calls POST /api/v1/admin/agents/reset-auth-all and displays a per-agent success/failure summary inline.

---------

Co-authored-by: Scion Agent (eng-manager) <agent@scion.dev>
* refactor(hub): split handlers by resource * Fix handlers split review and lint issues --------- Co-authored-by: Scion Agent (handlers-split-dev) <agent@scion.dev>
Co-authored-by: Scion Agent (codex-harness-model-dev) <agent@scion.dev>
Co-authored-by: Scion Agent (codex-dialect-yaml-dev) <agent@scion.dev>
…m#485) Co-authored-by: Scion Agent (codex-dialect-core-dev) <agent@scion.dev>
Co-authored-by: Scion Agent (codex-dialect-yaml-dev) <agent@scion.dev>
* codex harness: enable notification hooks * codex harness: escape otel toml values * codex harness: rely on bundled dialect mapping --------- Co-authored-by: Scion Agent (codex-harness-hooks-dev) <agent@scion.dev>
* feat(antigravity): pin CLI binary to 1.0.11 from GitHub Releases

Replace the auto-updater manifest fetch with a direct download from GitHub Releases, pinned to version 1.0.11. This improves build reproducibility and picks up USE_ADC env var support needed for ADC-based Vertex AI auth.

AGY_VERSION is a build ARG for easy future bumps. TARGETARCH is mapped to the release asset naming convention (amd64 → x64).

* feat(antigravity): switch vertex-ai auth to ADC via USE_ADC=1

vertex-ai auth now uses Application Default Credentials instead of requiring an AGY_TOKEN OAuth refresh token. When GCP env vars (GOOGLE_CLOUD_PROJECT + GOOGLE_CLOUD_LOCATION/REGION) are present, the provisioner selects vertex-ai and the wrapper sets USE_ADC=1.

Changes:

- config.yaml: remove required_files from vertex-ai type
- provision.py: vertex-ai no longer requires or validates AGY_TOKEN
- provision.py: autodetect prioritizes GCP env vars over token
- provision.py: wrapper script exports USE_ADC=1 in GCP mode
- oauth-token auth path is unchanged

* fix(antigravity): fall back to os.environ for GCP env var detection

The host-side forwarder (container_script_harness.go) only forwards GOOGLE_CLOUD_PROJECT and GOOGLE_CLOUD_REGION to auth-candidates, not GOOGLE_CLOUD_LOCATION. Fall back to checking os.environ directly so users who set GOOGLE_CLOUD_LOCATION are correctly detected as vertex-ai.

---------

Co-authored-by: Scion Agent (antigravity-harness-dev) <agent@scion.dev>
* codex harness: project template instructions * codex harness: drop unused system prompt file * codex harness: harden instruction projection * codex harness: polish instruction projection output --------- Co-authored-by: Scion Agent (codex-harness-template-dev) <agent@scion.dev>
* codex harness: enable notification hooks * codex harness: escape otel toml values * codex harness: rely on bundled dialect mapping --------- Co-authored-by: Scion Agent (codex-harness-hooks-dev) <agent@scion.dev>
…m#489) Co-authored-by: Scion Agent (codex-dialect-core-dev) <agent@scion.dev>
…pability (GoogleCloudPlatform#490)

- Extract session_id from .conversationId on PreInvocation/PostInvocation
- Extract tool_input from .toolCall.args on PreToolUse
- Remove false tool_name extraction from PostToolUse (field not in payload)
- Declare max_model_calls as supported (model-start/model-end are wired)
- Bump PROVISION_VERSION

Co-authored-by: Scion Agent (antigravity-hooks-dev) <agent@scion.dev>
…atform#491) Co-authored-by: Scion Agent (codex-harness-template-dev) <agent@scion.dev>
Co-authored-by: Scion Agent (codex-harness-hooks-dev) <agent@scion.dev>
Co-authored-by: Scion Agent (codex-otel-dev) <agent@scion.dev>
* changelog: add June 15 entry

Settings split-brain fix, skill bank web QA, agent-viz markdown rendering, Apple container parsing, Makefile build/install split

* changelog: add June 16 entry (no changes)

* changelog: add June 17 entry

Message display fix: channel/threadID persistence, role-based visibility

* changelog: add June 18 entry

Harness journey P1, template import selector, agent-viz colors/replay, image sync to Hub, harness resolve fix, dependency bumps

* changelog: add June 19 entry

Harness config delete/image UI, agent logs fallback, skill publish idempotency, user-scope authz fix

* changelog: add June 20 entry

Antigravity file-based OAuth, skill create UX P1, harness-config name field, metrics pipeline fixes, capture auth UI

* changelog: add June 21 entry

Antigravity keyring restore and token path fixes, Agent Registry lifecycle hooks demo, build image name from config.yaml

* changelog: add June 22 entry

Skill multipart upload replacing signed-URL flow, Antigravity token format and keyring capture fixes, skill bank M5 review fixes

* changelog: add June 23 entry

Metrics reporting fixes, capture-auth "already exists" handling

* changelog: add June 24 entry (no changes)

---------

Co-authored-by: Scion Agent (changelog-daily) <agent@scion.dev>
…oudPlatform#497) USE_ADC is not yet functional in the AGY CLI, so vertex-ai auth falls back to requiring AGY_TOKEN with keyring injection. The os.environ fallback for GOOGLE_CLOUD_LOCATION and the 1.0.11 binary pin are preserved. Co-authored-by: Scion Agent (antigravity-adc-revert-dev) <agent@scion.dev>
…CloudPlatform#500)

- Cast config to access `type` property, fixing TS2339 on union type
- Remove unused variables (s, mc, t) in render methods, fixing TS6133

Co-authored-by: Scion Agent (ci-web-types-fix-dev) <agent@scion.dev>
* feat(opencode): add vertex-ai auth support

Add vertex-ai as a third auth type for the opencode harness, matching the Claude harness pattern where vertex-ai is the lowest-priority fallback after direct credentials (api-key > auth-file > vertex-ai). Autodetects when GCP project + location env vars are present and gcp_metadata_mode is not "block". When selected, writes VERTEXAI_PROJECT and VERTEX_LOCATION to outputs/env.json.

* fix(opencode): annotate unpopulated gcp_metadata_mode guard

The gcp_metadata_mode field is never written to auth-candidates.json by the Go staging layer, making the guard inert. Add a comment noting it is reserved for future use rather than removing it, since the concept is actively used elsewhere in the system (e.g. claude_code harness).

* fix(opencode): respect vertex_not_blocked guard and populate standard GCP env vars

---------

Co-authored-by: Scion Agent (harness-oc-dev) <agent@scion.dev>
GoogleCloudPlatform#498)

* fix(harness): stage required_files as secrets instead of bind-mounting

For auth-file credentials declared in required_files (e.g. Codex auth.json), ApplyAuthSettings now reads the file content on the host and stages it as a 0600 secret file under agent_home/.scion/harness/secrets/<NAME>, recording the container-side path in a new file_secret_files field in auth-candidates.json.

The FileMapping is removed from resolved.Files so the runtime does not bind-mount the credential file read-only — Codex crashes on startup when auth.json is a read-only bind-mount because it tries to chown/write the file.

Non-declared file credentials (e.g. gcloud ADC) are unaffected and continue to pass through as bind-mounts.

* fix(harness): use HasSuffix for file-secret matching, add absolute-path test

stageFileSecretFiles() previously compared the normalized container path against normalize("~" + suffix), which only matched tilde-prefixed paths like ~/.codex/auth.json. When a FileMapping arrives with an absolute container path (e.g. /home/scion/.codex/auth.json) the comparison failed and the file was left as a read-only bind-mount instead of being staged as a secret, causing Codex to hit a read-only filesystem error on startup.

Switch to strings.HasSuffix(normCP, suffix) so that both path forms match the same TargetSuffix declaration. Remove the now-unused normSuffix variable.

Add TestContainerScriptHarness_ApplyAuthSettings_StagesFileSecrets_AbsolutePath to cover the absolute-path code path.

---------

Co-authored-by: Scion Agent (codex-harness-arch) <agent@scion.dev>
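The difference between the old equality check and the HasSuffix check described above is easiest to see side by side. In this sketch `matchesExact` and `matchesSuffix` are hypothetical helpers, `path.Clean` is assumed as the normalization step, and the suffix value is inferred from the example paths in the commit message.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// matchesExact mirrors the earlier behavior: only the tilde-prefixed form of
// the declared suffix matches.
func matchesExact(containerPath, suffix string) bool {
	return path.Clean(containerPath) == path.Clean("~"+suffix)
}

// matchesSuffix mirrors the fix: any normalized container path ending in the
// declared suffix matches, whether tilde-prefixed or absolute.
func matchesSuffix(containerPath, suffix string) bool {
	return strings.HasSuffix(path.Clean(containerPath), suffix)
}

func main() {
	suffix := "/.codex/auth.json" // assumed form of the TargetSuffix declaration
	for _, p := range []string{
		"~/.codex/auth.json",
		"/home/scion/.codex/auth.json",
		"/home/scion/.config/gcloud/adc.json",
	} {
		fmt.Println(p, matchesExact(p, suffix), matchesSuffix(p, suffix))
	}
}
```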
…oogleCloudPlatform#499) After reading the staged auth secret content, validate it parses as JSON before writing it to ~/.codex/auth.json. An invalid (e.g. corrupted or truncated) secret file would previously be written to disk without warning, causing Codex to fail at startup with an opaque JSON parse error. Now we surface a clear error message at provisioning time and exit with EXIT_ERROR. Co-authored-by: Scion Agent (codex-harness-arch) <agent@scion.dev>
…-file mode (GoogleCloudPlatform#501)

When auth-file mode is selected and the host staged the auth.json content as a file secret (CODEX_AUTH in file_secret_files), provision.py now reads the staged content and writes a fresh writable ~/.codex/auth.json (mode 0600).

This fixes the crash at Codex startup:

  lchown /home/scion/.codex/auth.json: read-only file system

Previously, auth.json was bind-mounted read-only and Codex could not chown or write to it during startup initialization. With the core fix, auth.json is no longer bind-mounted; provision.py materializes it as a regular file.

Depends on: scion/authfile-secret-staging (core fix for file_secret_files staging)

Co-authored-by: Scion Agent (codex-harness-arch) <agent@scion.dev>
…m#502) The handleMessageChannels handler existed but was never wired up in registerRoutes(), causing 404s for --channel flag in the CLI. Co-authored-by: Scion Agent (message-channels-fix-dev) <agent@scion.dev>
GoogleCloudPlatform#504) The generic "Run your Codex authentication setup" message didn't tell users which command to run. Replace it with the specific device-auth login command. Co-authored-by: Scion Agent (codex-login-hint-dev) <agent@scion.dev>
…GoogleCloudPlatform#505) Add PYTHONDONTWRITEBYTECODE=1 to minimalEnv() so Python never writes .pyc bytecache files during container provisioning. On Docker (non-rootless), pre-start hooks run as root, creating root-owned __pycache__ on the bind-mounted agent home. The broker runs as the host user and cannot delete those files, causing 'scion delete' to fail with 'permission denied'. The provisioner runs once per agent start so bytecache provides zero benefit. Co-authored-by: Scion Agent (pycache-fix-dev) <agent@scion.dev>
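As a rough illustration of the change above, the sketch below builds a small environment that always carries PYTHONDONTWRITEBYTECODE=1 and attaches it to a command. Only the variable itself comes from the commit message; the body of minimalEnv() shown here, the PATH passthrough, and the python3 provision.py invocation are assumptions.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// minimalEnv is a hypothetical stand-in for the project's function of the
// same name. It returns a deliberately small environment for provisioning.
func minimalEnv() []string {
	// Stop Python from writing __pycache__ directories, which would be
	// root-owned when the hook runs as root on a bind-mounted home.
	env := []string{"PYTHONDONTWRITEBYTECODE=1"}
	if p, ok := os.LookupEnv("PATH"); ok {
		env = append(env, "PATH="+p) // assumed passthrough
	}
	return env
}

func main() {
	// The command is only constructed here, never executed.
	cmd := exec.Command("python3", "provision.py") // hypothetical invocation
	cmd.Env = minimalEnv()
	fmt.Println(cmd.Env)
}
```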
…ogleCloudPlatform#506)

* feat(harness): add copilot harness bundle and build system integration

Add the GitHub Copilot CLI harness bundle (from PR GoogleCloudPlatform#295) and register scion-copilot in the image build system:
- targets.sh: add scion-copilot to ALL_STEP_IDS and all target groups
- cloudbuild.yaml: add build step for scion-copilot
- cloudbuild-harnesses.yaml: add build step for scion-copilot

* revert: remove copilot from central build system

Harness bundles should be self-contained with their own cloudbuild.yaml. The central build system is moving toward base-images-only. The per-bundle harnesses/copilot/cloudbuild.yaml handles its own build.

* fix(copilot): resilient auth fallback in provisioner

The provisioner now gracefully falls back to no-auth mode when:
1. auth-candidates.json has no env_vars (hub-registered harness configs are hydrated after env-gather, so the broker doesn't know what auth keys to stage for new harness types)
2. The _select_auth_method raises ValueError but no_auth.behavior is configured in the harness config

Also adds env var fallback in _present_env_keys and _read_secret to check os.environ when the auth-candidates path has no entries (covering cases where the token is in the container env but wasn't staged through the auth pipeline).

* fix(copilot): use glibc binary variant for Debian-based scion-base

scion-base is Debian bookworm (glibc), not Alpine (musl). The Dockerfile was downloading copilot-linuxmusl-* which fails to load due to missing libc.musl-x86_64.so.1. Switch to copilot-linux-* glibc variant.

* fix(copilot): remove invalid --no-banner flag from command

The Copilot CLI does not have a --no-banner flag (it has --banner to enable). The invalid flag causes copilot to exit immediately with an error. Remove it from the base command.

* fix(copilot): auto-trust workspace folder and suppress banner

- Add trustedFolders=["/workspace"] and banner="never" to default settings
- Provisioner ensures these defaults at provision time
- Prevents the interactive "Confirm folder trust" dialog that blocks non-interactive agent operation

* fix(copilot): bake settings.json into image for reliable trust bypass

The home-dir staging from hub-registered harness configs may not always copy home/.copilot/settings.json into the container. Baking the file directly into the Dockerfile ensures trustedFolders and banner settings are present regardless of the staging mechanism.

* fix(copilot): move trustedFolders to config.json where copilot reads it

Copilot reads trustedFolders from config.json (auto-managed), not settings.json. Our settings.json entry was being ignored, causing the interactive "Confirm folder trust" dialog to block agent startup.
- Create home/.copilot/config.json with trustedFolders pre-set
- Update provisioner to write trustedFolders to config.json
- Remove trustedFolders from settings.json (copilot ignores it there)
- Update Dockerfile to COPY config.json into the image

* fix(copilot): add defensive type checks per code review

---------

Co-authored-by: Scion Agent (gh-copilot-harness-lead) <agent@scion.dev>
…oogleCloudPlatform#507) Co-authored-by: Scion Agent (new-project-form-fix-dev) <agent@scion.dev>
…cumulation (GoogleCloudPlatform#508) rm -rf web/dist before npm run build so old JS chunks don't persist across builds. Also include web/dist in the clean target. Restores the tracked .gitkeep after cleaning. Co-authored-by: Scion Agent (web-build-clean-dev) <agent@scion.dev>
…ction Use Docker/OCI convention to detect fully-qualified image references: if the first path component contains a '.' or ':' it's a registry domain, so keep the image as-is. Bare names and relative paths are rewritten to the configured image_registry. This is more correct than the scion-* prefix heuristic because: - Fully-qualified scion-* images from external registries are preserved - Non-scion bare names are now also rewritten (previously skipped) - No need for an image_pinned workaround flag Resolves ptone#265
1a23473 to 98439d4
Summary
Implements Alternative A from ptone#265: format-based image registry rewrite detection.
- Fully-qualified images (first path component contains '.' or ':') are kept as-is, e.g. ghcr.io/myorg/scion-elixir:latest, us-docker.pkg.dev/proj/repo/scion-claude:latest, localhost:5000/myimage:dev
- Bare names are rewritten to the configured image_registry, e.g. scion-claude:latest, my-custom-agent:v2
- Relative paths are rewritten as well, e.g. library/scion-claude:latest

This replaces the scion-* prefix heuristic, which broke when template authors used fully-qualified scion-* images from external registries. It also deprecates the image_pinned workaround from PR GoogleCloudPlatform#425; that flag is no longer needed since the format-based detection handles the use case natively.

Changes
- pkg/config/settings_v1.go: Rewrote RewriteImageRegistry(); it now uses the Docker/OCI convention ('.' or ':' in the first path component means a registry domain) instead of the scion-* basename prefix check
- pkg/config/settings_v1_test.go: Replaced the test suite with 20 test cases covering bare scion names, bare non-scion names, relative paths, fully-qualified images (ghcr.io, us-docker.pkg.dev, docker.io, localhost:5000, custom.registry:5000), sha256 digests, and edge cases

Test plan
- go test ./pkg/config/... -run TestRewriteImageRegistry: all 20 cases pass
- go vet ./...: clean
- go build ./...: clean

Resolves ptone#265
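For readers who want to see the detection rule in code, here is a minimal sketch of format-based rewriting. It is not the PR's RewriteImageRegistry(): the function names, the signature, the registry value registry.example.com/team, and the choice to prepend the registry to the whole reference (rather than, say, only its basename) are all assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// hasRegistryDomain applies the Docker/OCI convention: the text before the
// first '/' counts as a registry domain only if it contains '.' or ':'.
// A reference without any '/' is a bare name, so the ':' in
// "scion-claude:latest" is treated as a tag separator, not a port.
func hasRegistryDomain(image string) bool {
	first, _, found := strings.Cut(image, "/")
	if !found {
		return false
	}
	return strings.ContainsAny(first, ".:")
}

// rewriteImageRegistry is a hypothetical stand-in for the real function.
// Fully-qualified references are returned unchanged; bare names and
// relative paths get the configured registry prepended (an assumption).
func rewriteImageRegistry(image, registry string) string {
	if image == "" || registry == "" || hasRegistryDomain(image) {
		return image
	}
	return strings.TrimSuffix(registry, "/") + "/" + image
}

func main() {
	registry := "registry.example.com/team" // hypothetical image_registry value
	images := []string{
		"ghcr.io/myorg/scion-elixir:latest",
		"localhost:5000/myimage:dev",
		"scion-claude:latest",
		"my-custom-agent:v2",
		"library/scion-claude:latest",
	}
	for _, img := range images {
		fmt.Printf("%-36s -> %s\n", img, rewriteImageRegistry(img, registry))
	}
}
```

Checking for a '/' before looking for '.' or ':' is what keeps tagged bare names from being mistaken for host:port registries.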