v8.13.0 #1738
Merged
Three changes to eliminate orphaned entries between collections:
1. break → continue in storeAppRunningMessage loop: for v2 messages
with multiple apps, skip apps that already have current data but
keep processing the rest. Previously broke out of the entire loop.
2. storeAppRunningMessage returns { stored, rebroadcast } instead of
true/false. The gossip handler only calls storeSignedAppRunningBroadcast
when stored is true, ensuring both collections accept or reject together.
3. Remove redundant 5-minute gossip validity check from
storeSignedAppRunningBroadcast — it's now gated on the location
store's acceptance, eliminating the timing edge where one store
accepts at the boundary and the other rejects milliseconds later.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
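A minimal sketch of changes 1 and 2 above, using an in-memory Map in place of the location collection. The function name `storeAppRunningSketch`, the message shape, and the "already current" comparison are assumptions for illustration, not the FluxOS implementation.

```javascript
// Hypothetical stand-in for the location store: "ip:name" -> broadcastedAt.
function storeAppRunningSketch(locationMap, message) {
  let stored = false;
  for (const app of message.apps) {
    const key = `${message.ip}:${app.name}`;
    const current = locationMap.get(key);
    // Skip only this app when current data is already held. A `break` here
    // would drop every remaining app in the same v2 message.
    if (current !== undefined && current >= message.broadcastedAt) continue;
    locationMap.set(key, message.broadcastedAt);
    stored = true;
  }
  // Assumption: in this sketch rebroadcast simply mirrors stored.
  return { stored, rebroadcast: stored };
}

const locationMap = new Map([['1.2.3.4:appA', 200]]);
const storeResult = storeAppRunningSketch(locationMap, {
  ip: '1.2.3.4',
  broadcastedAt: 200,
  apps: [{ name: 'appA' }, { name: 'appB' }],
});
// appA is skipped, appB is stored; a caller would persist the signed
// broadcast only when storeResult.stored is true.
```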
The sigterm handler was mutating broadcastedAt on location records to force 7-minute TTL expiry. This broke the data contract — broadcastedAt is derived from signed data and should never change. Stale gossip could also overwrite the sigterm by passing the "is newer" check against the fake broadcastedAt value. Switch all 6 ephemeral collections to expireAt-based TTL (expireAt:0). expireAt is operational metadata we control, not part of the signed payload. Sigterm now sets expireAt = now + 7min on both locations and signed broadcasts without touching broadcastedAt. Also: split gossip validity (5min) from record expiry into named constants, add missing expireAt to error stores, fix empty-apps v2 handler to clean up signed broadcasts with broadcastedAt guard. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
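A rough sketch of the expireAt approach described above. The index spec uses MongoDB's TTL option `expireAfterSeconds: 0`, which expires a document at the time stored in the indexed date field. The helper name `buildSigtermUpdate`, the constant name, and the `{ ip }` filter are assumptions for illustration.

```javascript
const SIGTERM_TTL_MS = 7 * 60 * 1000; // 7 minutes, per the commit message

// With expireAfterSeconds set to 0, MongoDB removes each document once the
// time held in its expireAt field has passed.
const ttlIndexSpec = { keys: { expireAt: 1 }, options: { expireAfterSeconds: 0 } };

// Build the update a sigterm handler could apply to the records of one IP.
// Only expireAt is written; broadcastedAt (signed data) is never touched.
function buildSigtermUpdate(ip, nowMs = Date.now()) {
  return {
    filter: { ip },
    update: { $set: { expireAt: new Date(nowMs + SIGTERM_TTL_MS) } },
  };
}
```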
nodeStatusMonitor and storeAppRemovedMessage deleted from zelappslocation without touching fluxapprunningbroadcasts, leaving orphaned signed broadcasts (~44 per 20-minute monitor cycle). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- storeAppRemovedMessage: $addToSet excludedApps on v2 broadcast docs so the derived view skips removed apps without mutating signed data
- storeSignedAppRunningBroadcast + batch sync: $unset excludedApps when a newer broadcast upserts (clears stale exclusions)
- appLocationFromBroadcasts: filter out excluded apps after v2 unwind
- reindexGlobalAppsLocation: also drop running broadcasts collection
- explorer rescan: also drop running + installing broadcasts
- Export handleMissingMasterSlaveContainer from stoppedAppsRecovery
- Fix all 10 CI test failures, add excludedApps tests
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Single `appstateevents` collection replaces `fluxapprunningbroadcasts` as the source of truth. Four event types (apprunning, sigterm, appremoved, evicted) with dedupKey-based upserts and $cond timestamp guards. `zelappslocation` stays populated as a materialized cache.
- storeAppStateEvent() dispatcher with APP_STATE_EVENT_TYPES enum
- storeBatchAppRunningEvents() for sync receiver
- Gossip handler writes event unconditionally, then materializes location
- Sigterm/appremoved/evicted all append events instead of mutating
- Sync sender/receiver stream from event log
- Remove storeSignedAppRunningBroadcast, excludedApps, gossip gating
- 99 tests passing
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
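The dedupKey upsert with an "only if newer" guard can be illustrated with an in-memory Map. This is a sketch only: the event shape, the way `dedupKeyFor` composes its key, and the function names are assumptions, and the real store performs the guard inside MongoDB.

```javascript
// Hypothetical event shape: { type, ip, eventAt, data }.
function dedupKeyFor(event) {
  return `${event.type}:${event.ip}`;
}

// Upsert keyed by dedupKey. An existing entry is replaced only when the
// incoming eventAt is strictly greater, so replays and stale gossip are no-ops.
function upsertEvent(eventLog, event) {
  const key = dedupKeyFor(event);
  const existing = eventLog.get(key);
  if (existing && !(event.eventAt > existing.eventAt)) return false;
  eventLog.set(key, event);
  return true;
}
```

Using a strict comparison means a re-delivered event with the same timestamp does not rewrite the stored document.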
The view now filters appremoved, sigterm, and evicted events, excludes stale v1 broadcasts superseded by newer v2, and correctly handles expired shutdown events. Verified against charlie live data (0 diff). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The event store was accepting gossip up to 125min old (RUNNING_EXPIRY_MS) instead of 5min (GOSSIP_VALIDITY_MS). Only the batch sync path should accept older messages. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
nodeStatusMonitor deletes locations immediately on eviction, but the view was giving evicted IPs the same 7-minute grace period as sigterm. Eviction should be immediate — the monitor already verified the node is gone. Also extend eviction TTL to match apprunning (125min) so the eviction event outlives the apprunning events it suppresses. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
storeSignedAppRunningBroadcast no longer exists — stub storeAppStateEvent instead. Sigterm handler now calls updateInDatabase once (location expiry only) not twice, and storeAppStateEvent needs stubbing to prevent throw. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix undefined appsRunningBroadcasts in apiServer.js sigterm handler,
add storeAppStateEvent(SIGTERM) call for own shutdown
- Escape regex in appLocationFromBroadcasts to prevent injection
- Cap sync response batch size at 2500 in all 4 handlers
- Add IPCHANGED event type with view remapping so IP changes are
reflected in the event log view
- Await all storeAppStateEvent calls (was fire-and-forget)
- Use ?? instead of || for config fallbacks in orchestrator
- Optimise appLocationFromBroadcasts pipeline: $arrayToObject/$getField
for O(1) lookups instead of $filter scans (2900ms → 118ms), push
name filter into facet sub-pipelines (2666ms → 26ms for targeted)
- Standardise $gt (not $gte) for "only if newer" guards
- Add {createdAt: 1} index for sync sender evicted event queries
- Hash sync failure recovery: retry 3x with 5-min gap, block timer
fallback if retries exhausted, background 20-min recheck on
blockReceived for missing hashes
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
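The `??` item in the list above matters when a config value is legitimately falsy. A small illustration, where the config object and its values are hypothetical:

```javascript
// Hypothetical config; an operator has explicitly set the threshold to 0.
const nodeConfig = { appSyncPeerThreshold: 0, hashSyncIntervalMs: undefined };

// || treats every falsy value (0, '', false) as missing.
const thresholdWithOr = nodeConfig.appSyncPeerThreshold || 3;

// ?? falls back only on null or undefined, so the explicit 0 survives.
const thresholdWithNullish = nodeConfig.appSyncPeerThreshold ?? 3;
const intervalWithNullish = nodeConfig.hashSyncIntervalMs ?? 60000;
```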
Tests cover: retry on failure, block timer fallback when retries exhausted, readiness via block timer when hash sync never completes, and DB rebuild failure not blocking the state machine. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extract streamBatchedSync helper from 3 nearly identical respondWith* functions. Rename MIN_SYNC_PEERS to MIN_SYNC_COMPLETIONS for clarity. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
$getField with dynamic field references requires MongoDB 7.2+ (SERVER-74371). CI runs 7.0. Replaced $arrayToObject/$getField O(1) maps with $filter/$first lookups against small arrays. Structural optimization preserved: shutdown/v1 filtering at IP level before unwinding. Estimated ~200-300ms at full scale vs 118ms with $getField vs 2900ms with the original post-unwind approach. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- handleAppRunningEvent: reject empty-apps v2 when no prior events exist for that IP (matches location store behavior independently)
- handleNodeSigtermMessage: check event log for app events instead of zelappslocation, so sigterm handling works without locations
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rename to reflect event log architecture (broadcasts no longer exist).
Change signature from positional appname to options object { appname, ip }
to support IP filtering. Sigterm handler now uses the full view derivation
to check for apps instead of a naive event log findOne.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Stores the time each node received/processed the event, alongside the original broadcastedAt from the source node. The delta reveals gossip propagation latency and helps diagnose messages that arrive near the 5-minute validity boundary. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Gossip path sets receivedAt on insert. Batch sync path preserves the sender's receivedAt so the original gossip reception time is retained across sync. Enables propagation latency diagnostics on installing and install error broadcasts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sigterm events had 7-min TTL matching the grace period, but apprunning events have 125-min TTL. After the sigterm TTL'd away, apps reappeared in the view with nothing to suppress them. Same race as the evicted TTL bug. Fix: sigterm event expireAt uses RUNNING_EXPIRY_MS (125 min) so the document outlives every apprunning it suppresses. The 7-min grace period is computed from eventAt in the view pipeline, not from expireAt. Export SIGTERM_EXPIRY_MS and use it in fluxCommunication.js and apiServer.js instead of hardcoded 420*1000. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
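A sketch of the separation described above, where document lifetime and grace period are independent. The constant values come from the commit messages; the event shape and the two function names are assumptions, as is the reading that apps stay visible during the grace window and are hidden afterwards.

```javascript
const SIGTERM_EXPIRY_MS = 420 * 1000;       // 7-minute grace period
const RUNNING_EXPIRY_MS = 125 * 60 * 1000;  // lifetime of apprunning events

// The sigterm document lives as long as the apprunning events it suppresses.
function buildSigtermEvent(ip, eventAtMs) {
  return {
    type: 'sigterm',
    ip,
    eventAt: eventAtMs,
    expireAt: new Date(eventAtMs + RUNNING_EXPIRY_MS),
  };
}

// The grace period is measured from eventAt, never from expireAt.
function isSuppressedBySigterm(sigtermEvent, nowMs) {
  return nowMs - sigtermEvent.eventAt >= SIGTERM_EXPIRY_MS;
}
```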
The previous hash sync sent fluxapprequest to a single random peer per attempt, with a fixed 30s wait that couldn't cover the 75s response time for 500 hashes (150ms per hash on the responder). It also broke out on zero progress and reused the same peers.
New algorithm:
- Bulk threshold lowered from 1000 to 500 (matching fluxapprequest v2 cap)
- Targeted path sends to 3 peers per round with poll-until-settled
- Timeout proportional to hash count (count × 150ms + 5s buffer)
- Settle detection: exits early when no new responses for 4s
- Tracks tried peers and never repeats them across rounds
- Continues through all rounds regardless of per-round progress
- Excludes deterministic peers (same-provider neighbors)
- Bulk path aggregates responses from all peers instead of picking largest
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
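The timeout formula and the no-repeat peer selection can be sketched as follows. The numbers are taken from the commit message; the function names and the use of plain strings as peers are assumptions for illustration.

```javascript
const PER_HASH_MS = 150;    // responder cost per hash
const BUFFER_MS = 5000;     // fixed buffer on top
const PEERS_PER_ROUND = 3;

// Timeout grows with the number of hashes requested.
function roundTimeoutMs(hashCount) {
  return hashCount * PER_HASH_MS + BUFFER_MS;
}

// Pick up to three peers that have not been tried yet and mark them as tried,
// so later rounds never contact the same peer again.
function pickPeersForRound(peers, triedPeers) {
  const fresh = peers.filter((peer) => !triedPeers.has(peer)).slice(0, PEERS_PER_ROUND);
  fresh.forEach((peer) => triedPeers.add(peer));
  return fresh;
}
```

For the 500-hash cap this gives 500 × 150 ms + 5 s = 80 s, which covers the 75 s response time the old fixed 30 s wait could not.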
Moves the broadcast signing logic out of fluxCommunicationMessagesSender into utils/fluxBroadcastHelper. This breaks the circular dependency that prevented appHashSyncService from sending signed messages to peers (messageStore → messageVerifier → fluxCommunicationMessagesSender). appHashSyncService now uses fluxBroadcastHelper directly to sign and send fluxapprequest messages via peer.send(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The cycle messageVerifier → registryManager → messageStore → messageVerifier caused messageVerifier exports to be empty at load time, breaking checkAppMessageExistence for gossip message handling. Root cause: registryManager imported SIGTERM_EXPIRY_MS from messageStore, which created a circular require chain during module initialization. Fix: Move all TTL/expiry constants (GOSSIP_VALIDITY_MS, RUNNING_EXPIRY_MS, INSTALLING_EXPIRY_MS, INSTALLING_ERRORS_EXPIRY_MS, SIGTERM_EXPIRY_MS, EVICTED_EXPIRY_MS) from messageStore to appConstants. Update all consumers to import from appConstants instead. Also extracts serialiseAndSignFluxBroadcast into utils/fluxBroadcastHelper to cleanly separate broadcast signing from peer routing logic. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The ephemeral sync receiver stored appremoved/sigterm/evicted events to
the event log but didn't apply their location side-effects. This caused
syncing nodes to have stale locations that the sender had already deleted.
- appremoved: delete location entry for {ip, appName}
- sigterm: update expireAt on all locations for that IP
- evicted: delete all locations for that IP
Also gates ephemeral sync on network state readiness — the orchestrator
now requires both peer threshold AND node list populated before firing
sync requests. Prevents verification failures from unloaded node list.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
apiServer.handleSigterm used the old { ip, broadcastedAt, envelope }
format and referenced messageStore.SIGTERM_EXPIRY_MS which was moved
to appConstants. Updated to pass { message, envelope } so the full
signed payload is stored for sync re-verification.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sigterm, appremoved, and ipchanged event handlers were stripping type
and version fields from the stored data. When these events were synced
to another node, re-verification failed because the signature was
computed over the original full message, not the stripped version.
Now stores the complete message object as data so envelope + data
can be reconstructed for verification during sync.
Also updates all callers to pass { message, envelope } instead of
individual fields.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add TTL constants to appConstants stub (moved from messageStore)
- Update sigterm test to use { message, envelope } format
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
processMessages now checks permanent message existence in chunks of 2000 using a single $in query instead of individual findOne per message. Existing hashes are batch-marked as message:true via bulkWrite. Only genuinely new messages go through the sequential storeAppTemporaryMessage + checkAndRequestApp path. For the common case (most messages already exist), this reduces ~58k individual DB reads + writes to ~29 batch operations. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
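The chunking arithmetic behind the "~58k → ~29" figure, as a sketch. The `hash` field name in the filter and the helper names are assumptions; only the chunk size of 2000 and the use of `$in` come from the commit message.

```javascript
const EXISTENCE_CHUNK_SIZE = 2000;

// Split an array into consecutive slices of at most `size` items.
function chunkArray(items, size) {
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// One $in filter per chunk replaces one findOne per message.
function buildExistenceFilters(hashes) {
  return chunkArray(hashes, EXISTENCE_CHUNK_SIZE).map((part) => ({ hash: { $in: part } }));
}
```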
The update path in checkAndRequestApp did two queries per message to
find the latest permanent message for an app name:
1. find({appSpecifications.name}) — loaded all docs, iterated in JS
2. find({zelAppSpecifications.name}) — full collection scan (no index,
0 results — legacy field from Zel→Flux rebrand, never populated)
Combined cost: ~48ms per message on 35k-doc collection. For 58k
messages during bulk hash sync, this added ~22 minutes of pure waste.
Fix:
- Remove zelAppSpecifications query entirely (dead code)
- Replace find-all + JS iterate with findOne using sort:{height:-1}
which leverages the existing {appSpecifications.name:1, height:-1}
compound index
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the storeAppTemporaryMessage + checkAndRequestApp per-message flow with a single-pass verify-and-batch-insert approach:
- Skips temp message storage entirely (was write + immediate read-back)
- Eliminates 3 duplicate DB reads per message (existence checks done twice, getPreviousAppSpecifications done twice)
- Eliminates duplicate signature verification
- Pre-loads previous app specs for update messages per chunk (one $in query replaces N individual find-all queries)
- Batch inserts permanent messages via insertMany
- Batch marks hashes via bulkWrite
- Keeps: hash verification, signature verification, app spec validation, price validation, name conflict checks for registers
Also removes dead zelAppSpecifications query in messageVerifier checkAndRequestApp (unindexed full collection scan, 0 results). Replaces find-all + JS iterate with indexed findOne for update path.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
validatePrice called appPricePerMonth (async) without await, causing price comparisons against Promise objects. Also restores specificationFormatter for consistent spec formatting before signature verification. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Registrations verified within a chunk are added to the prevSpecsMap so that updates later in the same chunk can find their previous specs without a DB round-trip. Eliminates the 30% failure rate where updates couldn't find registrations from the same chunk. The map is pre-loaded from DB per chunk (for cross-chunk lookups) and grown as registrations are verified. Memory bounded by unique app names per chunk. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
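A simplified sketch of growing the map while a chunk is processed in order. The message shape and the name `verifyChunkSketch` are hypothetical, and real verification (signatures, price, spec validation) is omitted.

```javascript
// Hypothetical message shape: { type: 'register' | 'update', name, height }.
// prevSpecsMap is assumed to be pre-loaded from the DB for this chunk.
function verifyChunkSketch(messages, prevSpecsMap) {
  const accepted = [];
  const rejected = [];
  for (const msg of messages) {
    if (msg.type === 'update' && !prevSpecsMap.has(msg.name)) {
      rejected.push(msg); // no previous spec known for this app
      continue;
    }
    accepted.push(msg);
    // A verified registration becomes visible to later updates in the chunk.
    if (msg.type === 'register') prevSpecsMap.set(msg.name, msg);
  }
  return { accepted, rejected };
}
```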
Adds nodeConfigOverrides option to createTestEnv — a map of node index to config that merges on top of the global configOverrides. This allows setting different config on specific nodes, e.g. appSyncMinCompletions=3 only on the joining node without affecting source nodes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Only one joining node is needed. Also set appSyncPeerThreshold=3 so the peer threshold fires after 3 peers connect, matching the appSyncMinCompletions=3 requirement. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Health check timeout (5s) exceeded interval (3s), causing Docker's health state machine to produce spurious "unhealthy" on container restart. Reduced timeout to 2s across all container health checks. Docker's CloseMonitorChannel sets health status to "unhealthy" during monitor teardown (moby/daemon/container/health.go:80). On restart, HealthCheckWaitStrategy sees this transient state and destroys the container. Replaced restartNode to swap in an HTTP-polling wait strategy that bypasses Docker's health state machine entirely. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bulk permanent message fetch now partitions missing hashes across peers and streams in parallel via Promise.allSettled, instead of sequential single-peer streaming. Each stream maintains its own 500-message backpressure — peak memory is ~1500 messages vs 500 previously. Targeted fetch and ephemeral rounds now chunk hashes into groups of 500 before calling broadcastHashRequest, fixing a latent bug where >500 hashes would exceed the fluxapprequest v2 message cap. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Parallel bulk fetch caused ~10-25% failure rate per batch because update messages couldn't find predecessor specs processed on other streams. Reverted to sequential streaming which maintains height ordering across all messages. Kept the broadcastHashRequest chunking at 500 for targeted fetch rounds and ephemeral rounds (latent bug fix). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The first checkAndNotifyPeersOfRunningApps call was triggered by peer threshold, before appStartupManager finished reconciling containers. This caused the broadcast to report 0 apps because Docker containers hadn't been started yet. The next broadcast wouldn't fire for an hour (peerNotifyIntervalMs). Gate the first broadcast behind waitForBootContainerStateSettled() so it runs after reconciliation completes and Docker state is accurate. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The first broadcast was racing with appStartupManager and reporting 0 apps. This test verifies the app:running SSE event includes the reconciled app after a simulated reboot, catching the race if the broadcast gate on boot:settled is removed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
If the HTTP poll times out, log a warning instead of throwing. Throwing triggers testcontainers' waitForContainer error handler which destroys the container, making the failure undiagnosable. The test's own assertions will catch the actual problem. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- appHashSyncService.js: messageStore, globalState
- appStartupManager.js: decryptEnterpriseApps, appUsesGSyncthingMode
- serviceManager.js: hashSyncIntervalMs, peerNotifyIntervalMs, locationTtlS, installingTtlS, installErrorTtlS, removalSpacingMs (dead: old interval logic moved to orchestrator)
- nodeStatusMonitor.js: fluxEventBus
- messageVerifier.js: scannedHeightCollection
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevents a re-processed registration from overwriting a newer update spec. Mirrors the existing guard in updateAppSpecifications. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents the known divergence between secure and non-secure nodes for enterprise usersToExtend updates, and the planned resolution via Arcane attestations. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The return value of apps.filter() was discarded, causing already-resolved apps to be re-requested via checkAndRequestMultipleApps. Idempotent but wasteful. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
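The bug pattern in isolation, with made-up data: `Array.prototype.filter` returns a new array and leaves the original untouched, so a discarded result changes nothing.

```javascript
// Hypothetical list of requested apps.
const requestedApps = [
  { hash: 'a', resolved: true },
  { hash: 'b', resolved: false },
];

// Bug: the filtered array is thrown away, requestedApps is unchanged.
requestedApps.filter((app) => !app.resolved);
const countAfterDiscardedFilter = requestedApps.length;

// Fix: keep the returned array and use it for the follow-up request.
const pendingApps = requestedApps.filter((app) => !app.resolved);
```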
chunk.toString() can corrupt multi-byte UTF-8 characters split across chunk boundaries. StringDecoder buffers incomplete characters across writes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
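A small demonstration using Node's built-in `string_decoder` module, with a euro sign deliberately split across two chunks. The chunk contents are made up for the example.

```javascript
const { StringDecoder } = require('string_decoder');

// "€" is three bytes in UTF-8 (E2 82 AC); split it across two chunks.
const splitChunks = [Buffer.from([0xe2, 0x82]), Buffer.from([0xac])];

// Naive: each chunk is decoded alone, so the partial bytes turn into U+FFFD.
const naiveText = splitChunks.map((chunk) => chunk.toString('utf8')).join('');

// StringDecoder holds back incomplete bytes until the rest arrives.
const decoder = new StringDecoder('utf8');
const decodedText = splitChunks.map((chunk) => decoder.write(chunk)).join('') + decoder.end();
```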
Pre-existing test fixture private key, not introduced by this PR but file was modified. Added to GitGuardian ignored_paths. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The broadcast gate change requires the globalState stub to provide waitForBootContainerStateSettled, otherwise the broadcast promise never resolves and the test fails. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace .catch(() => {}) with warnings that include the network
name and component. Silent swallowing masked resource leaks that
caused intermittent failures in later suites.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: event-driven app state sync with event log
An app owner can link an app to other apps by embedding a token in the app description text: networkWith:[appA,appB] (brackets required, quotes optional, key case-insensitive, comma separated). This is purely node-local behaviour: no app specification field, no validation change, no network consensus impact.
When the token is present:
- Before install/redeploy, the node verifies every named app is installed locally and owned by the same owner; otherwise the operation fails.
- Each of the app's component containers is attached to the private docker network of every linked app (fluxDockerNetwork_<linked>), so it can reach that app's components by docker DNS name flux<component>_<linkedApp>, as if both apps were a single app.
- When a linked-to app is (re)deployed, any locally installed app that is networked with it is reconnected to its network.
New module appNetworkLinker.js holds the parser, the install gate, and the forward/reverse network wiring. The gate and forward wiring run in installApplicationHard/installApplicationSoft (the only callers of appDockerCreate), so every container-creation path is covered, including direct callers that bypass registerAppLocally (container health recovery and legacy v<=3 redeploys). Reverse wiring runs in registerAppLocally and softRegisterAppLocally; a boot-time reconcile sweep re-applies all links. dockerService gains an idempotent appDockerNetworkConnect helper.
Adds tests/unit/appNetworkLinker.test.js (parser, gate, wiring, reconcile) and appDockerNetworkConnect coverage in dockerService.test.js.
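A possible shape for the token parser, following the rules stated above (brackets required, quotes optional, key case-insensitive, comma separated). `parseNetworkWith` is a hypothetical name; the actual parser in appNetworkLinker.js may differ, for example in how it validates app names.

```javascript
// Returns the list of linked app names, or an empty array when the token
// is absent or written without brackets.
function parseNetworkWith(description) {
  const match = /networkwith:\[([^\]]*)\]/i.exec(description || '');
  if (!match) return [];
  return match[1]
    .split(',')
    .map((name) => name.trim().replace(/^["']|["']$/g, '')) // strip optional quotes
    .filter((name) => name.length > 0);
}
```

For instance, `parseNetworkWith('db frontend NetworkWith:["appA", appB]')` yields the two names `appA` and `appB`, while a description containing `networkWith:appA` yields an empty list because the brackets are missing.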
- extract APP_NAME_REGEX (v8+) and APP_NAME_REGEX_LEGACY (v<=7 / components) into appConstants; consume from appValidator and appNetworkLinker
- move getAppContainerNames / getAppContainerObjects into dockerService; anchor the multi-component match to ^(?:flux|zel)[a-zA-Z0-9]+_<app>$ and escape regex metacharacters in the app name; refactor getNextAvailableIPForApp to use the same helper
- rewrite appDockerNetworkConnect to inspect the container's NetworkSettings.Networks first and skip the connect when already attached; drop the blanket 403 catch (overloaded by docker) in favour of a narrow already-exists message match as a TOCTOU race fallback
- update affected unit tests
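The anchored multi-component match from the list above, sketched with hypothetical helper names. The pattern itself is the one quoted in the commit message; `escapeRegex` uses the common metacharacter-escaping idiom.

```javascript
// Escape characters that have special meaning in a regular expression.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Matches component containers named flux<component>_<app> or zel<component>_<app>.
// Both anchors are needed so that "myapp" does not also match "myapp2".
function componentContainerRegex(appName) {
  return new RegExp(`^(?:flux|zel)[a-zA-Z0-9]+_${escapeRegex(appName)}$`);
}
```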
When a SEND component is being installed in an app whose own compose has no LOG=COLLECT component, walk every app it is networkWith-linked to and ship to the first linked app that exposes a collector. Reachability is provided by the existing networkWith wiring (sender's container is already attached to the linked app's private docker network). Enterprise linked apps whose compose is blanked in the local DB and cannot be decrypted on this node are skipped; the SEND container falls back to json-file logging with a warning. Same fallback applies if the collector container is not reachable at install time.
- new appNetworkLinker.findLinkedAppLogCollector(fullAppSpecs) that resolves the linked app + component name (handles the legacy enviromentParameters typo too)
- appDockerCreate calls it as a fallback after the existing in-compose collector lookup, only for SEND components
feat: app-to-app network linking via networkWith description token
| GitGuardian id | GitGuardian status | Secret | Commit | Filename | |
|---|---|---|---|---|---|
| 32907135 | Triggered | Generic Private Key | 83c22a1 | test-infra/fixtures/registry-tls/server-key.pem | View secret |
| 10071586 | Triggered | Generic High Entropy Secret | 0da0c94 | tests/unit/fluxCommunicationMessagesSender.test.js | View secret |
v8.13.0
A major release focused on an architectural overhaul of app state propagation, hash synchronization, node lifecycle management, and startup orchestration. ~100 commits, +23.7k / -3k lines across 100+ files.
Highlights
App state event log: Replaces the dual-collection `fluxapprunningbroadcasts` + `zelappslocation` model with a single append-only `appstateevents` log as source of truth (apprunning, sigterm, appremoved, evicted, ipchanged). `zelappslocation` is now a materialized cache derived from the event log via aggregation, eliminating the orphaned-entry class of bugs where the two collections drifted out of sync. TTL switched from mutating the signed `broadcastedAt` to an operational `expireAt` field.
Node confirmation service: New service with three-level status tracking (`isConfirmed` / `canSendMessages` / `isDaemonStale`). Outbound signed messages, peering, and hash sync are gated on confirmation status. Daemon staleness >125min triggers app removal; >320min flips confirmation off. Replaces 326k/day log-spam loop on expired nodes.
Hash sync rewrite: Multi-peer targeted requests (3 peers/round with poll-until-settled, event-driven 4s settle window), bulk threshold lowered to 500, exponential backoff (0/50min/4h/21h/4d/17d/35d → permanent after ~1yr), ephemeral peer connections to deterministic node list as fallback, and a fast bootstrap path via daemon address index (`getaddresstxids` + batch `getrawtransaction`) that cuts initial explorer sync from ~9.5h to ~4min.
FluxOS-managed container startup: Single ownership model replacing the split where Docker auto-started on powercut and FluxOS managed on clean shutdown. Container restart policy default → `no`; FluxOS now owns all startup decisions. Boot context (heartbeat with `machineBootId`, shutdown reason on SIGTERM) drives reconciliation: FluxOS restart skips recovery, expired locations trigger immediate removal, 5-min sync timeout removes apps.
Orchestrator state machine: Formalized states (INITIALIZING / SYNCING / RESYNCING / READY / DEGRADED) with deterministic transitions. Boot path gates: `daemonReady` → `confirmed` → `dbReady` → `bootContainerStateSettled`. Block-driven hash retry scheduling replaces the 30h reconstruct-tied cycle. Peer loss during SYNCING/READY transitions to DEGRADED.
Signed sync requests: Binary frame extended with `requestTimestamp` + `pubkey` + `signature` (0x20-0x23 opcodes). Handlers verify identity before opening MongoDB cursors, preventing unauthenticated peers from triggering expensive server-side work.
Performance
- `processMessages`: batch existence checks via single `$in` per 2000-message chunk, eliminated duplicate verify/read passes, batch `insertMany` + `bulkWrite`. ~58k individual ops → ~29 batch ops on bulk sync.
- Removed dead `zelAppSpecifications` full-collection scan (legacy Zel→Flux rebrand, 0 results); saved ~22 min per full hash sync.
- `appLocationFromEvents` view: optimized aggregation (~2900ms → ~26ms for targeted queries) with name filter pushed into facet sub-pipelines.
- `bulkWrite` + `updateMany` aggregation replacing 58k+ individual `updateOne` calls.
- `waitForDaemonRpc`.
Bug fixes
- `usersToExtend` on non-ArcaneOS, missing `prevSpec` decryption in `processMessages`, owner-change race (height-gated <2M for legacy network behavior).
- `updateAppSpecifications` split into `insert` (upsert) + `update` (no upsert) so the cache-update path can't resurrect cancel/expire-deleted entries. Reconstruct cycle now invalidates hash sync via `hashesReconstructed` event so newly-eligible hashes get retried.
- `prevSpecsMap` uses height-aware lookup for re-registered apps (was returning newest-by-name, picking wrong owner across registration cycles).
- `messageNotFound` block threshold corrected for 30s post-PON blocks (`* 12` → `* 48`).
- `ws.terminate()` instead of `ws.close()` (~33s → ~4s).
- `setInterval` → self-scheduling `setTimeout` to prevent concurrent RPCs.
Architecture / refactors
- `messageStore` → `appConstants`; `serialiseAndSignFluxBroadcast` extracted to `fluxBroadcastHelper`; `deleteLoginPhrase` moved from `serviceHelper` → `idService`.
- `appSyncEvents` event bus replaces mutable module state setters (`setOnSyncComplete`, EventEmitter inheritance, ad-hoc thunks).
- `fluxEventBus` publishes `confirmation:changed`, `daemon:unreachable/recovered`, `orchestrator:stateChanged`, `peers:thresholdReached/belowThreshold`, `boot:settled`.
- `AsyncGate` utility unifies the mixed resolver-array / EventEmitter awaitable patterns (`waitForDaemonReady`, `waitForDbReady`, `waitForBootComplete`, `waitForConfirmationStatus`).
- `setTimeout` recursion, split into `waitForDaemonSync` / `pollForNewBlocks` / `recoverAndRestart`.
- `stoppedAppsRecovery` → `appStartupManager` (`manageAppsOnBoot` / `monitorAndRecoverApps`); container health monitoring extracted to `containerHealthMonitor`.
- `AppSyncOrchestrator` no longer receives full `peerManager`; `appSpawner` imports `appInstaller` / `appUninstaller` directly.
Testing infrastructure
- `test-infra/` directory: dockerized 16-node test network with daemon stub, external HTTP stub, per-node config generation, single-node and full-network compose files.
- service windows, compound failures, and boundary conditions (53 tests).
- `explorer:ready` / `orchestrator:started` / `spawner:paused/resumed/blocked` SSE events for deterministic test synchronization (no more timing-based sleeps).
- `wsPingIntervalMs` / `wsMaxMissedPongs` (2s/2 in test config for fast dead-peer detection).
Config
~25 timing constants / thresholds / intervals extracted from production code into `config.fluxapps` with `??` fallback defaults. New: `maxAppsPerNode: 200` enforced by spawner and `storeAppRunningMessage`.
Test plan
- `test-infra` docker-compose network and confirm all 7 new integration suites pass
- `systemctl restart fluxos`)
- `zelappslocation` view stay in sync across gossip + sigterm + eviction
- `messageNotFound` flags on upgrade and confirm cancel/expire messages are fetched