v8.13.0 #1738

Merged: Cabecinha84 merged 401 commits into master from development on May 26, 2026
Cabecinha84 (Member) commented May 25, 2026
v8.13.0

A major release focused on an architectural overhaul of app state propagation, hash synchronization, node lifecycle management, and startup
orchestration. ~100 commits, +23.7k / -3k lines across 100+ files.

Highlights

App state event log — Replaces the dual-collection fluxapprunningbroadcasts + zelappslocation model with a single append-only
appstateevents log as source of truth (apprunning, sigterm, appremoved, evicted, ipchanged). zelappslocation is now a materialized cache derived
from the event log via aggregation, eliminating the orphaned-entry class of bugs where the two collections drifted out of sync. TTL switched from
mutating the signed broadcastedAt field to setting an operational expireAt field.
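
A minimal in-memory sketch of how a location view could be derived from an append-only event log. The event shape ({ type, ip, appName, eventAt }) and the function name are assumptions for illustration; the real view is a MongoDB aggregation and also handles sigterm grace periods and ipchanged remapping, which this sketch leaves out.

```javascript
// Hypothetical event shape: { type, ip, appName?, eventAt } with eventAt in ms.
// Only apprunning, appremoved and evicted are modelled here.
function materializeLocations(events) {
  const running = new Map(); // "ip|appName" -> newest apprunning eventAt
  const removed = new Map(); // "ip|appName" -> newest appremoved eventAt
  const evicted = new Map(); // ip -> newest evicted eventAt

  const keepNewest = (map, key, at) => {
    if (!map.has(key) || at > map.get(key)) map.set(key, at);
  };

  for (const e of events) {
    if (e.type === 'apprunning') keepNewest(running, `${e.ip}|${e.appName}`, e.eventAt);
    else if (e.type === 'appremoved') keepNewest(removed, `${e.ip}|${e.appName}`, e.eventAt);
    else if (e.type === 'evicted') keepNewest(evicted, e.ip, e.eventAt);
  }

  const locations = [];
  for (const [key, at] of running) {
    const [ip, appName] = key.split('|');
    const removedAt = removed.get(key) ?? -Infinity;
    const evictedAt = evicted.get(ip) ?? -Infinity;
    // A location survives only if its apprunning event is strictly newer
    // than every event that would suppress it.
    if (at > removedAt && at > evictedAt) locations.push({ ip, appName, broadcastedAt: at });
  }
  return locations;
}
```

Because nothing is deleted from the log, the view can be recomputed at any time, which is what removes the possibility of two stores drifting apart.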

Node confirmation service — New service with three-level status tracking (isConfirmed / canSendMessages / isDaemonStale). Outbound signed
messages, peering, and hash sync are gated on confirmation status. Daemon staleness >125min triggers app removal; >320min flips confirmation off.
Replaces 326k/day log-spam loop on expired nodes.

Hash sync rewrite — Multi-peer targeted requests (3 peers/round with poll-until-settled, event-driven 4s settle window), bulk threshold lowered
to 500, exponential backoff (0/50min/4h/21h/4d/17d/35d → permanent after ~1yr), ephemeral peer connections to deterministic node list as fallback,
and a fast bootstrap path via daemon address index (getaddresstxids + batch getrawtransaction) that cuts initial explorer sync from ~9.5h to
~4min.

FluxOS-managed container startup — Single ownership model replacing the split where Docker auto-started on powercut and FluxOS managed on clean
shutdown. Container restart policy default → no; FluxOS now owns all startup decisions. Boot context (heartbeat with machineBootId, shutdown
reason on SIGTERM) drives reconciliation: FluxOS restart skips recovery, expired locations trigger immediate removal, 5-min sync timeout removes
apps.

Orchestrator state machine — Formalized states (INITIALIZING / SYNCING / RESYNCING / READY / DEGRADED) with deterministic transitions. Boot path
gates: daemonReady → confirmed → dbReady → bootContainerStateSettled. Block-driven hash retry scheduling replaces the 30h reconstruct-tied
cycle. Peer loss during SYNCING/READY transitions to DEGRADED.

Signed sync requests — Binary frame extended with requestTimestamp + pubkey + signature (0x20-0x23 opcodes). Handlers verify identity
before opening MongoDB cursors, preventing unauthenticated peers from triggering expensive server-side work.
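
The verify-before-work ordering can be sketched as follows. This is an illustration only: it uses Node's built-in Ed25519 support rather than whatever signature scheme FluxOS uses, a string payload rather than the real binary frame, and a hypothetical 5-minute freshness window. The function names and the knownPubkeys set are likewise assumptions.

```javascript
const crypto = require('crypto');

// Hypothetical freshness window; the PR does not state the real value.
const MAX_REQUEST_AGE_MS = 5 * 60 * 1000;

// Bytes covered by the signature in this sketch (the real frame layout differs).
function signedBytes(opcode, requestTimestamp) {
  return Buffer.from(`${opcode}:${requestTimestamp}`);
}

function signSyncRequest(opcode, keyPair, now = Date.now()) {
  return {
    opcode,
    requestTimestamp: now,
    pubkey: keyPair.publicKey.export({ type: 'spki', format: 'der' }).toString('base64'),
    signature: crypto.sign(null, signedBytes(opcode, now), keyPair.privateKey).toString('base64'),
  };
}

function verifySyncRequest(request, knownPubkeys, now = Date.now()) {
  // Cheap checks first: unknown sender or stale timestamp.
  if (!knownPubkeys.has(request.pubkey)) return false;
  if (Math.abs(now - request.requestTimestamp) > MAX_REQUEST_AGE_MS) return false;
  const publicKey = crypto.createPublicKey({
    key: Buffer.from(request.pubkey, 'base64'),
    format: 'der',
    type: 'spki',
  });
  return crypto.verify(
    null,
    signedBytes(request.opcode, request.requestTimestamp),
    publicKey,
    Buffer.from(request.signature, 'base64'),
  );
}

// The expensive work (openCursor) runs only after verification succeeds.
function handleSyncRequest(request, knownPubkeys, openCursor, now = Date.now()) {
  if (!verifySyncRequest(request, knownPubkeys, now)) return null;
  return openCursor(request.opcode);
}
```

The point of the ordering is that a rejected request costs one signature check at most and never reaches the database.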

Performance

  • processMessages: batch existence checks via single $in per 2000-message chunk, eliminated duplicate verify/read passes, batch insertMany +
    bulkWrite. ~58k individual ops → ~29 batch ops on bulk sync.
  • Removed unindexed zelAppSpecifications full-collection scan (legacy Zel→Flux rebrand, 0 results) — saved ~22 min per full hash sync.
  • appLocationFromEvents view: optimized aggregation (~2900ms → ~26ms for targeted queries) with name filter pushed into facet sub-pipelines.
  • Reconstruct audit: single bulkWrite + updateMany aggregation replacing 58k+ individual updateOne calls.
  • Eliminated 2-min blind daemon-wait at startup via waitForDaemonRpc.

Bug fixes

  • 5 hash-sync signature verification edge cases: v7 marketplace team support address swap, enterprise v8 usersToExtend on non-ArcaneOS, missing
    prevSpec decryption in processMessages, owner-change race (height-gated <2M for legacy network behavior).
  • Zombie apps: updateAppSpecifications split into insert (upsert) + update (no upsert) so the cache-update path can't resurrect
    cancel/expire-deleted entries. Reconstruct cycle now invalidates hash sync via hashesReconstructed event so newly-eligible hashes get retried.
  • prevSpecsMap uses height-aware lookup for re-registered apps (was returning newest-by-name, picking wrong owner across registration cycles).
  • Sigterm event TTL extended to 125min so it outlives the apprunning events it suppresses (was 7min — apps reappeared after sigterm TTL'd).
  • messageNotFound block threshold corrected for 30s post-PON blocks (* 12 → * 48).
  • Dead peer detection uses ws.terminate() instead of ws.close() (~33s → ~4s).
  • Daemon info poll: setInterval → self-scheduling setTimeout to prevent concurrent RPCs.
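
The last fix in the list above (setInterval → self-scheduling setTimeout) follows a general pattern, sketched here. The helper name and its stop-function return value are assumptions, not the FluxOS implementation.

```javascript
// Runs `task` repeatedly, scheduling the next run only after the current one
// has finished, so two runs can never overlap even if a run is slow.
function startSelfSchedulingPoll(task, intervalMs) {
  let stopped = false;
  let timer = null;

  const tick = async () => {
    try {
      await task();
    } catch (err) {
      console.warn(`poll failed: ${err.message}`);
    }
    if (!stopped) timer = setTimeout(tick, intervalMs);
  };

  timer = setTimeout(tick, 0);

  // Returns a stop function that cancels any pending run.
  return () => {
    stopped = true;
    clearTimeout(timer);
  };
}
```

With setInterval, a poll that takes longer than the interval is started again while the previous call is still in flight; here the interval is measured from the end of one run to the start of the next.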

Architecture / refactors

  • Broke circular dependencies: TTL constants moved from messageStore → appConstants; serialiseAndSignFluxBroadcast extracted to
    fluxBroadcastHelper; deleteLoginPhrase moved from serviceHelper → idService.
  • appSyncEvents event bus replaces mutable module state setters (setOnSyncComplete, EventEmitter inheritance, ad-hoc thunks).
  • fluxEventBus publishes confirmation:changed, daemon:unreachable/recovered, orchestrator:stateChanged,
    peers:thresholdReached/belowThreshold, boot:settled.
  • AsyncGate utility unifies the mixed resolver-array / EventEmitter awaitable patterns (waitForDaemonReady, waitForDbReady,
    waitForBootComplete, waitForConfirmationStatus).
  • Block processor: eliminated self-referential setTimeout recursion, split into waitForDaemonSync / pollForNewBlocks / recoverAndRestart.
  • stoppedAppsRecovery → appStartupManager (manageAppsOnBoot / monitorAndRecoverApps); container health monitoring extracted to
    containerHealthMonitor.
  • Narrowed module interfaces: AppSyncOrchestrator no longer receives full peerManager; appSpawner imports appInstaller/appUninstaller
    directly.
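
For readers unfamiliar with the gate pattern behind the AsyncGate item above, a minimal sketch might look like this. The class body and method names (open, wait) are assumptions; only the utility name and the waitFor* function names come from the PR description.

```javascript
// A one-shot awaitable gate: wait() resolves once open() has been called,
// whether the caller started waiting before or after that moment.
class AsyncGate {
  constructor() {
    this.opened = false;
    this.waiters = [];
  }

  open() {
    if (this.opened) return;
    this.opened = true;
    // Release everyone who was already waiting.
    for (const resolve of this.waiters.splice(0)) resolve();
  }

  wait() {
    if (this.opened) return Promise.resolve();
    return new Promise((resolve) => {
      this.waiters.push(resolve);
    });
  }
}

// Hypothetical wiring for one of the named awaitables.
const daemonReadyGate = new AsyncGate();
const waitForDaemonReady = () => daemonReadyGate.wait();
```

A single primitive like this can stand in for both a resolver array and a once-listener on an EventEmitter, which is the unification the item describes.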

Testing infrastructure

  • New test-infra/ directory: dockerized 16-node test network with daemon stub, external HTTP stub, per-node config generation, single-node and
    full-network compose files.
  • 7 new integration test suites covering orchestrator state machine transitions, boot manager decision tree, spawner gate conditions, confirmation
    service windows, compound failures, and boundary conditions (53 tests).
  • explorer:ready / orchestrator:started / spawner:paused/resumed/blocked SSE events for deterministic test synchronization (no more
    timing-based sleeps).
  • WS ping/pong intervals configurable via wsPingIntervalMs / wsMaxMissedPongs (2s/2 in test config for fast dead-peer detection).

Config

~25 timing constants / thresholds / intervals extracted from production code into config.fluxapps with ?? fallback defaults. New:
maxAppsPerNode: 200 enforced by spawner and storeAppRunningMessage.
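
The choice of ?? over || for fallbacks matters when a configured value is legitimately 0. A small sketch, in which settleWindowMs is a hypothetical key and only maxAppsPerNode: 200 comes from the description above:

```javascript
// Fallback defaults; settleWindowMs is a made-up key for illustration.
const DEFAULTS = { maxAppsPerNode: 200, settleWindowMs: 4000 };

// ?? falls back only on null/undefined, so an explicit 0 is respected.
function readFluxAppsSetting(config, key) {
  return config?.fluxapps?.[key] ?? DEFAULTS[key];
}

// For comparison: || treats 0 as "missing" and silently restores the default.
function readFluxAppsSettingWithOr(config, key) {
  return config?.fluxapps?.[key] || DEFAULTS[key];
}
```

For example, a test config that sets a timing value to 0 keeps that 0 with the first helper and gets 4000 back from the second.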

Test plan

  • Bootstrap a fresh node from scratch — verify hash sync completes via daemon address index (~4min target) without ~9.5h block-by-block fallback
  • Run 16-node test-infra docker-compose network and confirm all 7 new integration suites pass
  • Verify FluxOS-restart boot path skips app recovery (preserves running containers across systemctl restart fluxos)
  • Verify unclean-shutdown / powercut boot path correctly reconciles via FluxOS rather than Docker auto-start
  • Confirm event log + materialized zelappslocation view stay in sync across gossip + sigterm + eviction
  • Validate node confirmation gate: unconfirmed nodes don't peer, don't send signed messages, but still receive passive gossip
  • Stress test: induce peer drop during SYNCING/RESYNCING and confirm DEGRADED transition + recovery to READY
  • Verify zombie-app recovery: simulate stale messageNotFound flags on upgrade and confirm cancel/expire messages are fetched

MorningLightMountain713 and others added 30 commits May 14, 2026 13:38
Three changes to eliminate orphaned entries between collections:

1. break → continue in storeAppRunningMessage loop: for v2 messages
   with multiple apps, skip apps that already have current data but
   keep processing the rest. Previously broke out of the entire loop.

2. storeAppRunningMessage returns { stored, rebroadcast } instead of
   true/false. The gossip handler only calls storeSignedAppRunningBroadcast
   when stored is true, ensuring both collections accept or reject together.

3. Remove redundant 5-minute gossip validity check from
   storeSignedAppRunningBroadcast — it's now gated on the location
   store's acceptance, eliminating the timing edge where one store
   accepts at the boundary and the other rejects milliseconds later.
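
Changes 1 and 2 can be illustrated with a simplified in-memory version. The function name, the Map-based store, and the rule that rebroadcast mirrors stored are assumptions made for this sketch.

```javascript
// `existing` maps app name -> broadcastedAt of the data already held.
function storeRunningApps(apps, existing, broadcastedAt) {
  let stored = false;
  for (const app of apps) {
    const current = existing.get(app.name);
    // Skip only this app when current data is already held.
    // The earlier code used `break` here, dropping every later app too.
    if (current !== undefined && current >= broadcastedAt) continue;
    existing.set(app.name, broadcastedAt);
    stored = true;
  }
  // A structured result lets the caller gate the second store on `stored`.
  return { stored, rebroadcast: stored };
}
```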

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The sigterm handler was mutating broadcastedAt on location records to
force 7-minute TTL expiry. This broke the data contract — broadcastedAt
is derived from signed data and should never change. Stale gossip could
also overwrite the sigterm by passing the "is newer" check against the
fake broadcastedAt value.

Switch all 6 ephemeral collections to expireAt-based TTL (TTL index on expireAt with expireAfterSeconds: 0).
expireAt is operational metadata we control, not part of the signed
payload. Sigterm now sets expireAt = now + 7min on both locations and
signed broadcasts without touching broadcastedAt.

Also: split gossip validity (5min) from record expiry into named
constants, add missing expireAt to error stores, fix empty-apps v2
handler to clean up signed broadcasts with broadcastedAt guard.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
nodeStatusMonitor and storeAppRemovedMessage deleted from
zelappslocation without touching fluxapprunningbroadcasts, leaving
orphaned signed broadcasts (~44 per 20-minute monitor cycle).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- storeAppRemovedMessage: $addToSet excludedApps on v2 broadcast docs
  so the derived view skips removed apps without mutating signed data
- storeSignedAppRunningBroadcast + batch sync: $unset excludedApps
  when a newer broadcast upserts (clears stale exclusions)
- appLocationFromBroadcasts: filter out excluded apps after v2 unwind
- reindexGlobalAppsLocation: also drop running broadcasts collection
- explorer rescan: also drop running + installing broadcasts
- Export handleMissingMasterSlaveContainer from stoppedAppsRecovery
- Fix all 10 CI test failures, add excludedApps tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Single `appstateevents` collection replaces `fluxapprunningbroadcasts`
as the source of truth. Four event types (apprunning, sigterm,
appremoved, evicted) with dedupKey-based upserts and $cond timestamp
guards. `zelappslocation` stays populated as materialized cache.

- storeAppStateEvent() dispatcher with APP_STATE_EVENT_TYPES enum
- storeBatchAppRunningEvents() for sync receiver
- Gossip handler writes event unconditionally, then materializes location
- Sigterm/appremoved/evicted all append events instead of mutating
- Sync sender/receiver stream from event log
- Remove storeSignedAppRunningBroadcast, excludedApps, gossip gating
- 99 tests passing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The view now filters appremoved, sigterm, and evicted events, excludes
stale v1 broadcasts superseded by newer v2, and correctly handles
expired shutdown events. Verified against charlie live data (0 diff).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The event store was accepting gossip up to 125min old (RUNNING_EXPIRY_MS)
instead of 5min (GOSSIP_VALIDITY_MS). Only the batch sync path should
accept older messages.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
nodeStatusMonitor deletes locations immediately on eviction, but the
view was giving evicted IPs the same 7-minute grace period as sigterm.
Eviction should be immediate — the monitor already verified the node
is gone. Also extend eviction TTL to match apprunning (125min) so the
eviction event outlives the apprunning events it suppresses.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
storeSignedAppRunningBroadcast no longer exists — stub storeAppStateEvent
instead. Sigterm handler now calls updateInDatabase once (location expiry
only) not twice, and storeAppStateEvent needs stubbing to prevent throw.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix undefined appsRunningBroadcasts in apiServer.js sigterm handler,
  add storeAppStateEvent(SIGTERM) call for own shutdown
- Escape regex in appLocationFromBroadcasts to prevent injection
- Cap sync response batch size at 2500 in all 4 handlers
- Add IPCHANGED event type with view remapping so IP changes are
  reflected in the event log view
- Await all storeAppStateEvent calls (was fire-and-forget)
- Use ?? instead of || for config fallbacks in orchestrator
- Optimise appLocationFromBroadcasts pipeline: $arrayToObject/$getField
  for O(1) lookups instead of $filter scans (2900ms → 118ms), push
  name filter into facet sub-pipelines (2666ms → 26ms for targeted)
- Standardise $gt (not $gte) for "only if newer" guards
- Add {createdAt: 1} index for sync sender evicted event queries
- Hash sync failure recovery: retry 3x with 5-min gap, block timer
  fallback if retries exhausted, background 20-min recheck on
  blockReceived for missing hashes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests cover: retry on failure, block timer fallback when retries
exhausted, readiness via block timer when hash sync never completes,
and DB rebuild failure not blocking the state machine.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extract streamBatchedSync helper from 3 nearly identical respondWith*
functions. Rename MIN_SYNC_PEERS to MIN_SYNC_COMPLETIONS for clarity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
$getField with dynamic field references requires MongoDB 7.2+
(SERVER-74371). CI runs 7.0. Replaced $arrayToObject/$getField O(1)
maps with $filter/$first lookups against small arrays. Structural
optimization preserved: shutdown/v1 filtering at IP level before
unwinding. Estimated ~200-300ms at full scale vs 118ms with $getField
vs 2900ms with the original post-unwind approach.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- handleAppRunningEvent: reject empty-apps v2 when no prior events
  exist for that IP (matches location store behavior independently)
- handleNodeSigtermMessage: check event log for app events instead
  of zelappslocation, so sigterm handling works without locations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rename to reflect event log architecture (broadcasts no longer exist).
Change signature from positional appname to options object { appname, ip }
to support IP filtering. Sigterm handler now uses the full view derivation
to check for apps instead of a naive event log findOne.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Stores the time each node received/processed the event, alongside the
original broadcastedAt from the source node. The delta reveals gossip
propagation latency and helps diagnose messages that arrive near the
5-minute validity boundary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Gossip path sets receivedAt on insert. Batch sync path preserves the
sender's receivedAt so the original gossip reception time is retained
across sync. Enables propagation latency diagnostics on installing
and install error broadcasts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sigterm events had 7-min TTL matching the grace period, but apprunning
events have 125-min TTL. After the sigterm TTL'd away, apps reappeared
in the view with nothing to suppress them. Same race as the evicted
TTL bug.

Fix: sigterm event expireAt uses RUNNING_EXPIRY_MS (125 min) so the
document outlives every apprunning it suppresses. The 7-min grace
period is computed from eventAt in the view pipeline, not from expireAt.
Export SIGTERM_EXPIRY_MS and use it in fluxCommunication.js and
apiServer.js instead of hardcoded 420*1000.
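
The separation between document lifetime and grace period can be sketched like this. SIGTERM_GRACE_MS and both function names are hypothetical, and the reading that apps stay visible during the grace period and are suppressed afterwards is an interpretation of the text above.

```javascript
const RUNNING_EXPIRY_MS = 125 * 60 * 1000; // document lifetime, per the fix above
const SIGTERM_GRACE_MS = 7 * 60 * 1000; // hypothetical name for the 7-min grace

// expireAt controls when the document is deleted, not when it takes effect.
function buildSigtermEvent(ip, eventAt) {
  return {
    type: 'sigterm',
    ip,
    eventAt,
    expireAt: new Date(eventAt + RUNNING_EXPIRY_MS),
  };
}

// The grace period is derived from eventAt at query time.
function isSuppressedBySigterm(sigtermEvent, now) {
  return now - sigtermEvent.eventAt >= SIGTERM_GRACE_MS;
}
```

Keeping the two durations independent means the suppressing document can never disappear before the apprunning events it is meant to hide.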

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The previous hash sync sent fluxapprequest to a single random peer per
attempt, with a fixed 30s wait that couldn't cover the 75s response time
for 500 hashes (150ms per hash on the responder). It also broke out on
zero progress and reused the same peers.

New algorithm:
- Bulk threshold lowered from 1000 to 500 (matching fluxapprequest v2 cap)
- Targeted path sends to 3 peers per round with poll-until-settled
- Timeout proportional to hash count (count × 150ms + 5s buffer)
- Settle detection: exits early when no new responses for 4s
- Tracks tried peers — never repeats across rounds
- Continues through all rounds regardless of per-round progress
- Excludes deterministic peers (same-provider neighbors)
- Bulk path aggregates responses from all peers instead of picking largest
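
The proportional timeout and the settle detection from the list above can be sketched as below. The function names, the 250 ms poll interval, the injectable clock, and the rule that settling requires at least one response are assumptions; only the 150 ms per hash, 5 s buffer, and 4 s settle window come from the commit message.

```javascript
const PER_HASH_MS = 150;
const BUFFER_MS = 5000;
const SETTLE_MS = 4000;

// count × 150ms + 5s buffer, e.g. 500 hashes -> 80 000 ms.
function hashRequestTimeoutMs(count) {
  return count * PER_HASH_MS + BUFFER_MS;
}

const realSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Polls a response counter and exits early once it has stopped changing
// for SETTLE_MS; otherwise gives up when timeoutMs has elapsed.
async function pollUntilSettled({
  getResponseCount,
  timeoutMs,
  pollMs = 250,
  now = Date.now,
  sleep = realSleep,
}) {
  const start = now();
  let lastCount = getResponseCount();
  let lastChange = start;
  while (now() - start < timeoutMs) {
    await sleep(pollMs);
    const count = getResponseCount();
    if (count !== lastCount) {
      lastCount = count;
      lastChange = now();
    } else if (count > 0 && now() - lastChange >= SETTLE_MS) {
      return { settled: true, count };
    }
  }
  return { settled: false, count: lastCount };
}
```

With 500 hashes the responder needs about 75 s, so the computed 80 s ceiling covers it, while the settle check lets short rounds finish in a few seconds.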

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Moves the broadcast signing logic out of fluxCommunicationMessagesSender
into utils/fluxBroadcastHelper. This breaks the circular dependency that
prevented appHashSyncService from sending signed messages to peers
(messageStore → messageVerifier → fluxCommunicationMessagesSender).

appHashSyncService now uses fluxBroadcastHelper directly to sign and
send fluxapprequest messages via peer.send().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The cycle messageVerifier → registryManager → messageStore → messageVerifier
caused messageVerifier exports to be empty at load time, breaking
checkAppMessageExistence for gossip message handling.

Root cause: registryManager imported SIGTERM_EXPIRY_MS from messageStore,
which created a circular require chain during module initialization.

Fix: Move all TTL/expiry constants (GOSSIP_VALIDITY_MS, RUNNING_EXPIRY_MS,
INSTALLING_EXPIRY_MS, INSTALLING_ERRORS_EXPIRY_MS, SIGTERM_EXPIRY_MS,
EVICTED_EXPIRY_MS) from messageStore to appConstants. Update all consumers
to import from appConstants instead.

Also extracts serialiseAndSignFluxBroadcast into utils/fluxBroadcastHelper
to cleanly separate broadcast signing from peer routing logic.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The ephemeral sync receiver stored appremoved/sigterm/evicted events to
the event log but didn't apply their location side-effects. This caused
syncing nodes to have stale locations that the sender had already deleted.

- appremoved: delete location entry for {ip, appName}
- sigterm: update expireAt on all locations for that IP
- evicted: delete all locations for that IP

Also gates ephemeral sync on network state readiness — the orchestrator
now requires both peer threshold AND node list populated before firing
sync requests. Prevents verification failures from unloaded node list.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
apiServer.handleSigterm used the old { ip, broadcastedAt, envelope }
format and referenced messageStore.SIGTERM_EXPIRY_MS which was moved
to appConstants. Updated to pass { message, envelope } so the full
signed payload is stored for sync re-verification.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sigterm, appremoved, and ipchanged event handlers were stripping type
and version fields from the stored data. When these events were synced
to another node, re-verification failed because the signature was
computed over the original full message, not the stripped version.

Now stores the complete message object as data so envelope + data
can be reconstructed for verification during sync.

Also updates all callers to pass { message, envelope } instead of
individual fields.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add TTL constants to appConstants stub (moved from messageStore)
- Update sigterm test to use { message, envelope } format

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
processMessages now checks permanent message existence in chunks of
2000 using a single $in query instead of individual findOne per
message. Existing hashes are batch-marked as message:true via
bulkWrite. Only genuinely new messages go through the sequential
storeAppTemporaryMessage + checkAndRequestApp path.

For the common case (most messages already exist), this reduces
~58k individual DB reads + writes to ~29 batch operations.
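
A sketch of the chunked existence check. The helper names and the hash field name are assumptions; collection is expected to behave like a MongoDB driver collection, where find(query) returns a cursor with toArray().

```javascript
function chunkArray(items, size) {
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// One $in query per chunk instead of one findOne per hash.
async function findExistingHashes(collection, hashes, chunkSize = 2000) {
  const existing = new Set();
  let queries = 0;
  for (const part of chunkArray(hashes, chunkSize)) {
    const docs = await collection.find({ hash: { $in: part } }).toArray();
    queries += 1;
    for (const doc of docs) existing.add(doc.hash);
  }
  return { existing, queries };
}
```

For 58,000 hashes and a chunk size of 2000 this issues ceil(58000 / 2000) = 29 queries, which matches the ~29 batch operations quoted above.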

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The update path in checkAndRequestApp did two queries per message to
find the latest permanent message for an app name:
1. find({appSpecifications.name}) — loaded all docs, iterated in JS
2. find({zelAppSpecifications.name}) — full collection scan (no index,
   0 results — legacy field from Zel→Flux rebrand, never populated)

Combined cost: ~48ms per message on 35k-doc collection. For 58k
messages during bulk hash sync, this added ~22 minutes of pure waste.

Fix:
- Remove zelAppSpecifications query entirely (dead code)
- Replace find-all + JS iterate with findOne using sort:{height:-1}
  which leverages the existing {appSpecifications.name:1, height:-1}
  compound index

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the storeAppTemporaryMessage + checkAndRequestApp per-message
flow with a single-pass verify-and-batch-insert approach:

- Skips temp message storage entirely (was write + immediate read-back)
- Eliminates 3 duplicate DB reads per message (existence checks done
  twice, getPreviousAppSpecifications done twice)
- Eliminates duplicate signature verification
- Pre-loads previous app specs for update messages per chunk (one $in
  query replaces N individual find-all queries)
- Batch inserts permanent messages via insertMany
- Batch marks hashes via bulkWrite
- Keeps: hash verification, signature verification, app spec validation,
  price validation, name conflict checks for registers

Also removes dead zelAppSpecifications query in messageVerifier
checkAndRequestApp (unindexed full collection scan, 0 results).
Replaces find-all + JS iterate with indexed findOne for update path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
validatePrice called appPricePerMonth (async) without await, causing
price comparisons against Promise objects. Also restores
specificationFormatter for consistent spec formatting before
signature verification.
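
The missing-await failure mode is easy to reproduce in isolation. In this sketch the pricing formula and the function signatures are invented; only the names validatePrice and appPricePerMonth come from the commit message.

```javascript
// Stand-in for the real async price lookup (formula is made up).
async function appPricePerMonth(spec) {
  return spec.cpu * 10;
}

// Bug: `price` is a Promise, which coerces to NaN in a numeric comparison,
// so this check returns false no matter how much was paid.
function validatePriceBuggy(spec, paid) {
  const price = appPricePerMonth(spec);
  return paid >= price;
}

// Fix: await the lookup so the comparison sees a number.
async function validatePrice(spec, paid) {
  const price = await appPricePerMonth(spec);
  return paid >= price;
}
```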

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Registrations verified within a chunk are added to the prevSpecsMap
so that updates later in the same chunk can find their previous
specs without a DB round-trip. Eliminates the 30% failure rate where
updates couldn't find registrations from the same chunk.

The map is pre-loaded from DB per chunk (for cross-chunk lookups)
and grown as registrations are verified. Memory bounded by unique
app names per chunk.
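
A simplified sketch of growing the map while a chunk is processed. The message shape and function name are assumptions, all verification is omitted, and the map is keyed by name only, whereas the description of this PR notes that the final lookup is height-aware.

```javascript
// prevSpecsMap: app name -> most recent known spec (pre-loaded from the DB).
function processChunk(messages, prevSpecsMap) {
  const accepted = [];
  const rejected = [];
  for (const msg of messages) {
    if (msg.type === 'register') {
      // Make the new registration visible to later updates in this chunk.
      prevSpecsMap.set(msg.name, msg.spec);
      accepted.push(msg);
    } else if (msg.type === 'update') {
      if (!prevSpecsMap.has(msg.name)) {
        rejected.push(msg); // no previous spec to verify against
        continue;
      }
      prevSpecsMap.set(msg.name, msg.spec);
      accepted.push(msg);
    }
  }
  return { accepted, rejected };
}
```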

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MorningLightMountain713 and others added 22 commits May 20, 2026 20:58
Adds nodeConfigOverrides option to createTestEnv — a map of node index
to config that merges on top of the global configOverrides. This allows
setting different config on specific nodes, e.g. appSyncMinCompletions=3
only on the joining node without affecting source nodes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Only one joining node is needed. Also set appSyncPeerThreshold=3 so
the peer threshold fires after 3 peers connect, matching the
appSyncMinCompletions=3 requirement.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Health check timeout (5s) exceeded interval (3s), causing Docker's
health state machine to produce spurious "unhealthy" on container
restart. Reduced timeout to 2s across all container health checks.

Docker's CloseMonitorChannel sets health status to "unhealthy" during
monitor teardown (moby/daemon/container/health.go:80). On restart,
HealthCheckWaitStrategy sees this transient state and destroys the
container. Replaced restartNode to swap in an HTTP-polling wait
strategy that bypasses Docker's health state machine entirely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bulk permanent message fetch now partitions missing hashes across peers
and streams in parallel via Promise.allSettled, instead of sequential
single-peer streaming. Each stream maintains its own 500-message
backpressure — peak memory is ~1500 messages vs 500 previously.

Targeted fetch and ephemeral rounds now chunk hashes into groups of 500
before calling broadcastHashRequest, fixing a latent bug where >500
hashes would exceed the fluxapprequest v2 message cap.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Parallel bulk fetch caused ~10-25% failure rate per batch because
update messages couldn't find predecessor specs processed on other
streams. Reverted to sequential streaming which maintains height
ordering across all messages.

Kept the broadcastHashRequest chunking at 500 for targeted fetch
rounds and ephemeral rounds (latent bug fix).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The first checkAndNotifyPeersOfRunningApps call was triggered by
peer threshold, before appStartupManager finished reconciling
containers. This caused the broadcast to report 0 apps because
Docker containers hadn't been started yet. The next broadcast
wouldn't fire for an hour (peerNotifyIntervalMs).

Gate the first broadcast behind waitForBootContainerStateSettled()
so it runs after reconciliation completes and Docker state is
accurate.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The first broadcast was racing with appStartupManager and reporting 0
apps. This test verifies the app:running SSE event includes the
reconciled app after a simulated reboot, catching the race if the
broadcast gate on boot:settled is removed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
If the HTTP poll times out, log a warning instead of throwing.
Throwing triggers testcontainers' waitForContainer error handler
which destroys the container, making the failure undiagnosable.
The test's own assertions will catch the actual problem.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
appHashSyncService.js: messageStore, globalState
appStartupManager.js: decryptEnterpriseApps, appUsesGSyncthingMode
serviceManager.js: hashSyncIntervalMs, peerNotifyIntervalMs, locationTtlS,
  installingTtlS, installErrorTtlS, removalSpacingMs (dead — old interval
  logic moved to orchestrator)
nodeStatusMonitor.js: fluxEventBus
messageVerifier.js: scannedHeightCollection

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevents a re-processed registration from overwriting a newer update
spec. Mirrors the existing guard in updateAppSpecifications.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents the known divergence between secure and non-secure nodes
for enterprise usersToExtend updates, and the planned resolution
via Arcane attestations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The return value of apps.filter() was discarded, causing
already-resolved apps to be re-requested via
checkAndRequestMultipleApps. Idempotent but wasteful.
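
The underlying mistake is that Array.prototype.filter returns a new array and leaves the original untouched. A reduced illustration, with hypothetical function names and data shapes:

```javascript
// Bug: the filtered array is thrown away, so every app is still returned.
function pendingAppsBuggy(apps, resolvedNames) {
  apps.filter((app) => !resolvedNames.has(app.name));
  return apps;
}

// Fix: use the array that filter returns.
function pendingApps(apps, resolvedNames) {
  return apps.filter((app) => !resolvedNames.has(app.name));
}
```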

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
chunk.toString() can corrupt multi-byte UTF-8 characters split
across chunk boundaries. StringDecoder buffers incomplete characters
across writes.
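
The difference is reproducible with Node's built-in string_decoder module. The helper names below are illustrative.

```javascript
const { StringDecoder } = require('string_decoder');

// Decodes each chunk on its own: a character split across chunks is lost.
function decodeNaively(chunks) {
  return chunks.map((chunk) => chunk.toString()).join('');
}

// StringDecoder holds back incomplete byte sequences until the rest arrives.
function decodeWithStringDecoder(chunks) {
  const decoder = new StringDecoder('utf8');
  let text = '';
  for (const chunk of chunks) text += decoder.write(chunk);
  return text + decoder.end();
}

// '€' is three bytes in UTF-8 (e2 82 ac); split it after the first byte.
const euroBytes = Buffer.from('€');
const splitChunks = [euroBytes.subarray(0, 1), euroBytes.subarray(1)];
```

Passing splitChunks to decodeNaively yields replacement characters, while decodeWithStringDecoder returns the original '€'.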

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pre-existing test fixture private key, not introduced by this PR
but file was modified. Added to GitGuardian ignored_paths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The broadcast gate change requires the globalState stub to provide
waitForBootContainerStateSettled, otherwise the broadcast promise
never resolves and the test fails.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace .catch(() => {}) with warnings that include the network
name and component. Silent swallowing masked resource leaks that
caused intermittent failures in later suites.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: event-driven app state sync with event log
An app owner can link an app to other apps by embedding a token in the
app description text: networkWith:[appA,appB] (brackets required, quotes
optional, key case-insensitive, comma separated). This is purely
node-local behaviour — no app specification field, no validation change,
no network consensus impact.
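
A possible parser for the token format just described, offered as a sketch rather than the code in appNetworkLinker.js. The function name, the tolerance for whitespace around separators, and the de-duplication are assumptions.

```javascript
// Extracts linked app names from a description such as
// "My service networkWith:[appA,appB]". Returns [] when no token is present.
function parseNetworkWith(description) {
  // Key is case-insensitive; brackets are required.
  const match = /networkwith\s*:\s*\[([^\]]*)\]/i.exec(description ?? '');
  if (!match) return [];
  const names = match[1]
    .split(',')
    .map((part) => part.trim().replace(/^["']|["']$/g, '')) // quotes optional
    .filter(Boolean);
  return [...new Set(names)];
}
```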

When the token is present:
- Before install/redeploy, the node verifies every named app is
  installed locally and owned by the same owner; otherwise the operation
  fails.
- Each of the app's component containers is attached to the private
  docker network of every linked app (fluxDockerNetwork_<linked>), so it
  can reach that app's components by docker DNS name
  flux<component>_<linkedApp>, as if both apps were a single app.
- When a linked-to app is (re)deployed, any locally installed app that
  is networked with it is reconnected to its network.

New module appNetworkLinker.js holds the parser, the install gate, and
the forward/reverse network wiring. The gate and forward wiring run in
installApplicationHard/installApplicationSoft (the only callers of
appDockerCreate), so every container-creation path is covered, including
direct callers that bypass registerAppLocally (container health recovery
and legacy v<=3 redeploys). Reverse wiring runs in registerAppLocally and
softRegisterAppLocally; a boot-time reconcile sweep re-applies all links.

dockerService gains an idempotent appDockerNetworkConnect helper.

Adds tests/unit/appNetworkLinker.test.js (parser, gate, wiring,
reconcile) and appDockerNetworkConnect coverage in dockerService.test.js.
- extract APP_NAME_REGEX (v8+) and APP_NAME_REGEX_LEGACY (v<=7 / components)
  into appConstants; consume from appValidator and appNetworkLinker
- move getAppContainerNames / getAppContainerObjects into dockerService;
  anchor the multi-component match to ^(?:flux|zel)[a-zA-Z0-9]+_<app>$ and
  escape regex metacharacters in the app name; refactor
  getNextAvailableIPForApp to use the same helper
- rewrite appDockerNetworkConnect to inspect the container's
  NetworkSettings.Networks first and skip the connect when already
  attached; drop the blanket 403 catch (overloaded by docker) in favour of
  a narrow already-exists message match as a TOCTOU race fallback
- update affected unit tests
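
The anchored, escaped container-name match from the list above could look roughly like this. The helper names are assumptions; the pattern itself is the one quoted in the commit message.

```javascript
// Escapes characters that have special meaning in a regular expression.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Matches component containers of exactly this app: flux<component>_<app>
// or the legacy zel<component>_<app>, anchored at both ends.
function componentContainerRegex(appName) {
  return new RegExp(`^(?:flux|zel)[a-zA-Z0-9]+_${escapeRegex(appName)}$`);
}

function filterAppContainerNames(containerNames, appName) {
  const pattern = componentContainerRegex(appName);
  return containerNames.filter((name) => pattern.test(name));
}
```

Anchoring prevents an app named myapp from also matching containers of myapp2, and escaping prevents a name containing regex metacharacters from widening the match.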
When a SEND component is being installed in an app whose own compose has no
LOG=COLLECT component, walk every app it is networkWith-linked to and ship
to the first linked app that exposes a collector. Reachability is provided
by the existing networkWith wiring (sender's container is already attached
to the linked app's private docker network).

Enterprise linked apps whose compose is blanked in the local DB and cannot
be decrypted on this node are skipped — the SEND container falls back to
json-file logging with a warning. Same fallback applies if the collector
container is not reachable at install time.

- new appNetworkLinker.findLinkedAppLogCollector(fullAppSpecs) that resolves
  the linked app + component name (handles the legacy enviromentParameters
  typo too)
- appDockerCreate calls it as a fallback after the existing in-compose
  collector lookup, only for SEND components
feat: app-to-app network linking via networkWith description token
gitguardian (Bot) commented May 25, 2026

⚠️ GitGuardian has uncovered 2 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request
| GitGuardian id | GitGuardian status | Secret | Commit | Filename |
|---|---|---|---|---|
| 32907135 | Triggered | Generic Private Key | 83c22a1 | test-infra/fixtures/registry-tls/server-key.pem |
| 10071586 | Triggered | Generic High Entropy Secret | 0da0c94 | tests/unit/fluxCommunicationMessagesSender.test.js |
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secrets safely. Learn here the best practices.
  3. Revoke and rotate these secrets.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.


alihm (Member) left a comment

Ack

MorningLightMountain713 (Contributor) left a comment

Ack

@Cabecinha84 Cabecinha84 merged commit 3a48aa1 into master May 26, 2026
4 of 5 checks passed