Summary
While piloting an external Bifrost worker in Azure Container Apps, long-lived Redis pub/sub listener sockets were reset periodically even though normal job execution continued to work.
This seems worth hardening in Bifrost because externalized workers may run across managed container/network boundaries where idle TCP handling is stricter than a local Docker Compose network.
Observed behavior
- Compose-host workers stayed connected to Redis without the same errors.
- The external ACA worker could connect to RabbitMQ, consume jobs, run workflows, and report results.
- The external worker repeatedly logged Redis listener reconnects for process-pool control channels:
Cancel listener error: Error while reading from <redis-ip>:6379 : (104, 'Connection reset by peer'); reconnecting in 1s
Command listener error: Error while reading from <redis-ip>:6379 : (104, 'Connection reset by peer'); reconnecting in 1s
- Redis server health looked clean during the resets: no rejected clients, no blocked clients, no Redis crash/restart, and RDB saves completed successfully.
Environment shape
- Redis remained on the Compose VM.
- One worker ran externally in Azure Container Apps over private network connectivity to the VM.
- Redis server default tcp-keepalive was 300 seconds.
- The relevant worker process-pool code appears to use long-lived pub/sub sockets without an explicit socket keepalive or health-check interval.
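For context, this is roughly what explicit socket keepalive means at the TCP level. It is a minimal standard-library sketch, not Bifrost or redis-py code: the helper name enable_tcp_keepalive is hypothetical, the 60-second idle time mirrors the value we are piloting, and the probe interval and count are illustrative assumptions.

```python
import socket


def enable_tcp_keepalive(sock, idle=60, interval=15, probes=4):
    """Turn on TCP keepalive and tune the timers where the platform exposes them."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The fine-grained timer options are platform-specific (Linux names shown),
    # so any option the running platform does not define is skipped.
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # seconds of idleness before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes (assumed value)
        ("TCP_KEEPCNT", probes),      # failed probes before the socket is dropped (assumed value)
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock


sock = enable_tcp_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

With probes sent more often than the intermediate network's idle timeout, an otherwise silent pub/sub socket keeps generating traffic, which is the behavior the listeners currently seem to lack.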
Local mitigation that appears appropriate for the pilot
We are testing two infrastructure-level mitigations:
- Redis server: lower tcp-keepalive to 60 seconds.
- External worker Redis URL: add redis-py options via query string:
health_check_interval=60
socket_keepalive=True
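The worker-side change can be sketched as a small URL rewrite. This uses only the standard library; the helper name with_redis_options and the host redis.example.internal are hypothetical placeholders, and the sketch assumes the worker hands its Redis URL to redis-py's from_url, which reads connection options such as these from the query string.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_redis_options(url, **options):
    """Return url with the given options merged into its query string."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    # Options passed here override any value already present in the URL.
    query.update({key: str(value) for key, value in options.items()})
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical host; the option values match the ones under test in the pilot.
url = with_redis_options(
    "redis://redis.example.internal:6379/0",
    health_check_interval=60,
    socket_keepalive=True,
)
print(url)
```

The resulting URL is what we set on the external worker only; the Compose-host workers keep their existing URL.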
Request
Please consider making Redis connection keepalive/health-check behavior first-class for Bifrost worker pub/sub listeners, especially the process-pool command and cancel listeners.
Possible directions:
- set explicit redis-py health_check_interval / socket_keepalive for long-lived Redis clients;
- expose worker Redis socket keepalive settings via Bifrost config/env vars;
- document the required Redis/network settings for external workers running outside the Compose host;
- add a small reconnect/health metric so operators can distinguish harmless reconnect noise from actual Redis instability.
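To illustrate the last direction, here is one possible shape for a reconnect metric: a sliding-window counter that a listener loop could bump each time it reconnects. Everything in it (the ReconnectTracker name, the 10-minute window, the injectable clock) is an assumption for illustration, not existing Bifrost code.

```python
import time
from collections import deque


class ReconnectTracker:
    """Counts listener reconnects overall and within a sliding time window."""

    def __init__(self, window_seconds=600.0, clock=time.monotonic):
        self.window_seconds = window_seconds
        self._clock = clock
        self._events = deque()
        self.total = 0

    def record(self):
        """Call once per reconnect, e.g. from the listener's error handler."""
        self._events.append(self._clock())
        self.total += 1

    def recent(self):
        """Number of reconnects that happened within the last window."""
        cutoff = self._clock() - self.window_seconds
        while self._events and self._events[0] < cutoff:
            self._events.popleft()
        return len(self._events)


# Demonstration with a fake clock: reconnects at t=0s, 300s and 600s.
now = [0.0]
tracker = ReconnectTracker(window_seconds=600.0, clock=lambda: now[0])
for timestamp in (0.0, 300.0, 600.0):
    now[0] = timestamp
    tracker.record()
now[0] = 700.0
recent_count = tracker.recent()  # the reconnect at t=0s has aged out of the window
```

A steady, low recent count at a regular interval would point to idle-timeout resets like the ones above, while a sudden spike would be a better signal of real Redis instability.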
This is not blocking basic execution in our pilot, but it is noisy and could become operationally significant for managed worker deployments.