Redis pub/sub sockets reset when externalizing workers to managed container services #242

## Summary

While piloting an external Bifrost worker in Azure Container Apps, long-lived Redis pub/sub listener sockets were reset periodically even though normal job execution continued to work.

This seems worth hardening in Bifrost because externalized workers may run across managed container/network boundaries where idle TCP handling is stricter than a local Docker Compose network.

## Observed behavior

- Compose-host workers stayed connected to Redis without the same errors.
- The external ACA worker could connect to RabbitMQ, consume jobs, run workflows, and report results.
- The external worker repeatedly logged Redis listener reconnects for process-pool control channels:
  - `Cancel listener error: Error while reading from <redis-ip>:6379 : (104, 'Connection reset by peer'); reconnecting in 1s`
  - `Command listener error: Error while reading from <redis-ip>:6379 : (104, 'Connection reset by peer'); reconnecting in 1s`
- Redis server health looked clean during the resets: no rejected clients, no blocked clients, no Redis crash/restart, and RDB saves completed successfully.

## Environment shape

- Redis remained on the Compose VM.
- One worker ran externally in Azure Container Apps over private network connectivity to the VM.
- The Redis server used its default `tcp-keepalive` of `300` seconds.
- The relevant worker process-pool code appears to use long-lived pub/sub sockets without an explicit socket keepalive or health-check interval.

## Local mitigation that appears appropriate for the pilot

We are testing two infrastructure-level mitigations:

- Redis server: lower `tcp-keepalive` to `60` seconds.
- External worker Redis URL: add redis-py options via query string:
  - `health_check_interval=60`
  - `socket_keepalive=True`
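
As a rough illustration of the worker-side change, the sketch below appends the two options to an existing Redis URL using only the standard library. The helper name `with_keepalive_options` and the host `redis.internal` are placeholders invented for this example, not anything in Bifrost.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_keepalive_options(redis_url: str, health_check_interval: int = 60) -> str:
    """Return redis_url with the keepalive-related query options added.

    Hypothetical helper for illustration; not part of Bifrost or redis-py.
    """
    parts = urlsplit(redis_url)
    # Preserve any options that are already present in the URL.
    options = dict(parse_qsl(parts.query))
    options["health_check_interval"] = str(health_check_interval)
    options["socket_keepalive"] = "True"
    return urlunsplit(parts._replace(query=urlencode(options)))


if __name__ == "__main__":
    # "redis.internal" is a placeholder host name.
    print(with_keepalive_options("redis://redis.internal:6379/0"))
```

The resulting URL is the kind of value we pass to the external worker; redis-py's `from_url` should pick up both options from the query string.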

## Request

Please consider making Redis connection keepalive/health-check behavior first-class for Bifrost worker pub/sub listeners, especially the process-pool command and cancel listeners.

Possible directions:

- set explicit redis-py `health_check_interval` / `socket_keepalive` for long-lived Redis clients;
- expose worker Redis socket keepalive settings via Bifrost config/env vars;
- document the required Redis/network settings for external workers running outside the Compose host;
- add a small reconnect/health metric so operators can distinguish harmless reconnect noise from actual Redis instability.
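
To make the first two directions concrete, here is a minimal sketch of how keepalive settings might be read from the environment and turned into redis-py keyword arguments. The `BIFROST_REDIS_*` variable names and the function name are assumptions made up for this example; `health_check_interval`, `socket_keepalive`, and `socket_keepalive_options` are existing redis-py client parameters.

```python
import os
import socket
from typing import Any, Dict, Mapping


def redis_keepalive_kwargs(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Build keepalive-related kwargs for a long-lived redis-py client.

    The BIFROST_REDIS_* names are hypothetical; Bifrost does not define them today.
    """
    enabled = env.get("BIFROST_REDIS_SOCKET_KEEPALIVE", "true").lower() in {"1", "true", "yes"}
    idle_seconds = int(env.get("BIFROST_REDIS_KEEPALIVE_IDLE", "60"))
    kwargs: Dict[str, Any] = {
        "health_check_interval": int(env.get("BIFROST_REDIS_HEALTH_CHECK_INTERVAL", "60")),
        "socket_keepalive": enabled,
    }
    # TCP_KEEPIDLE is platform-specific (present on Linux), so guard the lookup.
    options: Dict[int, int] = {}
    if hasattr(socket, "TCP_KEEPIDLE"):
        options[socket.TCP_KEEPIDLE] = idle_seconds
    if enabled and options:
        kwargs["socket_keepalive_options"] = options
    return kwargs


if __name__ == "__main__":
    print(redis_keepalive_kwargs({"BIFROST_REDIS_HEALTH_CHECK_INTERVAL": "30"}))
```

The returned dictionary could then be passed through to the client constructor, for example `redis.Redis.from_url(url, **redis_keepalive_kwargs())`, wherever the process-pool listeners create their connections.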

This is not blocking basic execution in our pilot, but it is noisy and could become operationally significant for managed worker deployments.
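
One way to quantify that noise is a small sliding-window reconnect counter that each listener updates whenever it reconnects. The class below is only a sketch under that assumption; `ReconnectCounter` and its methods are hypothetical names, and wiring it into the cancel and command listeners is left open.

```python
import time
from collections import deque
from typing import Callable, Deque


class ReconnectCounter:
    """Count listener reconnects in total and within a recent time window.

    Hypothetical sketch; not an existing Bifrost class.
    """

    def __init__(self, window_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._window = window_seconds
        self._clock = clock
        self._events: Deque[float] = deque()
        self.total = 0

    def record(self) -> None:
        """Call once each time a listener reconnects."""
        now = self._clock()
        self._events.append(now)
        self.total += 1
        self._prune(now)

    def recent(self) -> int:
        """Return the number of reconnects seen inside the window."""
        self._prune(self._clock())
        return len(self._events)

    def _prune(self, now: float) -> None:
        # Drop events that are older than the window.
        while self._events and now - self._events[0] > self._window:
            self._events.popleft()
```

A steady, low `recent()` value would be consistent with idle-connection resets like the ones above, while a sudden spike could point to actual Redis or network instability.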

