Redis pub/sub sockets reset when externalizing workers to managed container services #242

@MTG-Thomas

Summary

While piloting an external Bifrost worker in Azure Container Apps, long-lived Redis pub/sub listener sockets were reset periodically even though normal job execution continued to work.

This seems worth hardening in Bifrost because externalized workers may run across managed container/network boundaries where idle TCP connections are handled more strictly than on a local Docker Compose network.

Observed behavior

  • Compose-host workers stayed connected to Redis without the same errors.
  • The external ACA worker could connect to RabbitMQ, consume jobs, run workflows, and report results.
  • The external worker repeatedly logged Redis listener reconnects for process-pool control channels:
    • Cancel listener error: Error while reading from <redis-ip>:6379 : (104, 'Connection reset by peer'); reconnecting in 1s
    • Command listener error: Error while reading from <redis-ip>:6379 : (104, 'Connection reset by peer'); reconnecting in 1s
  • Redis server health looked clean during the resets: no rejected clients, no blocked clients, no Redis crash/restart, and RDB saves completed successfully.

Environment shape

  • Redis remained on the Compose VM.
  • One worker ran externally in Azure Container Apps over private network connectivity to the VM.
  • Redis server default tcp-keepalive was 300 seconds.
  • The relevant worker process-pool code appears to use long-lived pub/sub sockets without an explicit socket keepalive or health-check interval.

Local mitigation that appears appropriate for the pilot

We are testing two infrastructure-level mitigations:

  • Redis server: lower tcp-keepalive to 60 seconds.
  • External worker Redis URL: add redis-py options via query string:
    • health_check_interval=60
    • socket_keepalive=True
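
For reference, a minimal sketch of how the two query-string options above could be appended to an existing worker Redis URL. The helper name `with_keepalive_options` and the host `redis.example.internal` are placeholders for illustration, not part of Bifrost; only the two option names come from the mitigation list.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_keepalive_options(redis_url, health_check_interval=60, socket_keepalive=True):
    """Return redis_url with the two redis-py query options set (hypothetical helper)."""
    parts = urlsplit(redis_url)
    # Preserve any options already on the URL; the two below take precedence.
    query = dict(parse_qsl(parts.query))
    query["health_check_interval"] = str(health_check_interval)
    query["socket_keepalive"] = "True" if socket_keepalive else "False"
    return urlunsplit(parts._replace(query=urlencode(query)))


# Placeholder host; substitute the real Redis address.
url = with_keepalive_options("redis://redis.example.internal:6379/0")
print(url)
```

The resulting string can then be handed to `redis.Redis.from_url(url)`, which parses `health_check_interval` as an integer and `socket_keepalive` as a boolean.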

Request

Please consider making Redis connection keepalive/health-check behavior first-class for Bifrost worker pub/sub listeners, especially the process-pool command and cancel listeners.

Possible directions:

  • set explicit redis-py health_check_interval / socket_keepalive for long-lived Redis clients;
  • expose worker Redis socket keepalive settings via Bifrost config/env vars;
  • document the required Redis/network settings for external workers running outside the Compose host;
  • add a small reconnect/health metric so operators can distinguish harmless reconnect noise from actual Redis instability.
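
To make the first, second, and last directions concrete, here is a rough sketch, not a proposal for the actual implementation. The `BIFROST_REDIS_*` environment variable names, `listener_redis_kwargs`, and `ReconnectTracker` are all hypothetical. The keyword arguments `health_check_interval`, `socket_keepalive`, and `socket_keepalive_options` are existing redis-py client options.

```python
import os
import socket
import time
from collections import deque


def listener_redis_kwargs(env=os.environ):
    """Build redis-py keyword arguments for long-lived listener clients.

    The BIFROST_REDIS_* variable names are hypothetical, not existing settings.
    """
    kwargs = {
        "health_check_interval": int(env.get("BIFROST_REDIS_HEALTH_CHECK_INTERVAL", "60")),
        "socket_keepalive": env.get("BIFROST_REDIS_SOCKET_KEEPALIVE", "true").lower() == "true",
    }
    idle = env.get("BIFROST_REDIS_KEEPALIVE_IDLE")
    # TCP_KEEPIDLE is not available on every platform, so set it only where it exists.
    if idle and hasattr(socket, "TCP_KEEPIDLE"):
        kwargs["socket_keepalive_options"] = {socket.TCP_KEEPIDLE: int(idle)}
    return kwargs


class ReconnectTracker:
    """Count listener reconnects inside a sliding time window."""

    def __init__(self, window_seconds=300.0, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.clock = clock
        self._events = {}  # listener name -> deque of reconnect timestamps

    def record(self, listener):
        """Note one reconnect for the named listener."""
        self._events.setdefault(listener, deque()).append(self.clock())

    def count(self, listener):
        """Return how many reconnects fall inside the current window."""
        events = self._events.get(listener, deque())
        cutoff = self.clock() - self.window_seconds
        while events and events[0] < cutoff:
            events.popleft()
        return len(events)
```

The returned dictionary could be passed as `redis.Redis(host=..., **listener_redis_kwargs())`, and `record("cancel")` or `record("command")` could be called at the point where the listeners currently log "reconnecting in 1s". A steady low count would then suggest idle-connection resets, while a spike would point toward real Redis instability.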

This is not blocking basic execution in our pilot, but it is noisy and could become operationally significant for managed worker deployments.
