bug: cloudtool websocket connect times out on startup #309

@flexus-teams

Description

Original Logs

20260413 04:38:40.568 ctool [INFO] run_cloudtool_service_real going down!
20260413 04:38:40.568 ctool [ERROR] 🛑 caught exception TimeoutError:
Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/gql/transport/common/adapters/websockets.py", line 71, in connect
    self.websocket = await websockets.connect(self.url, **connect_args)
  File "/usr/local/lib/python3.13/site-packages/websockets/asyncio/client.py", line 470, in create_connection
    _, connection = await loop.create_connection(factory, **kwargs)
  File "/usr/local/lib/python3.13/asyncio/base_events.py", line 1146, in create_connection
    sock = await self._connect_sock(
  File "/usr/local/lib/python3.13/asyncio/selector_events.py", line 645, in sock_connect
    return await fut
TimeoutError

Error Summary

Multiple cloudtool-related pods logged the same websocket connect timeout during startup. Affected services: cloudtool web, cloudtool original, cloudtool eds-setup, and remote-mcp-worker. The pods were otherwise in the Running state, and the backend reports only scheduling pressure and delayed startup, not an ongoing crash loop.

Stacktrace

/usr/local/lib/python3.13/site-packages/gql/transport/common/adapters/websockets.py:71 connect

Root Cause

  • File: flexus_client_kit/ckit_cloudtool.py:469-470
  • Function: run_cloudtool_service_real
  • Why: the service opens a websocket subscription to the backend. Under startup pressure, the connection attempt times out inside websockets.connect(...); the traceback ends in asyncio's sock_connect, so the timeout occurs while the TCP connection is being established, before any websocket handshake. The surrounding code catches the failure in the top-level service loop and retries, so this is an operational connectivity/startup issue rather than an unhandled code crash.
  • Git blame: Oleg Klimov in acffd604 / 1c9b39b8
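
The retry behaviour described under Root Cause could be made more tolerant of startup pressure by waiting with exponential backoff and jitter between connection attempts. The following is a minimal sketch under stated assumptions, not the actual loop in ckit_cloudtool.py: `connect_with_backoff`, its parameters, and the `connect` callable are hypothetical names introduced for illustration.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def connect_with_backoff(
    connect: Callable[[], Awaitable[T]],  # hypothetical: any coroutine function that opens the connection
    *,
    attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Call `connect` until it succeeds, sleeping a random, growing delay after each failure."""
    for attempt in range(attempts):
        try:
            return await connect()
        except (asyncio.TimeoutError, OSError):
            # On Python 3.11+ asyncio.TimeoutError is the builtin TimeoutError,
            # which is itself an OSError subclass; both are listed for older versions.
            if attempt == attempts - 1:
                raise  # out of attempts: let the top-level loop see the failure
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: pods that start together do not all retry at the same instant.
            await sleep(random.uniform(0, cap))
    raise ValueError("attempts must be at least 1")
```

A caller would wrap whatever currently opens the websocket, for example `await connect_with_backoff(open_ws)` where `open_ws` is a hypothetical coroutine function. The jitter matters here because several pods hit the backend at roughly the same time during startup.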

Code Snippet

async with ws_client as ws:
    async for r in ws.subscribe(gql.gql(...)):
        ...

Affected

  • Pods: fservice-cloudtool-web, fservice-cloudtool-original, fservice-cloudtool-eds-setup, fservice-remote-mcp-worker
  • Namespace: flexus
  • Occurrences: multiple startup retries
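
To replace "multiple" with an actual number per pod, the caught-timeout lines can be tallied from collected logs. This is a rough sketch that assumes every log line follows the layout seen in Original Logs (date, time, logger name, bracketed level, message); `count_connect_timeouts` and the `logs_by_pod` mapping are hypothetical names, and how the logs are collected is left open.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Layout assumed from the log excerpt: "YYYYMMDD HH:MM:SS.mmm <logger> [LEVEL] <message>"
LOG_LINE = re.compile(r"^(\d{8} \d{2}:\d{2}:\d{2}\.\d{3}) (\S+) \[(\w+)\] (.*)$")


def count_connect_timeouts(logs_by_pod: Dict[str, Iterable[str]]) -> Counter:
    """Count ERROR lines mentioning TimeoutError for each pod."""
    counts: Counter = Counter()
    for pod, lines in logs_by_pod.items():
        for line in lines:
            m = LOG_LINE.match(line)
            # Traceback lines do not match the layout, so each caught
            # exception is counted once, on its "[ERROR]" line.
            if m and m.group(3) == "ERROR" and "TimeoutError" in m.group(4):
                counts[pod] += 1
    return counts
```

The input could be built, for instance, from the output of `kubectl logs <pod> -n flexus` for each pod listed above, split into lines.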

Metadata

Assignees

No one assigned

    Labels

    bug: Something isn't working
