[BUG] ring and jaccl both fail to connect (error 60/65) on 4-node M3 Ultra cluster — IP network verified routable, full mesh confirmed #3755

@rogerstom-lgtm

Description
On a 4-node Mac Studio M3 Ultra cluster with a verified full Thunderbolt 5 mesh, mx.distributed.init() fails to establish connections on all three transports (jaccl and jaccl-ring over Thunderbolt, and plain ring over ethernet). All three fail with the same class of error, a connection timeout (error 60) or EHOSTUNREACH (error 65), even though the underlying IP network is confirmed fully routable between all node pairs. The failure reproduces on a trivial 10-element all_sum, so it is not memory-, model-, or load-related.

Environment

4× Mac Studio M3 Ultra (Mac15,14), 256 GB each
macOS 26.4.1 (build 25E253) — uniform across all nodes
MLX 0.31.2 — uniform across all nodes
Python 3.12 (uv-managed venv, per-node)
Full Thunderbolt 5 mesh: all 6 cables present, verified
RDMA enabled via Recovery (rdma_ctl enable); ibv_devices lists all devices
No bridge0 / Thunderbolt Bridge on any node
macOS application firewall disabled on all nodes (socketfilterfw --getglobalstate = disabled on all 4)
EXO present but its io.exo.networksetup daemon disabled; no virtual Thunderbolt interfaces present (confirmed)

What's verified working

mlx.distributed_config --over thunderbolt --dot detects a complete full mesh (all 6 pairs):

a--d [en3/en5] a--c [en4/en5] a--b [en5/en5]
b--d [en3/en4] b--c [en4/en4] c--d [en3/en3]

mlx.distributed_config --auto-setup (both jaccl and ring) completes without error and writes hostfiles
ibv_devinfo shows 3× PORT_ACTIVE per node (full mesh)
IP reachability verified on every relevant path: c1↔c4 and c3↔c4 ping with 0% loss over both the Thunderbolt point-to-point links (192.168.0.x) and the 10GbE management network (10.0.0.x); route get confirms expected interfaces

Failure 1 — JACCL (RDMA over Thunderbolt)
mx.distributed.init(backend="jaccl") fails at init:
[jaccl] Connection attempt 0 waiting 1000 ms
... (backoff 1s/2s/4s/8s)
RuntimeError: [jaccl] Couldn't connect (error: 60)
Persists across: hostfile regeneration via --auto-setup, full reboot of all nodes (to clear any PD exhaustion per mlx-lm#955), re-confirming all ports PORT_ACTIVE, and a clean-boot 10-element all_sum. Note: one TB link was found PORT_DOWN after a reboot and recovered by reseating the cable (all ports PORT_ACTIVE afterward), but JACCL still failed identically after that.
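For anyone reading the error numbers: assuming the value MLX prints is the raw errno from the failed connection attempt (not confirmed against the MLX source), it has to be decoded with the Darwin table, because the same numbers mean something else on Linux. A small lookup sketch, hardcoded on purpose since Python's errno module reflects whichever platform it runs on:

```python
# Darwin (macOS) errno values relevant to connection failures.
# Hardcoded because these numbers differ on Linux; only a few
# neighbouring entries are listed, not the full table.
DARWIN_ERRNO = {
    60: ("ETIMEDOUT", "Operation timed out"),
    61: ("ECONNREFUSED", "Connection refused"),
    64: ("EHOSTDOWN", "Host is down"),
    65: ("EHOSTUNREACH", "No route to host"),
}


def describe_darwin_errno(code: int) -> str:
    """Return a readable description of a Darwin errno value."""
    name, text = DARWIN_ERRNO.get(code, ("UNKNOWN", "not in this table"))
    return f"error {code}: {name} ({text})"


for code in (60, 65):
    print(describe_darwin_errno(code))
```

Read this way, the final error: 60 is the overall timeout after the retries are exhausted, while error: 65 is what the individual connection attempts report.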
Failure 2 — ring over ethernet (the decisive case)
Switching to plain TCP ring over the flat, routable 10GbE network (--over ethernet; hostfile lists 10.0.0.x addresses that all mutually ping) fails identically:
[ring] Rank 0 accepting
[ring] Rank 0 connecting to 1
[ring] Rank 1 accepting
[ring] Rank 1 connecting to 2
[ring] Rank 2 accepting
[ring] Rank 2 connecting to 3
[ring] Attempt 0 waiting 1000 ms (error: 65 )
[ring] Rank 3 connecting to 0
[ring] Rank 3 accepting
[ring] Attempt 1 waiting 2000 ms (error: 65 )
... (continues backoff)
RuntimeError: [ring] Couldn't connect (error: 60)
The break is consistently at rank 2 → rank 3 and at the ring-closing rank 3 → rank 0, with error: 65 (EHOSTUNREACH), even though c3→c4 and c4→c1 both ping cleanly at the IP level. Ranks 0→1 and 1→2 connect fine.
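Since ping only exercises ICMP, a successful ping does not prove that a TCP connect() from the Python process succeeds on the same path. A minimal probe that could be run on each node against its ring neighbour is sketched below. The function name tcp_probe is made up for this example, and the port is a placeholder: which port the ring backend actually listens on is not stated in this report.

```python
import errno
import socket


def tcp_probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt one TCP connection; return "ok" or a symbolic failure name."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.timeout:
        return "timeout"
    except socket.gaierror:
        return "name resolution failed"
    except OSError as exc:
        # Map the numeric errno to its name on the platform running the probe.
        return errno.errorcode.get(exc.errno, repr(exc))


if __name__ == "__main__":
    # Self-check on loopback so the sketch runs anywhere: open a listener
    # on an OS-assigned port and probe it.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]
    print("loopback probe:", tcp_probe("127.0.0.1", port))
    listener.close()
```

To use it on the cluster, start a plain listener on the target node (c4, say) and call tcp_probe from c3 with c4's 10.0.0.x address. An EHOSTUNREACH from this probe while ping succeeds would point at something outside MLX; an "ok" would point back at the ring backend.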
Reproduction (minimal)
import mlx.core as mx
world = mx.distributed.init(backend="ring")
x = mx.distributed.all_sum(mx.ones(10))
mx.eval(x)
print(f"rank {world.rank()}/{world.size()} -> {x[0].item()}")
mlx.launch --verbose --backend ring --hostfile ring-eth.json --python /path/to/python ring_test.py
ring-eth.json: 4 hosts, each with its single 10.0.0.x address, generated via mlx.distributed_config --over ethernet --backend ring --auto-setup.
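To make the failing hops easier to map back to addresses, the sketch below lists the source and destination of each ring connection from a hostfile. It rests on two assumptions: that the hostfile is a JSON list of objects with "ssh" and "ips" keys, and that rank i connects to rank (i + 1) mod N, which matches the "Rank 3 connecting to 0" line in the log above. The host names and final octets in SAMPLE_HOSTFILE are placeholders, not the real contents of ring-eth.json.

```python
import json

# Placeholder hostfile contents; the real addresses are not shown here.
SAMPLE_HOSTFILE = """
[
  {"ssh": "c1", "ips": ["10.0.0.1"]},
  {"ssh": "c2", "ips": ["10.0.0.2"]},
  {"ssh": "c3", "ips": ["10.0.0.3"]},
  {"ssh": "c4", "ips": ["10.0.0.4"]}
]
"""


def ring_hops(hostfile_text: str) -> list[tuple[int, int, str, str]]:
    """Return (src_rank, dst_rank, src_ip, dst_ip) for each ring connection."""
    hosts = json.loads(hostfile_text)
    n = len(hosts)
    hops = []
    for rank, host in enumerate(hosts):
        nxt = (rank + 1) % n  # the last rank wraps around to rank 0
        hops.append((rank, nxt, host["ips"][0], hosts[nxt]["ips"][0]))
    return hops


for src, dst, src_ip, dst_ip in ring_hops(SAMPLE_HOSTFILE):
    print(f"rank {src} ({src_ip}) -> rank {dst} ({dst_ip})")
```

With the real hostfile substituted in, the hops for rank 2 → 3 and rank 3 → 0 give the exact address pairs to test at the TCP level.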
