Skip to content

fix: dial schedulers and peers over https when TLS is configured#1859

Open
rtainaan wants to merge 1 commit into
dragonflyoss:mainfrom
rtainaan:fix/scheduler-peer-dial-tls-scheme
Open

fix: dial schedulers and peers over https when TLS is configured#1859
rtainaan wants to merge 1 commit into
dragonflyoss:mainfrom
rtainaan:fix/scheduler-peer-dial-tls-scheme

Conversation

@rtainaan
Copy link
Copy Markdown

Description

The scheduler and peer dial paths build their gRPC channel from an http:// URL, so tonic dials cleartext even when a TLS config is attached. A dfdaemon configured for mTLS then connects in cleartext and fails against a TLS-enforcing server (tls: first record does not look like a TLS handshake), and the host-announce/delete paths never attach a TLS config at all. This selects the https scheme when client TLS is configured and routes the scheduler dials through a single helper so announce and delete get the same mTLS handling as the request path.

Motivation and Context

With mTLS enabled between dfdaemon and the scheduler/peers, the TLS config is loaded but ignored on these paths today — the scheduler and upload-client certificates have no effect and traffic falls back to cleartext.

tonic derives a channel's transport security from the URL scheme, not from
whether tls_config() was set: an "http://" endpoint dials cleartext even
with a TLS config attached. The scheduler- and peer-dialing paths hardcoded
the "http" scheme, so a client configured with a CA cert connected in
cleartext and the handshake failed against a TLS-enforcing server. The
host-announce and host-delete paths went further and never attached a TLS
config at all.

Add a net::scheme_for_tls helper that selects the scheme from a fully
configured client TLS identity (ca_cert, cert, and key all present, the same
predicate load_client_tls_config uses), and apply it at each dial site: the
scheduler address in dynconfig, and the peer upload dials in parent_selector
and piece_collector. Gating on ca_cert alone would emit an https URL with no
TLS config and fail the dial.

For the scheduler gRPC client, consolidate the four channel-construction
sites into one connect_to_scheduler() helper that derives the scheme the same
way and applies load_client_tls_config() uniformly, so the announce and
delete paths get the same mTLS treatment as the request path.

Tests:
  cargo fmt --all -- --check                       clean
  cargo clippy --all --all-targets -- -D warnings  clean
  cargo test -p dragonfly-client-util net::tests   6 passed

Signed-off-by: Raymond (Kai) Tainaan <282025710+rtainaan@users.noreply.github.com>
@pmady
Copy link
Copy Markdown
Contributor

pmady commented May 31, 2026

good catch — the announce and delete paths at scheduler.rs L188/L242/L301 never loaded TLS config at all, which is the more serious side of this. did you test end to end with mTLS between dfdaemon and scheduler? the scheme switch looks right from the code but just want to make sure tonic picks up the https:// correctly with the config attached

@rtainaan
Copy link
Copy Markdown
Author

Yeah I’m surprised no one noticed before. I was verifying end to end mtls when I caught it.

And yep, tested this as a custom patch, dfdaemon handshakes with the scheduler now, fails without it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants