feat: 20.5 grpc removal — binary protocol is sole transport #162

Merged
vieiralucas merged 15 commits into main from feat/20.5-grpc-removal
Apr 3, 2026
Conversation

@vieiralucas vieiralucas commented Mar 31, 2026

Summary

  • Removed gRPC server (tonic), admin/auth service layers, service.proto, admin.proto
  • Binary protocol is now the sole transport for all client, CLI, and cluster communication
  • Removed ~3,400 lines of gRPC code
  • Kept cluster.proto only for Raft entry serialization (prost still used by proto_convert.rs)
  • Fixed SDK consume to not block waiting for initial response (caused benchmark hang)
  • Bench server simplified to single port (no more dual gRPC + binary)

Net code change: -3,087 lines

Test plan

  • All 348 fila-core tests pass (including cluster)
  • All 15 fila-server binary protocol tests pass
  • All 27 fila-fibp codec tests pass
  • cargo clippy --workspace -- -D warnings clean
  • cargo fmt --check clean
  • E2E tests — CI will verify
  • Benchmarks — CI will verify (local hang being investigated)

🤖 Generated with Claude Code


Summary by cubic

Removed all gRPC code and moved all client, CLI, and cluster traffic to the binary protocol on a single port. Finalized leader-aware routing and a non-blocking SDK consume pipeline with application-level buffering and a no-deadlock guarantee.

  • Refactors

    • Removed tonic servers and admin/auth layers; deleted service.proto and admin.proto; dropped tonic, tower, and http deps.
    • Kept cluster.proto only for Raft serialization; switched to prost/prost-build with include!(env!("OUT_DIR")).
    • Replaced cluster RPCs with FIBP handlers; unified to one client/CLI address across fila-core, fila-server, fila-bench, fila-e2e, and fila-sdk.
    • Reintroduced consume leader redirect using leader-hint errors; simplified redirect by cloning ConnectOptions.
    • SDK consume path: two-stage delivery with an overflow buffer (TCP reads are never paused, which avoids response starvation); CancelConsume is sent on stream drop; added delivery_buffer_size (test-only) and delivery_internal_len() to validate the no-deadlock guarantee; the backpressure test now uses the consumer connection and also counts error overflow.
    • Validates ACL permission kinds (produce, consume, admin) in binary handlers; rejects invalid values with InvalidConfigValue.
    • CLI TLS/mTLS connection setup simplified and hardened (one root store; correct client-auth config).
  • Migration

    • Config: Use one listen_addr for client/CLI; remove all gRPC settings. Cluster still uses cluster.bind_addr.
    • CLI: Connect to host:port (no scheme). Same port as the SDK.
    • TLS/mTLS: Unchanged; served on the single port.
    • Auth: API key is validated during the connection handshake; unauthorized clients fail immediately.

Written for commit 24ee387. Summary will update on new commits.
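The two-stage delivery described above can be sketched roughly as follows. This is an illustrative model, not the actual fila-sdk internals: a bounded channel toward the consumer, an overflow deque behind it, and a shared atomic counter tracking the overflow size (the value `delivery_internal_len()` would report).

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::{sync_channel, TrySendError};

// Sketch of two-stage delivery: try_send into a bounded channel, spill to
// an overflow buffer when full, so the reader never blocks. Hypothetical
// names; the real SDK uses async channels.
fn drain_in_order() -> Vec<u32> {
    let (tx, rx) = sync_channel::<u32>(2);
    let mut overflow: VecDeque<u32> = VecDeque::new();
    let overflow_len = AtomicUsize::new(0);

    // Reader side: never block on the consumer; spill into overflow instead.
    for msg in 0..5 {
        if let Err(TrySendError::Full(msg)) = tx.try_send(msg) {
            overflow.push_back(msg);
            overflow_len.fetch_add(1, Ordering::Relaxed);
        }
    }

    // Consumer side: drain the channel first, then the overflow buffer.
    let mut received: Vec<u32> = rx.try_iter().collect();
    while let Some(msg) = overflow.pop_front() {
        overflow_len.fetch_sub(1, Ordering::Relaxed);
        received.push(msg);
    }
    received
}

fn main() {
    assert_eq!(drain_in_order(), vec![0, 1, 2, 3, 4]);
}
```

Because the reader spills rather than blocks, a stalled consumer cannot starve responses on the same connection, which is the no-deadlock property the backpressure test exercises.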

Benchmark Results (vs main baseline)

Baseline commit: 8d2d347 PR commit: e228ea9 Threshold: 10%

Benchmark Baseline Current Change Unit
compaction_active_p99 0.19 0.19 -1.7% ms
compaction_idle_p99 0.23 0.17 -22.3% ms 🟢
compaction_p99_delta -0.03 0.01 +145.5% ms 🔴
consumer_concurrency_100_throughput 3515.33 3547.00 +0.9% msg/s
consumer_concurrency_10_throughput 697.33 517.00 -25.9% msg/s 🔴
consumer_concurrency_1_throughput 77.00 75.00 -2.6% msg/s
e2e_latency_p50_light 40.65 40.75 +0.3% ms
e2e_latency_p95_light 40.76 40.84 +0.2% ms
e2e_latency_p99_light 40.82 40.91 +0.2% ms
enqueue_throughput_1kb 6763.05 6772.17 +0.1% msg/s
enqueue_throughput_1kb_mbps 6.60 6.61 +0.1% MB/s
fairness_accuracy_max_deviation 3.20 5.00 +56.3% % deviation 🔴
fairness_accuracy_tenant-1 3.20 5.00 +56.3% % deviation 🔴
fairness_accuracy_tenant-2 0.95 0.10 -89.5% % deviation 🟢
fairness_accuracy_tenant-3 0.40 0.30 -25.0% % deviation 🟢
fairness_accuracy_tenant-4 0.25 0.62 +150.0% % deviation 🔴
fairness_accuracy_tenant-5 1.00 0.88 -12.0% % deviation 🟢
fairness_overhead_fair_throughput 78.74 140.57 +78.5% msg/s 🟢
fairness_overhead_fifo_throughput 83.97 141.53 +68.5% msg/s 🟢
fairness_overhead_pct 6.53 -0.00 -100.0% % 🟢
key_cardinality_10_throughput 5639.16 5680.98 +0.7% msg/s
key_cardinality_10k_throughput 621.09 649.21 +4.5% msg/s
key_cardinality_1k_throughput 1362.04 1423.10 +4.5% msg/s
lua_on_enqueue_overhead_us 90.00 198.59 +120.7% us 🔴
lua_throughput_with_hook 75.65 143.53 +89.7% msg/s 🟢
memory_per_message_overhead 3245.67 2972.47 -8.4% bytes/msg
memory_rss_idle 177.73 177.08 -0.4% MB
memory_rss_loaded_10k 208.68 205.43 -1.6% MB

Summary: 6 regressed, 8 improved, 14 unchanged

⚠️ Performance regression detected — 6 metric(s) exceeded the 10% threshold


@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 47 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="_bmad-output/implementation-artifacts/epic-execution-state.yaml">

<violation number="1" location="_bmad-output/implementation-artifacts/epic-execution-state.yaml:35">
P3: Keep the story status at `review` until the PR is merged so epic-review can set it to done.

(Based on your team's feedback about keeping story status as `review` while PRs are open.) [FEEDBACK_USED]</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

  - id: "20.5"
    title: "gRPC Removal & Final Benchmarks"
-   status: pending
+   status: completed

@cubic-dev-ai cubic-dev-ai Bot Mar 31, 2026


P3: Keep the story status at review until the PR is merged so epic-review can set it to done.

(Based on your team's feedback about keeping story status as review while PRs are open.)


<file context>
@@ -32,9 +32,9 @@ stories:
   - id: "20.5"
     title: "gRPC Removal & Final Benchmarks"
-    status: pending
+    status: completed
     currentPhase: ""
-    branch: ""
</file context>
Suggested change
-    status: completed
+    status: review


@cubic-dev-ai cubic-dev-ai Bot left a comment


2 issues found across 3 files (changes from recent commits).


<file name="crates/fila-sdk/src/client.rs">

<violation number="1">
P2: `leader_client` is dropped when the async block returns, closing the TCP write half of the redirected connection. The returned `ConsumeStream` holds only the mpsc receiver — no reference to the client keeps it alive. If the server treats the write-side FIN as a disconnect, the consume stream stops receiving deliveries. Consider storing the `leader_client` alongside the stream (e.g., in a wrapper struct) so the connection stays fully open.</violation>

<violation number="2" location="crates/fila-sdk/src/client.rs:608">
P2: Stale oneshot sender left in `pending` after the timeout path. When the 100ms timeout expires (consume accepted), the code sets up `delivery_tx` and `consume_request_id` but never removes the oneshot from `inner.pending`. The receiver side is dropped, so this entry is a leak that persists until disconnect.</violation>
</file>
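The stale-oneshot leak flagged in violation 2 comes down to removing the pending entry on the timeout path. A minimal sketch of the fix, with hypothetical names standing in for the SDK's pending-response table:

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Sender};

// Sketch of pending-response cleanup: when the wait for the initial
// response times out, the entry must be removed from `pending`, otherwise
// the dead sender leaks until disconnect. Names are hypothetical.
fn pending_after_timeout() -> usize {
    let mut pending: HashMap<u64, Sender<String>> = HashMap::new();
    let (tx, rx) = channel::<String>();
    let request_id: u64 = 42;
    pending.insert(request_id, tx);

    // Simulated 100ms timeout: no initial response arrived in time.
    if rx.try_recv().is_err() {
        // The fix: drop the stale sender instead of leaving it in the map.
        pending.remove(&request_id);
    }
    pending.len()
}

fn main() {
    assert_eq!(pending_after_timeout(), 0);
}
```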


@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 3 files (changes from recent commits).


<file name="crates/fila-e2e/tests/cluster.rs">

<violation number="1" location="crates/fila-e2e/tests/cluster.rs:427">
P2: Reconnect resolves to a potentially stale leader index, which can make this redirect test flaky during leader changes.</violation>
</file>


@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 2 files (changes from recent commits).


<file name="crates/fila-bench/src/benchmarks/latency.rs">

<violation number="1" location="crates/fila-bench/src/benchmarks/latency.rs:59">
P2: Background pressure messages are never cleaned up, so each run accumulates persisted backlog and can distort later benchmark results.</violation>
</file>


@cubic-dev-ai cubic-dev-ai Bot left a comment


2 issues found across 1 file (changes from recent commits).


<file name="crates/fila-sdk/src/client.rs">

<violation number="1" location="crates/fila-sdk/src/client.rs:116">
P2: Doc comments say "Default: 1024" but `DEFAULT_DELIVERY_BUFFER_SIZE` is `256`. Update the docs to match the actual default.</violation>

<violation number="2" location="crates/fila-sdk/src/client.rs:119">
P1: `ConnectOptions` derives `Default`, so `delivery_buffer_size` defaults to `0`. When `consume()` calls `mpsc::channel(self.delivery_buffer_size)`, tokio panics because a buffer size of 0 is not allowed. Either give `delivery_buffer_size` a custom `Default` impl that uses `DEFAULT_DELIVERY_BUFFER_SIZE`, or add a `.max(1)` guard at the channel creation site.</violation>
</file>
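Violation 2's suggested fix — a custom `Default` impl so the derived zero never reaches the channel constructor — could look like this sketch (a hypothetical shape of the options struct, not the real fila-sdk definition):

```rust
// Hypothetical shape of ConnectOptions: a handwritten Default keeps
// delivery_buffer_size nonzero, since tokio's mpsc::channel panics on a
// buffer size of 0.
const DEFAULT_DELIVERY_BUFFER_SIZE: usize = 256;

#[derive(Debug, Clone)]
struct ConnectOptions {
    delivery_buffer_size: usize,
}

impl Default for ConnectOptions {
    fn default() -> Self {
        Self {
            delivery_buffer_size: DEFAULT_DELIVERY_BUFFER_SIZE,
        }
    }
}

fn default_capacity() -> usize {
    let opts = ConnectOptions::default();
    // Belt-and-braces guard at the channel-creation site as well.
    opts.delivery_buffer_size.max(1)
}

fn main() {
    assert_eq!(default_capacity(), 256);
}
```

The `.max(1)` guard alone would also prevent the panic, but the custom `Default` keeps the documented default and the actual behavior in agreement.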


@cubic-dev-ai cubic-dev-ai Bot left a comment


2 issues found across 1 file (changes from recent commits).


<file name="crates/fila-sdk/src/client.rs">

<violation number="1" location="crates/fila-sdk/src/client.rs:122">
P2: Doc comments for `delivery_buffer_size` and `with_delivery_buffer_size` say "Default: 1024" but the actual default is now 32. Update both doc comments to match.</violation>

<violation number="2" location="crates/fila-sdk/src/client.rs:662">
P1: The unbounded internal channel defeats TCP backpressure. The background reader never blocks on `internal_tx.send()`, so it continuously drains the TCP socket regardless of consumer speed. If the consumer is slow, the unbounded channel grows without limit — the server is never back-pressured. Consider using a bounded internal channel with a generous capacity, or having the reader itself send to the bounded user channel (which was the previous design but with `try_send` to avoid deadlock).</violation>
</file>
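The property a bounded internal channel restores is that a producer blocks when the consumer lags, so memory stays bounded and pressure propagates upstream. A small sketch using std threads and a bounded channel (standing in, loosely, for the background reader and the TCP socket):

```rust
use std::sync::mpsc::sync_channel;
use std::thread;
use std::time::Duration;

// Sketch: with a bounded channel, the producer (here standing in for the
// background reader) blocks once the buffer fills, instead of growing an
// unbounded queue while the consumer is slow.
fn run() -> Vec<u32> {
    let (tx, rx) = sync_channel::<u32>(4);
    let producer = thread::spawn(move || {
        for i in 0..8 {
            tx.send(i).unwrap(); // blocks once 4 items are in flight
        }
    });
    thread::sleep(Duration::from_millis(50)); // simulate a slow consumer
    let received: Vec<u32> = rx.iter().collect();
    producer.join().unwrap();
    received
}

fn main() {
    assert_eq!(run(), (0..8).collect::<Vec<u32>>());
}
```

The PR ultimately resolves the tension differently — unbounded internal channel plus an overflow counter — trading strict memory bounds for the no-deadlock guarantee.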


@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 1 file (changes from recent commits).


<file name="crates/fila-sdk/src/client.rs">

<violation number="1" location="crates/fila-sdk/src/client.rs:943">
P1: `notify_all_disconnected` borrows the internal sender with `ref` instead of taking it, so when `try_send` fails (channel full), the sender is never dropped. After the background reader exits, the forwarder task blocks on `recv()` forever (sender still alive), and the consumer stream hangs indefinitely. Use `.take()` to ensure the channel is closed on disconnect regardless of whether the error was delivered.</violation>
</file>
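The `.take()` fix the reviewer describes hinges on moving the sender out of the `Option` so it is dropped even when `try_send` fails. A self-contained sketch with hypothetical names:

```rust
use std::sync::mpsc::{sync_channel, SyncSender};

// Sketch of the `.take()` fix: moving the sender out of the Option drops
// it regardless of the try_send outcome, so the forwarder's recv() sees a
// closed channel instead of hanging forever. Names are hypothetical.
struct Inner {
    internal_tx: Option<SyncSender<String>>,
}

impl Inner {
    fn notify_all_disconnected(&mut self) {
        if let Some(tx) = self.internal_tx.take() {
            let _ = tx.try_send("disconnected".into());
            // `tx` is dropped here even if the channel was full.
        }
    }
}

fn channel_closed_after_disconnect() -> bool {
    let (tx, rx) = sync_channel::<String>(1);
    tx.try_send("fill".into()).unwrap(); // channel is now full
    let mut inner = Inner {
        internal_tx: Some(tx),
    };
    inner.notify_all_disconnected(); // try_send fails; sender still dropped
    let _ = rx.recv(); // drains "fill"
    rx.recv().is_err() // Err(disconnected): receiver unblocks, no hang
}

fn main() {
    assert!(channel_closed_after_disconnect());
}
```

With a `ref` borrow instead of `take()`, the sender would survive the failed `try_send`, keeping the channel open and leaving the forwarder blocked on `recv()` indefinitely.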


@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 1 file (changes from recent commits).


<file name="crates/fila-sdk/src/client.rs">

<violation number="1" location="crates/fila-sdk/src/client.rs:665">
P1: Using an unbounded internal delivery channel can cause unbounded memory growth when consume clients are slower than incoming delivery traffic.</violation>
</file>


@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 3 files (changes from recent commits).


<file name="crates/fila-sdk/src/client.rs">

<violation number="1" location="crates/fila-sdk/src/client.rs:697">
P1: Counter bookkeeping is unbalanced: decrementing on every forwarded item but incrementing only for successful delivery messages can underflow and permanently trigger backpressure, stalling the reader.</violation>
</file>

@vieiralucas vieiralucas force-pushed the feat/20.4-cli-cluster-migration branch from 2dbd6fc to b9c1f03 on April 2, 2026 at 11:29
Base automatically changed from feat/20.4-cli-cluster-migration to main April 3, 2026 19:10
@vieiralucas vieiralucas force-pushed the feat/20.5-grpc-removal branch from 09b99bf to 47a267b on April 3, 2026 at 19:27
…sure test

- add delivery_buffer_size to ConnectOptions (test-internals feature)
- add delivery_internal_len() to FilaClient to query overflow size
- track overflow length via shared AtomicUsize counter
- rewrite backpressure test to validate no-deadlock guarantee instead
  of bounded memory (TCP backpressure removed to prevent response starvation)
- simplify leader redirect to clone ConnectOptions instead of field-by-field

@cubic-dev-ai cubic-dev-ai Bot left a comment


2 issues found across 2 files (changes from recent commits).


<file name="crates/fila-sdk/tests/backpressure.rs">

<violation number="1" location="crates/fila-sdk/tests/backpressure.rs:105">
P2: This assertion uses the separate `producer` connection, so it does not validate that the stalled consumer connection can still perform operations. A deadlock on the consumer connection would still pass this test. Use the consumer client for the enqueue to actually test the same connection.</violation>
</file>

<file name="crates/fila-sdk/src/client.rs">

<violation number="1" location="crates/fila-sdk/src/client.rs:959">
P3: `delivery_overflow_len` can underflow because error-delivery overflows aren’t counted. If a delivery decode error overflows the channel, it’s pushed into the overflow deque without a matching `fetch_add`, but later `fetch_sub(drained, …)` subtracts it anyway. Add a counter increment when pushing error items into the overflow buffer.</violation>
</file>
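The underflow in violation 1 is a pairing bug: every push into the overflow buffer needs a matching increment, error items included, because the drain path subtracts the full drained count. A sketch of the balanced bookkeeping (hypothetical names):

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};

// Sketch of balanced counter bookkeeping: increment the shared counter
// for *every* item pushed into the overflow buffer, error or delivery
// alike, so the fetch_sub on drain can never underflow.
enum Item {
    Delivery(u64),
    Error(String),
}

fn push(buf: &mut VecDeque<Item>, len: &AtomicUsize, item: Item) {
    len.fetch_add(1, Ordering::Relaxed); // count both variants
    buf.push_back(item);
}

fn counter_after_drain() -> usize {
    let mut buf = VecDeque::new();
    let len = AtomicUsize::new(0);
    push(&mut buf, &len, Item::Delivery(1));
    push(&mut buf, &len, Item::Error("decode failed".into()));

    let drained = buf.len();
    buf.clear();
    len.fetch_sub(drained, Ordering::Relaxed); // balanced: 2 in, 2 out
    len.load(Ordering::Relaxed)
}

fn main() {
    assert_eq!(counter_after_drain(), 0);
}
```

Skipping the `fetch_add` for `Error` items would make `fetch_sub(drained, …)` wrap below zero on an `AtomicUsize`, which is the permanent-backpressure failure mode the reviewer warns about.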

@vieiralucas vieiralucas merged commit 47e3b6d into main Apr 3, 2026
8 checks passed
@vieiralucas vieiralucas deleted the feat/20.5-grpc-removal branch April 3, 2026 23:50