feat: 20.5 grpc removal — binary protocol is sole transport #162

Merged
vieiralucas merged 15 commits into main from feat/20.5-grpc-removal
Apr 3, 2026
Conversation

@vieiralucas vieiralucas commented Mar 31, 2026

Summary

  • Removed gRPC server (tonic), admin/auth service layers, service.proto, admin.proto
  • Binary protocol is now the sole transport for all client, CLI, and cluster communication
  • Removed ~3,400 lines of gRPC code
  • Kept cluster.proto only for Raft entry serialization (prost still used by proto_convert.rs)
  • Fixed SDK consume to not block waiting for initial response (caused benchmark hang)
  • Bench server simplified to single port (no more dual gRPC + binary)

Net code change: -3,087 lines

Test plan

  • All 348 fila-core tests pass (including cluster)
  • All 15 fila-server binary protocol tests pass
  • All 27 fila-fibp codec tests pass
  • cargo clippy --workspace -- -D warnings clean
  • cargo fmt --check clean
  • E2E tests — CI will verify
  • Benchmarks — CI will verify (local hang being investigated)

🤖 Generated with Claude Code


Summary by cubic

Removed all gRPC code and moved all client, CLI, and cluster traffic to the binary protocol on a single port. Finalized leader-aware routing and a non-blocking SDK consume pipeline with application-level buffering and a no-deadlock guarantee.

  • Refactors

    • Removed tonic servers and admin/auth layers; deleted service.proto and admin.proto; dropped tonic, tower, and http deps.
    • Kept cluster.proto only for Raft serialization; switched to prost/prost-build with include!(env!("OUT_DIR")).
    • Replaced cluster RPCs with FIBP handlers; unified to one client/CLI address across fila-core, fila-server, fila-bench, fila-e2e, and fila-sdk.
    • Reintroduced consume leader redirect using leader-hint errors; simplified redirect by cloning ConnectOptions.
    • SDK consume path: two-stage delivery with an overflow buffer (TCP reads are never paused, which avoids response starvation); CancelConsume is sent on stream drop; added delivery_buffer_size (test-only) and delivery_internal_len() to validate the no-deadlock guarantee; the backpressure test now uses the consumer connection and also counts error overflow.
    • Validates ACL permission kinds (produce, consume, admin) in binary handlers; rejects invalid values with InvalidConfigValue.
    • CLI TLS/mTLS connection setup simplified and hardened (one root store; correct client-auth config).
  • Migration

    • Config: Use one listen_addr for client/CLI; remove all gRPC settings. Cluster still uses cluster.bind_addr.
    • CLI: Connect to host:port (no scheme). Same port as the SDK.
    • TLS/mTLS: Unchanged; served on the single port.
    • Auth: API key is validated during the connection handshake; unauthorized clients fail immediately.

Written for commit 24ee387. Summary will update on new commits.
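The two-stage delivery described above can be sketched roughly as follows. This is an illustrative model, not the actual fila-sdk internals: a bounded channel toward the consumer, an overflow deque behind it, and a shared atomic counter tracking the overflow size (the value `delivery_internal_len()` would report).

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::{sync_channel, TrySendError};

// Sketch of two-stage delivery: try_send into a bounded channel, spill to
// an overflow buffer when full, so the reader never blocks. Hypothetical
// names; the real SDK uses async channels.
fn drain_in_order() -> Vec<u32> {
    let (tx, rx) = sync_channel::<u32>(2);
    let mut overflow: VecDeque<u32> = VecDeque::new();
    let overflow_len = AtomicUsize::new(0);

    // Reader side: never block on the consumer; spill into overflow instead.
    for msg in 0..5 {
        if let Err(TrySendError::Full(msg)) = tx.try_send(msg) {
            overflow.push_back(msg);
            overflow_len.fetch_add(1, Ordering::Relaxed);
        }
    }

    // Consumer side: drain the channel first, then the overflow buffer.
    let mut received: Vec<u32> = rx.try_iter().collect();
    while let Some(msg) = overflow.pop_front() {
        overflow_len.fetch_sub(1, Ordering::Relaxed);
        received.push(msg);
    }
    received
}

fn main() {
    assert_eq!(drain_in_order(), vec![0, 1, 2, 3, 4]);
}
```

Because the reader spills rather than blocks, a stalled consumer cannot starve responses on the same connection, which is the no-deadlock property the backpressure test exercises.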

Benchmark Results (vs main baseline)

Baseline commit: 8d2d347 PR commit: e228ea9 Threshold: 10%

Benchmark Baseline Current Change Unit
compaction_active_p99 0.19 0.19 -1.7% ms
compaction_idle_p99 0.23 0.17 -22.3% ms 🟢
compaction_p99_delta -0.03 0.01 +145.5% ms 🔴
consumer_concurrency_100_throughput 3515.33 3547.00 +0.9% msg/s
consumer_concurrency_10_throughput 697.33 517.00 -25.9% msg/s 🔴
consumer_concurrency_1_throughput 77.00 75.00 -2.6% msg/s
e2e_latency_p50_light 40.65 40.75 +0.3% ms
e2e_latency_p95_light 40.76 40.84 +0.2% ms
e2e_latency_p99_light 40.82 40.91 +0.2% ms
enqueue_throughput_1kb 6763.05 6772.17 +0.1% msg/s
enqueue_throughput_1kb_mbps 6.60 6.61 +0.1% MB/s
fairness_accuracy_max_deviation 3.20 5.00 +56.3% % deviation 🔴
fairness_accuracy_tenant-1 3.20 5.00 +56.3% % deviation 🔴
fairness_accuracy_tenant-2 0.95 0.10 -89.5% % deviation 🟢
fairness_accuracy_tenant-3 0.40 0.30 -25.0% % deviation 🟢
fairness_accuracy_tenant-4 0.25 0.62 +150.0% % deviation 🔴
fairness_accuracy_tenant-5 1.00 0.88 -12.0% % deviation 🟢
fairness_overhead_fair_throughput 78.74 140.57 +78.5% msg/s 🟢
fairness_overhead_fifo_throughput 83.97 141.53 +68.5% msg/s 🟢
fairness_overhead_pct 6.53 -0.00 -100.0% % 🟢
key_cardinality_10_throughput 5639.16 5680.98 +0.7% msg/s
key_cardinality_10k_throughput 621.09 649.21 +4.5% msg/s
key_cardinality_1k_throughput 1362.04 1423.10 +4.5% msg/s
lua_on_enqueue_overhead_us 90.00 198.59 +120.7% us 🔴
lua_throughput_with_hook 75.65 143.53 +89.7% msg/s 🟢
memory_per_message_overhead 3245.67 2972.47 -8.4% bytes/msg
memory_rss_idle 177.73 177.08 -0.4% MB
memory_rss_loaded_10k 208.68 205.43 -1.6% MB

Summary: 6 regressed, 8 improved, 14 unchanged

⚠️ Performance regression detected — 6 metric(s) exceeded the 10% threshold


@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 47 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="_bmad-output/implementation-artifacts/epic-execution-state.yaml">

<violation number="1" location="_bmad-output/implementation-artifacts/epic-execution-state.yaml:35">
P3: Keep the story status at `review` until the PR is merged so epic-review can set it to done.

(Based on your team's feedback about keeping story status as `review` while PRs are open.) [FEEDBACK_USED]</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

  - id: "20.5"
    title: "gRPC Removal & Final Benchmarks"
-   status: pending
+   status: completed

@cubic-dev-ai cubic-dev-ai Bot Mar 31, 2026


P3: Keep the story status at review until the PR is merged so epic-review can set it to done.

(Based on your team's feedback about keeping story status as review while PRs are open.)


<file context>
@@ -32,9 +32,9 @@ stories:
   - id: "20.5"
     title: "gRPC Removal & Final Benchmarks"
-    status: pending
+    status: completed
     currentPhase: ""
-    branch: ""
</file context>
Suggested change
-    status: completed
+    status: review


@cubic-dev-ai cubic-dev-ai Bot left a comment


2 issues found across 3 files (changes from recent commits).


<file name="crates/fila-sdk/src/client.rs">

<violation number="1">
P2: `leader_client` is dropped when the async block returns, closing the TCP write half of the redirected connection. The returned `ConsumeStream` holds only the mpsc receiver — no reference to the client keeps it alive. If the server treats the write-side FIN as a disconnect, the consume stream stops receiving deliveries. Consider storing the `leader_client` alongside the stream (e.g., in a wrapper struct) so the connection stays fully open.</violation>

<violation number="2" location="crates/fila-sdk/src/client.rs:608">
P2: Stale oneshot sender left in `pending` after the timeout path. When the 100ms timeout expires (consume accepted), the code sets up `delivery_tx` and `consume_request_id` but never removes the oneshot from `inner.pending`. The receiver side is dropped, so this entry is a leak that persists until disconnect.</violation>
</file>
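The stale-oneshot leak flagged in violation 2 comes down to removing the pending entry on the timeout path. A minimal sketch of the fix, with hypothetical names standing in for the SDK's pending-response table:

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Sender};

// Sketch of pending-response cleanup: when the wait for the initial
// response times out, the entry must be removed from `pending`, otherwise
// the dead sender leaks until disconnect. Names are hypothetical.
fn pending_after_timeout() -> usize {
    let mut pending: HashMap<u64, Sender<String>> = HashMap::new();
    let (tx, rx) = channel::<String>();
    let request_id: u64 = 42;
    pending.insert(request_id, tx);

    // Simulated 100ms timeout: no initial response arrived in time.
    if rx.try_recv().is_err() {
        // The fix: drop the stale sender instead of leaving it in the map.
        pending.remove(&request_id);
    }
    pending.len()
}

fn main() {
    assert_eq!(pending_after_timeout(), 0);
}
```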


@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 3 files (changes from recent commits).


<file name="crates/fila-e2e/tests/cluster.rs">

<violation number="1" location="crates/fila-e2e/tests/cluster.rs:427">
P2: Reconnect resolves to a potentially stale leader index, which can make this redirect test flaky during leader changes.</violation>
</file>


@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 2 files (changes from recent commits).


<file name="crates/fila-bench/src/benchmarks/latency.rs">

<violation number="1" location="crates/fila-bench/src/benchmarks/latency.rs:59">
P2: Background pressure messages are never cleaned up, so each run accumulates persisted backlog and can distort later benchmark results.</violation>
</file>


@cubic-dev-ai cubic-dev-ai Bot left a comment


2 issues found across 1 file (changes from recent commits).


<file name="crates/fila-sdk/src/client.rs">

<violation number="1" location="crates/fila-sdk/src/client.rs:116">
P2: Doc comments say "Default: 1024" but `DEFAULT_DELIVERY_BUFFER_SIZE` is `256`. Update the docs to match the actual default.</violation>

<violation number="2" location="crates/fila-sdk/src/client.rs:119">
P1: `ConnectOptions` derives `Default`, so `delivery_buffer_size` defaults to `0`. When `consume()` calls `mpsc::channel(self.delivery_buffer_size)`, tokio panics because a buffer size of 0 is not allowed. Either give `delivery_buffer_size` a custom `Default` impl that uses `DEFAULT_DELIVERY_BUFFER_SIZE`, or add a `.max(1)` guard at the channel creation site.</violation>
</file>
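Violation 2's suggested fix — a custom `Default` impl so the derived zero never reaches the channel constructor — could look like this sketch (a hypothetical shape of the options struct, not the real fila-sdk definition):

```rust
// Hypothetical shape of ConnectOptions: a handwritten Default keeps
// delivery_buffer_size nonzero, since tokio's mpsc::channel panics on a
// buffer size of 0.
const DEFAULT_DELIVERY_BUFFER_SIZE: usize = 256;

#[derive(Debug, Clone)]
struct ConnectOptions {
    delivery_buffer_size: usize,
}

impl Default for ConnectOptions {
    fn default() -> Self {
        Self {
            delivery_buffer_size: DEFAULT_DELIVERY_BUFFER_SIZE,
        }
    }
}

fn default_capacity() -> usize {
    let opts = ConnectOptions::default();
    // Belt-and-braces guard at the channel-creation site as well.
    opts.delivery_buffer_size.max(1)
}

fn main() {
    assert_eq!(default_capacity(), 256);
}
```

The `.max(1)` guard alone would also prevent the panic, but the custom `Default` keeps the documented default and the actual behavior in agreement.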


@cubic-dev-ai cubic-dev-ai Bot left a comment


2 issues found across 1 file (changes from recent commits).


<file name="crates/fila-sdk/src/client.rs">

<violation number="1" location="crates/fila-sdk/src/client.rs:122">
P2: Doc comments for `delivery_buffer_size` and `with_delivery_buffer_size` say "Default: 1024" but the actual default is now 32. Update both doc comments to match.</violation>

<violation number="2" location="crates/fila-sdk/src/client.rs:662">
P1: The unbounded internal channel defeats TCP backpressure. The background reader never blocks on `internal_tx.send()`, so it continuously drains the TCP socket regardless of consumer speed. If the consumer is slow, the unbounded channel grows without limit — the server is never back-pressured. Consider using a bounded internal channel with a generous capacity, or having the reader itself send to the bounded user channel (which was the previous design but with `try_send` to avoid deadlock).</violation>
</file>
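The property a bounded internal channel restores is that a producer blocks when the consumer lags, so memory stays bounded and pressure propagates upstream. A small sketch using std threads and a bounded channel (standing in, loosely, for the background reader and the TCP socket):

```rust
use std::sync::mpsc::sync_channel;
use std::thread;
use std::time::Duration;

// Sketch: with a bounded channel, the producer (here standing in for the
// background reader) blocks once the buffer fills, instead of growing an
// unbounded queue while the consumer is slow.
fn run() -> Vec<u32> {
    let (tx, rx) = sync_channel::<u32>(4);
    let producer = thread::spawn(move || {
        for i in 0..8 {
            tx.send(i).unwrap(); // blocks once 4 items are in flight
        }
    });
    thread::sleep(Duration::from_millis(50)); // simulate a slow consumer
    let received: Vec<u32> = rx.iter().collect();
    producer.join().unwrap();
    received
}

fn main() {
    assert_eq!(run(), (0..8).collect::<Vec<u32>>());
}
```

The PR ultimately resolves the tension differently — unbounded internal channel plus an overflow counter — trading strict memory bounds for the no-deadlock guarantee.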


@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 1 file (changes from recent commits).


<file name="crates/fila-sdk/src/client.rs">

<violation number="1" location="crates/fila-sdk/src/client.rs:943">
P1: `notify_all_disconnected` borrows the internal sender with `ref` instead of taking it, so when `try_send` fails (channel full), the sender is never dropped. After the background reader exits, the forwarder task blocks on `recv()` forever (sender still alive), and the consumer stream hangs indefinitely. Use `.take()` to ensure the channel is closed on disconnect regardless of whether the error was delivered.</violation>
</file>
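The `.take()` fix the reviewer describes hinges on moving the sender out of the `Option` so it is dropped even when `try_send` fails. A self-contained sketch with hypothetical names:

```rust
use std::sync::mpsc::{sync_channel, SyncSender};

// Sketch of the `.take()` fix: moving the sender out of the Option drops
// it regardless of the try_send outcome, so the forwarder's recv() sees a
// closed channel instead of hanging forever. Names are hypothetical.
struct Inner {
    internal_tx: Option<SyncSender<String>>,
}

impl Inner {
    fn notify_all_disconnected(&mut self) {
        if let Some(tx) = self.internal_tx.take() {
            let _ = tx.try_send("disconnected".into());
            // `tx` is dropped here even if the channel was full.
        }
    }
}

fn channel_closed_after_disconnect() -> bool {
    let (tx, rx) = sync_channel::<String>(1);
    tx.try_send("fill".into()).unwrap(); // channel is now full
    let mut inner = Inner {
        internal_tx: Some(tx),
    };
    inner.notify_all_disconnected(); // try_send fails; sender still dropped
    let _ = rx.recv(); // drains "fill"
    rx.recv().is_err() // Err(disconnected): receiver unblocks, no hang
}

fn main() {
    assert!(channel_closed_after_disconnect());
}
```

With a `ref` borrow instead of `take()`, the sender would survive the failed `try_send`, keeping the channel open and leaving the forwarder blocked on `recv()` indefinitely.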


@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 1 file (changes from recent commits).


<file name="crates/fila-sdk/src/client.rs">

<violation number="1" location="crates/fila-sdk/src/client.rs:665">
P1: Using an unbounded internal delivery channel can cause unbounded memory growth when consume clients are slower than incoming delivery traffic.</violation>
</file>


@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 3 files (changes from recent commits).


<file name="crates/fila-sdk/src/client.rs">

<violation number="1" location="crates/fila-sdk/src/client.rs:697">
P1: Counter bookkeeping is unbalanced: decrementing on every forwarded item but incrementing only for successful delivery messages can underflow and permanently trigger backpressure, stalling the reader.</violation>
</file>

@vieiralucas vieiralucas force-pushed the feat/20.4-cli-cluster-migration branch from 2dbd6fc to b9c1f03 on April 2, 2026 at 11:29
Base automatically changed from feat/20.4-cli-cluster-migration to main April 3, 2026 19:10
@vieiralucas vieiralucas force-pushed the feat/20.5-grpc-removal branch from 09b99bf to 47a267b on April 3, 2026 at 19:27
…sure test

- add delivery_buffer_size to ConnectOptions (test-internals feature)
- add delivery_internal_len() to FilaClient to query overflow size
- track overflow length via shared AtomicUsize counter
- rewrite backpressure test to validate no-deadlock guarantee instead
  of bounded memory (TCP backpressure removed to prevent response starvation)
- simplify leader redirect to clone ConnectOptions instead of field-by-field

@cubic-dev-ai cubic-dev-ai Bot left a comment


2 issues found across 2 files (changes from recent commits).


<file name="crates/fila-sdk/tests/backpressure.rs">

<violation number="1" location="crates/fila-sdk/tests/backpressure.rs:105">
P2: This assertion uses the separate `producer` connection, so it does not validate that the stalled consumer connection can still perform operations. A deadlock on the consumer connection would still pass this test. Use the consumer client for the enqueue to actually test the same connection.</violation>
</file>

<file name="crates/fila-sdk/src/client.rs">

<violation number="1" location="crates/fila-sdk/src/client.rs:959">
P3: `delivery_overflow_len` can underflow because error-delivery overflows aren’t counted. If a delivery decode error overflows the channel, it’s pushed into the overflow deque without a matching `fetch_add`, but later `fetch_sub(drained, …)` subtracts it anyway. Add a counter increment when pushing error items into the overflow buffer.</violation>
</file>
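The underflow in violation 1 is a pairing bug: every push into the overflow buffer needs a matching increment, error items included, because the drain path subtracts the full drained count. A sketch of the balanced bookkeeping (hypothetical names):

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};

// Sketch of balanced counter bookkeeping: increment the shared counter
// for *every* item pushed into the overflow buffer, error or delivery
// alike, so the fetch_sub on drain can never underflow.
enum Item {
    Delivery(u64),
    Error(String),
}

fn push(buf: &mut VecDeque<Item>, len: &AtomicUsize, item: Item) {
    len.fetch_add(1, Ordering::Relaxed); // count both variants
    buf.push_back(item);
}

fn counter_after_drain() -> usize {
    let mut buf = VecDeque::new();
    let len = AtomicUsize::new(0);
    push(&mut buf, &len, Item::Delivery(1));
    push(&mut buf, &len, Item::Error("decode failed".into()));

    let drained = buf.len();
    buf.clear();
    len.fetch_sub(drained, Ordering::Relaxed); // balanced: 2 in, 2 out
    len.load(Ordering::Relaxed)
}

fn main() {
    assert_eq!(counter_after_drain(), 0);
}
```

Skipping the `fetch_add` for `Error` items would make `fetch_sub(drained, …)` wrap below zero on an `AtomicUsize`, which is the permanent-backpressure failure mode the reviewer warns about.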

@vieiralucas vieiralucas merged commit 47e3b6d into main Apr 3, 2026
8 checks passed
@vieiralucas vieiralucas deleted the feat/20.5-grpc-removal branch April 3, 2026 23:50