feat: 20.5 grpc removal — binary protocol is sole transport #162
vieiralucas merged 15 commits into main
Conversation
1 issue found across 47 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="_bmad-output/implementation-artifacts/epic-execution-state.yaml">
<violation number="1" location="_bmad-output/implementation-artifacts/epic-execution-state.yaml:35">
P3: Keep the story status at `review` until the PR is merged so epic-review can set it to done.
(Based on your team's feedback about keeping story status as `review` while PRs are open.) [FEEDBACK_USED]</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
  - id: "20.5"
    title: "gRPC Removal & Final Benchmarks"
-   status: pending
+   status: completed
P3: Keep the story status at review until the PR is merged so epic-review can set it to done.
(Based on your team's feedback about keeping story status as review while PRs are open.)
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At _bmad-output/implementation-artifacts/epic-execution-state.yaml, line 35:
<comment>Keep the story status at `review` until the PR is merged so epic-review can set it to done.
(Based on your team's feedback about keeping story status as `review` while PRs are open.) </comment>
<file context>
@@ -32,9 +32,9 @@ stories:
- id: "20.5"
title: "gRPC Removal & Final Benchmarks"
- status: pending
+ status: completed
currentPhase: ""
- branch: ""
</file context>
-   status: completed
+   status: review
2 issues found across 3 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="crates/fila-sdk/src/client.rs">
<violation number="1">
P2: `leader_client` is dropped when the async block returns, closing the TCP write half of the redirected connection. The returned `ConsumeStream` holds only the mpsc receiver — no reference to the client keeps it alive. If the server treats the write-side FIN as a disconnect, the consume stream stops receiving deliveries. Consider storing the `leader_client` alongside the stream (e.g., in a wrapper struct) so the connection stays fully open.</violation>
<violation number="2" location="crates/fila-sdk/src/client.rs:608">
P2: Stale oneshot sender left in `pending` after the timeout path. When the 100ms timeout expires (consume accepted), the code sets up `delivery_tx` and `consume_request_id` but never removes the oneshot from `inner.pending`. The receiver side is dropped, so this entry is a leak that persists until disconnect.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
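The wrapper-struct fix suggested in violation 1 can be sketched as follows. Type names are hypothetical placeholders, and `std::sync::mpsc` stands in for the SDK's tokio channel:

```rust
use std::sync::mpsc;

// Hypothetical stand-ins for the real SDK types.
pub struct Delivery;
pub struct FilaClient; // owns the TCP connection; dropping it closes the write half

/// Keeps the redirected leader client alive for as long as the consume
/// stream exists, so the connection's write half is not closed early.
pub struct ConsumeStream {
    rx: mpsc::Receiver<Delivery>,
    // Never read; held only to keep the redirected connection open.
    _leader_client: Option<FilaClient>,
}

impl ConsumeStream {
    pub fn new(rx: mpsc::Receiver<Delivery>, leader_client: Option<FilaClient>) -> Self {
        Self { rx, _leader_client: leader_client }
    }

    /// Returns the next delivery, or None once the sender side is gone.
    pub fn next(&mut self) -> Option<Delivery> {
        self.rx.recv().ok()
    }
}
```

The point of the design is ownership: the stream, not the async block, decides when the client (and hence the connection) is dropped.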
1 issue found across 3 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="crates/fila-e2e/tests/cluster.rs">
<violation number="1" location="crates/fila-e2e/tests/cluster.rs:427">
P2: Reconnect resolves to a potentially stale leader index, which can make this redirect test flaky during leader changes.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
1 issue found across 2 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="crates/fila-bench/src/benchmarks/latency.rs">
<violation number="1" location="crates/fila-bench/src/benchmarks/latency.rs:59">
P2: Background pressure messages are never cleaned up, so each run accumulates persisted backlog and can distort later benchmark results.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
2 issues found across 1 file (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="crates/fila-sdk/src/client.rs">
<violation number="1" location="crates/fila-sdk/src/client.rs:116">
P2: Doc comments say "Default: 1024" but `DEFAULT_DELIVERY_BUFFER_SIZE` is `256`. Update the docs to match the actual default.</violation>
<violation number="2" location="crates/fila-sdk/src/client.rs:119">
P1: `ConnectOptions` derives `Default`, so `delivery_buffer_size` defaults to `0`. When `consume()` calls `mpsc::channel(self.delivery_buffer_size)`, tokio panics because a buffer size of 0 is not allowed. Either give `delivery_buffer_size` a custom `Default` impl that uses `DEFAULT_DELIVERY_BUFFER_SIZE`, or add a `.max(1)` guard at the channel creation site.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
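The first fix suggested for violation 2 is a manual `Default` so the buffer size can never be zero. A minimal sketch, with the constant's value assumed from this revision of the thread:

```rust
/// Assumed value; the review thread shows it changed across commits.
const DEFAULT_DELIVERY_BUFFER_SIZE: usize = 256;

#[derive(Clone, Debug)]
pub struct ConnectOptions {
    pub delivery_buffer_size: usize,
}

// A derived Default would set delivery_buffer_size to 0, and tokio's
// mpsc::channel panics on a zero capacity, so implement Default by hand.
impl Default for ConnectOptions {
    fn default() -> Self {
        Self {
            delivery_buffer_size: DEFAULT_DELIVERY_BUFFER_SIZE,
        }
    }
}
```

The alternative guard at the creation site would be `mpsc::channel(self.delivery_buffer_size.max(1))`, which tolerates a zero from any construction path.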
2 issues found across 1 file (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="crates/fila-sdk/src/client.rs">
<violation number="1" location="crates/fila-sdk/src/client.rs:122">
P2: Doc comments for `delivery_buffer_size` and `with_delivery_buffer_size` say "Default: 1024" but the actual default is now 32. Update both doc comments to match.</violation>
<violation number="2" location="crates/fila-sdk/src/client.rs:662">
P1: The unbounded internal channel defeats TCP backpressure. The background reader never blocks on `internal_tx.send()`, so it continuously drains the TCP socket regardless of consumer speed. If the consumer is slow, the unbounded channel grows without limit — the server is never back-pressured. Consider using a bounded internal channel with a generous capacity, or having the reader itself send to the bounded user channel (which was the previous design but with `try_send` to avoid deadlock).</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
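The bounded-channel alternative mentioned in violation 2 can be sketched like this, with `std::sync::mpsc` used for illustration in place of the SDK's tokio channel and an overflow deque as the spill policy:

```rust
use std::collections::VecDeque;
use std::sync::mpsc::{SyncSender, TrySendError};

/// Forward one delivery without blocking the TCP reader: try the bounded
/// channel first; when it is full, spill to an overflow buffer instead of
/// growing an unbounded queue. Returns false if the consumer is gone.
pub fn forward_delivery<T>(
    tx: &SyncSender<T>,
    overflow: &mut VecDeque<T>,
    item: T,
) -> bool {
    match tx.try_send(item) {
        Ok(()) => true,
        Err(TrySendError::Full(item)) => {
            overflow.push_back(item);
            true
        }
        // Receiver dropped: the consume stream no longer exists.
        Err(TrySendError::Disconnected(_)) => false,
    }
}
```

This keeps the reader non-blocking (no deadlock with responses) while making growth observable, which is the trade the later commits in this PR settle on.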
1 issue found across 1 file (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="crates/fila-sdk/src/client.rs">
<violation number="1" location="crates/fila-sdk/src/client.rs:943">
P1: `notify_all_disconnected` borrows the internal sender with `ref` instead of taking it, so when `try_send` fails (channel full), the sender is never dropped. After the background reader exits, the forwarder task blocks on `recv()` forever (sender still alive), and the consumer stream hangs indefinitely. Use `.take()` to ensure the channel is closed on disconnect regardless of whether the error was delivered.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
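A sketch of the `.take()` fix from violation 1. Names are hypothetical and `std::sync::mpsc` stands in for the SDK's channel; the key point is that the sender is dropped on every path, not only when the error is delivered:

```rust
use std::sync::mpsc::SyncSender;

pub struct Inner {
    // Option so disconnect can move the sender out and drop it.
    pub internal_tx: Option<SyncSender<String>>,
}

impl Inner {
    /// On disconnect, best-effort deliver the error, but drop the sender
    /// unconditionally so the forwarder's recv() observes channel closure
    /// instead of blocking forever on a sender that is still alive.
    pub fn notify_all_disconnected(&mut self) {
        if let Some(tx) = self.internal_tx.take() {
            // May fail if the channel is full; that is fine, because `tx`
            // is dropped when it goes out of scope either way.
            let _ = tx.try_send("disconnected".to_string());
        }
    }
}
```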
1 issue found across 1 file (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="crates/fila-sdk/src/client.rs">
<violation number="1" location="crates/fila-sdk/src/client.rs:665">
P1: Using an unbounded internal delivery channel can cause unbounded memory growth when consume clients are slower than incoming delivery traffic.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
1 issue found across 3 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="crates/fila-sdk/src/client.rs">
<violation number="1" location="crates/fila-sdk/src/client.rs:697">
P1: Counter bookkeeping is unbalanced: decrementing on every forwarded item but incrementing only for successful delivery messages can underflow and permanently trigger backpressure, stalling the reader.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
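One defensive sketch for the bookkeeping issue: pair every enqueue with exactly one increment, and make the drain-side decrement saturating so a residual miscount can never wrap to `usize::MAX` and wedge backpressure on. Function names are hypothetical:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Count every item handed to the forwarder, deliveries and errors alike,
/// so the later decrement always has a matching increment.
pub fn on_enqueue(counter: &AtomicUsize) {
    counter.fetch_add(1, Ordering::SeqCst);
}

/// Saturating decrement: even if the counts drift, the counter cannot
/// wrap around and permanently trigger the high-water mark.
pub fn on_drained(counter: &AtomicUsize, drained: usize) {
    let _ = counter.fetch_update(Ordering::SeqCst, Ordering::SeqCst, |v| {
        Some(v.saturating_sub(drained))
    });
}
```

The saturation is a safety net, not the fix itself: the real fix is symmetric counting, which the next review round (error overflow counting) also calls out.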
Force-pushed from 2dbd6fc to b9c1f03
…roto — binary protocol is sole transport
…onsume blocking wait
…l channel hits high-water mark (256), with test proving bounded growth
…ure test runs in ci
- switch crypto provider from ring to aws-lc-rs (tonic no longer pulls ring)
- remove binary_addr config field reference (only listen_addr remains)
- add missing consume_error_from_code function
- fix binary_addr() references in sdk integration tests
- fix tls e2e test assert using removed variables
Force-pushed from 09b99bf to 47a267b
…sure test
- add delivery_buffer_size to ConnectOptions (test-internals feature)
- add delivery_internal_len() to FilaClient to query overflow size
- track overflow length via shared AtomicUsize counter
- rewrite backpressure test to validate no-deadlock guarantee instead of bounded memory (TCP backpressure removed to prevent response starvation)
- simplify leader redirect to clone ConnectOptions instead of field-by-field
2 issues found across 2 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="crates/fila-sdk/tests/backpressure.rs">
<violation number="1" location="crates/fila-sdk/tests/backpressure.rs:105">
P2: This assertion uses the separate `producer` connection, so it does not validate that the stalled consumer connection can still perform operations. A deadlock on the consumer connection would still pass this test. Use the consumer client for the enqueue to actually test the same connection.</violation>
</file>
<file name="crates/fila-sdk/src/client.rs">
<violation number="1" location="crates/fila-sdk/src/client.rs:959">
P3: `delivery_overflow_len` can underflow because error-delivery overflows aren’t counted. If a delivery decode error overflows the channel, it’s pushed into the overflow deque without a matching `fetch_add`, but later `fetch_sub(drained, …)` subtracts it anyway. Add a counter increment when pushing error items into the overflow buffer.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
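The underflow fix in violation 1 reduces to a single rule: every push into the overflow deque increments the counter, whatever the item's kind. A sketch with a hypothetical signature, funneling all pushes through one choke point:

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Single choke point for overflow pushes, so deliveries and decode
/// errors are counted identically and the drain side's fetch_sub always
/// has a matching fetch_add.
pub fn push_overflow<T>(buf: &mut VecDeque<T>, len: &AtomicUsize, item: T) {
    buf.push_back(item);
    len.fetch_add(1, Ordering::SeqCst);
}
```

Routing both the delivery path and the error path through a helper like this makes the invariant (counter == deque length) hold by construction rather than by inspection.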
Summary
Net code change: -3,087 lines
Test plan
cargo clippy --workspace -- -D warnings: clean
cargo fmt --check: clean
🤖 Generated with Claude Code
Summary by cubic
Removed all gRPC code and moved all client, CLI, and cluster traffic to the binary protocol on a single port. Finalized leader-aware routing and a non-blocking SDK consume pipeline with application-level buffering and a no-deadlock guarantee.
Refactors
- Removed tonic servers and admin/auth layers; deleted service.proto and admin.proto; dropped the tonic, tower, and http deps.
- Kept cluster.proto only for Raft serialization; switched to prost/prost-build with include!(env!("OUT_DIR")).
- Touched fila-core, fila-server, fila-bench, fila-e2e, and fila-sdk.
- Leader redirect now clones ConnectOptions.
- SDK sends CancelConsume on stream drop; added delivery_buffer_size (test-only) and delivery_internal_len() to validate no-deadlock behavior; the backpressure test now uses the consumer connection and also counts error overflow.
- Config values (produce, consume, admin) are handled in binary handlers; invalid values are rejected with InvalidConfigValue.
Migration
- Use listen_addr for client/CLI; remove all gRPC settings. Cluster still uses cluster.bind_addr.
Written for commit 24ee387. Summary will update on new commits.
Benchmark Results (vs main baseline)
Baseline commit: 8d2d347
PR commit: e228ea9
Threshold: 10%
Summary: 6 regressed, 8 improved, 14 unchanged