Skip to content

fix: integration test readiness for fibp#7

Merged
vieiralucas merged 4 commits intomainfrom
fix/fibp-readiness-and-protocol
Mar 26, 2026
Merged

fix: integration test readiness for fibp#7
vieiralucas merged 4 commits intomainfrom
fix/fibp-readiness-and-protocol

Conversation

@vieiralucas
Copy link
Copy Markdown
Member

@vieiralucas vieiralucas commented Mar 26, 2026

Summary

This PR fixes multiple bugs that caused integration tests to fail with "fila-server failed to start within 10s" and other errors. The root cause was a combination of an insufficient readiness probe and several protocol encoding bugs.

Readiness probe (test/test_helper.rb)

  • Was: plain TCP connect — succeeds as soon as the listener binds, before FIBP is ready
  • Now: full FIBP handshake probe (FIBP\x01\x00 → echo) so we only proceed when the server can actually serve FIBP requests
  • mTLS: probe now supplies client cert/key when the server requires mTLS, so the TLS handshake succeeds during the probe
  • Port race: retry server spawn up to 3 times when another process races to bind the chosen port

Admin create_queue encoding (test/test_helper.rb)

  • Admin ops use protobuf-encoded payloads. The old code sent a custom binary format; the server rejected it with "failed to decode Protobuf message: invalid tag value: 0"
  • Fixed: hand-encode CreateQueueRequest { name: "..." } in proto3 wire format (field 1, wire type 2)

Consume push routing (lib/fila/transport.rb)

  • Server-push frames carry corr_id=0 regardless of the consume request's corr_id. The push queue was registered only at the request corr_id, so push frames were silently dropped
  • Fixed: register push queue at corr_id=0; use a separate one-shot queue for the consume ACK

OP_AUTH payload format (lib/fila/transport.rb)

  • Auth frame payload is raw key bytes — no length prefix. The old code prepended a u16 length, which the server decoded as the first two bytes of the key, causing "invalid or missing api key"

OP_ERROR frame parsing (lib/fila/transport.rb)

  • OP_ERROR frames carry plain UTF-8 text, not code:u16 + len:u16 + msg. The old parser read garbage bytes as an error code
  • Fixed: parse text directly; derive error type (RPCError, QueueNotFoundError, etc.) from message content

Codec module function visibility (lib/fila/codec.rb)

  • module_function only applies to methods defined before private. Helper methods (encode_message, encode_str16, read_*) were after private, making them unavailable as Codec.method_name from the batcher thread
  • Fixed: move all helpers before private; remove the private block entirely
  • alias decode_nack_response replaced with an explicit def (aliases don't inherit module_function)

Consume push wire format (lib/fila/codec.rb)

  • Decoder had a phantom queue_id: string16 field that doesn't exist in the server's wire format, causing nil.unpack1 on every consumed message
  • Fixed: remove the field; pass queue name from the caller (Ruby SDK stores it from the consume request, same as the Rust SDK)

Batcher error propagation (lib/fila/batcher.rb)

  • Re-wrapping exceptions in Fila::Error.new(e.message) lost the specific type (QueueNotFoundError, RPCError). Tests asserting assert_raises(Fila::RPCError) would fail
  • Fixed: broadcast the original exception directly

Test plan

  • All 28 integration tests pass locally with the dev binary
  • test/test_client.rb — 3/3 passing
  • test/test_batch.rb — 16/16 passing
  • test/test_tls_auth.rb — 9/9 passing (TLS, mTLS, API key, backward compat)

🤖 Generated with Claude Code


Summary by cubic

Fixes flaky integration tests by using a real FIBP readiness probe, correcting protocol mismatches, and guarding TLS tests when the server lacks TLS. Startup is reliable now, server push/auth/error paths match the wire format, and corr_id wraparound no longer collides with server-push.

  • Bug Fixes
    • Readiness: perform a full FIBP handshake (with mTLS when needed); retry port bind up to 3 times.
    • Admin: send protobuf-encoded CreateQueueRequest for create_queue.
    • Consume push: register push queue at corr_id=0 and use a one-shot ACK queue; client passes queue name to the decoder.
    • Correlation IDs: reserve 0 for server-push; cycle request IDs through 1..0xFFFFFFFF to avoid collisions.
    • Auth: OP_AUTH payload is raw API key bytes (no length prefix).
    • Errors: parse OP_ERROR as UTF-8 text and map to typed errors by message.
    • Codec: expose helpers via module_function, define decode_nack_response, remove the non-existent queue_id from consume push decoding.
    • Batcher: re-broadcast the original exception type (preserves QueueNotFoundError, RPCError, etc.).
    • TLS tests: probe TLS support at load time and skip when unsupported to avoid server panics.

Written for commit 86dcc4f. Summary will update on new commits.

fix(test_helper): replace plain tcp connect with fibp handshake probe
- plain tcp accept fires before server finishes fibp init; blocking
  handshake probe guarantees the server is ready to accept connections
- add mtls client-cert support in probe for servers requiring client certs
- retry server spawn up to 3 times to handle port toctou race condition

fix(transport): correct consume push routing (corr_id=0) and auth frame
- server-push frames carry corr_id=0; register push queue at 0 in addition
  to the consume request corr_id so push frames are dispatched correctly
- use separate one-shot ack_q for the consume ack so push frames and the
  ack don't land on the same queue
- op_auth payload is raw key bytes (no length prefix) — matches rust sdk
- parse_error_frame: op_error payloads are plain utf-8 text, not
  code:u16+len:u16+msg; derive error type from message content

fix(codec): all module_function helpers must precede private keyword
- encode_message, encode_str16, and all read_* helpers moved before
  private so module_function makes them callable as codec.method_name
  from the batcher background thread and other module-function contexts
- decode_nack_response alias replaced with explicit def (aliases do not
  inherit module_function)

fix(codec): remove phantom queue_id field from decode_consume_push
- server wire format is: count | msg_id | fairness_key | attempt_count |
  headers | payload — no queue_id field; pass queue name from caller

fix(client): pass queue name to decode_consume_push

fix(batcher): re-broadcast original exception type to preserve rpc errors
- wrapping in fila::error.new loses queuenotfounderror / rpcerror types

fix(test_helper): create_queue uses protobuf-encoded createqueuerequest
- admin ops use protobuf payloads; old binary format caused decode errors
- remove now-unnecessary rubocop:disable comments after method refactors
- extract auth/queue/message error predicates from parse_error_frame to
  lower cyclomatic complexity below the configured threshold
- replace io.select with sock.wait_readable per lint/incompatibleioselect
  rubocop cop recommendation
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 5 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="lib/fila/transport.rb">

<violation number="1" location="lib/fila/transport.rb:122">
P2: Using `@pending[0]` for push frames can collide with normal request correlation IDs when `next_corr_id` wraps to 0, breaking frame routing.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread lib/fila/transport.rb
probe tls support at load time using a minimal self-signed cert.
some bleeding-edge server binaries panic on tls startup due to a
missing rustls crypto provider (fixed in faiscadev/fila pr #140
but not yet released to dev-latest). tls tests guard themselves
with fila_tls_available so the suite does not fail when the server
binary lacks tls support.
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 2 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="test/test_tls_auth.rb">

<violation number="1" location="test/test_tls_auth.rb:6">
P2: This file-level guard now skips non-TLS tests whenever TLS is unavailable, reducing integration coverage for API-key and backward-compat paths.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread test/test_tls_auth.rb
require 'openssl'

return unless FILA_SERVER_AVAILABLE
return unless FILA_SERVER_AVAILABLE && FILA_TLS_AVAILABLE
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai bot Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: This file-level guard now skips non-TLS tests whenever TLS is unavailable, reducing integration coverage for API-key and backward-compat paths.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At test/test_tls_auth.rb, line 6:

<comment>This file-level guard now skips non-TLS tests whenever TLS is unavailable, reducing integration coverage for API-key and backward-compat paths.</comment>

<file context>
@@ -3,7 +3,7 @@
 require 'openssl'
 
-return unless FILA_SERVER_AVAILABLE
+return unless FILA_SERVER_AVAILABLE && FILA_TLS_AVAILABLE
 
 # Helper to generate self-signed CA and server/client certificates for tests.
</file context>
Fix with Cubic

…0xffffffff

corr_id=0 is permanently allocated to the server-push channel.
the previous implementation wrapped next_corr_id through 0, which
would have collided with the push channel after 2^32 requests.
identified by cubic.
@vieiralucas vieiralucas merged commit 6ed65c8 into main Mar 26, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant