fix: subscription webhook delivery stalling on HTTP errors#677
Open
sublimator wants to merge 21 commits into
Open
fix: subscription webhook delivery stalling on HTTP errors#677sublimator wants to merge 21 commits into
sublimator wants to merge 21 commits into
Conversation
fromNetwork() is async — it posts handlers to the io_service and returns immediately. The original sendThread() loop fires all queued events as concurrent HTTP connections at once. Under sustained load with a slow/failing endpoint, connections accumulate (each held up to 30s by RPC_NOTIFY timeout), exhausting file descriptors and breaking all network I/O for the entire process. Fix: use a local io_service per batch with .run() to block until the batch completes (same pattern as rpcClient() in RPCCall.cpp). This bounds concurrent connections to maxInFlight (32) per subscriber while still allowing parallel delivery. Also add a queue cap (maxQueueSize = 16384, ~80-160MB) so a hopelessly behind endpoint doesn't grow the deque indefinitely. Consumers detect gaps via the existing seq field. Ref: XRPLF/rippled#6341
When an HTTP response has no Content-Length header, HTTPClient reads until EOF. The EOF path in handleData logged "Complete." but never called invokeComplete(), leaving the socket held open for the full 30s deadline timeout and the completion callback never invoked. This is the likely root cause of webhook delivery permanently stalling after repeated 500 errors — many web frameworks omit Content-Length on error responses, triggering this path. Each leaked socket holds an FD for 30s, eventually exhausting the process FD budget. Includes HTTPClient_test with 12 test cases covering resource cleanup across success, error, timeout, connection-refused, concurrent request, and EOF-without-Content-Length scenarios.
- Advance mSeq when dropping events so consumers can detect gaps via sequence numbers, and log the dropped seq - Use ephemeral port (bind + close) instead of hardcoded 19999 for the connection-refused test to avoid false negatives on busy machines
Add test.net > ripple.basics dependency introduced by the new test file.
sendThread() now uses a local io_service per batch, so the app's io_service passed via make_RPCSub is dead code. Removes it from the header, constructor, factory function, and sole call site.
Resolves conflicts from the src/ripple -> src/xrpld + cmake restructure:
- src/xrpld/net/RPCSub.h: take dev's xrpld/ include paths, keep our
removal of the now-unused boost/asio/io_service.hpp include
- Builds/CMake/RippledCore.cmake: accept dev's deletion; tests are now
auto-discovered via GLOB_RECURSE over src/test, so the explicit
HTTPClient_test.cpp listing is obsolete
- Builds/levelization/results/ordering.txt: regenerated via
levelization.py (adds test.net > {xrpl.basics, xrpld.net, test.toplevel})
- src/test/net/HTTPClient_test.cpp: update ripple/ includes to xrpl//xrpld/
Backports the robustness changes that landed during review of the upstream rippled PR (XRPLF/rippled#6344) but never made it into this branch, plus the applicable Copilot suggestions. RPCSub::sendThread: - Wrap the whole dispatch loop in try/catch so mSending is reset under the lock on EVERY exit path (drain, dispatch throw, run() throw). Previously a throw from the lock-guarded dispatch block escaped sendThread() without clearing mSending, re-introducing the original stall (issue #6341) where send() never starts another job. - Bail out (reset mSending + return) on an io_service.run() exception instead of silently looping. RPCSub::send: - Rate-limit the queue-full drop warning (once per dropLogInterval) so a persistently behind endpoint can't flood the log; still advance mSeq on every drop so consumers detect the gap. HTTPClient::handleData: - Collapse the identical EOF/non-EOF completion branches into one and early-return on read error (per maintainer review). - Report success (not eof) to invokeComplete on the EOF-delimited path so callers don't treat a complete response as a failure.
Revert the eof->success translation from the previous commit. The upstream rippled #6344 deliberately keeps forwarding ecResult as-is; translating eof to success would report a truncated Content-Length response (server closes before sending the promised N body bytes) as a successful read and suppress HTTPClient::get() multi-site fallback. Keeps the structural simplification (collapsed branches, early return).
get() bound shared_from_this() into mBuild (a member), forming a this -> mBuild -> shared_ptr<this> cycle that never broke after the request completed — leaking the HTTPClientImp and its socket FD. mBuild is only ever invoked from handleRequest(), which always runs inside an async handler already holding a shared_from_this(), so a non-owning this is safe and lets the object be destroyed once the request finishes. Pre-existing (also present upstream in rippled); only the fetch path uses get(), not the webhook delivery path.
RPCSub_test (new): the webhook delivery flow control + EOF handling had zero automated coverage. Three cases drive events through a real local HTTP endpoint: delivery, delivery continues despite HTTP 500 responses (the #6341 stall, exercised via EOF-delimited error replies), and sending restarts after the queue drains (mSending reset). HTTPClient_test mock fixes: - heldSockets_: 'sendResponse=false' now genuinely holds the connection open instead of dropping the only socket shared_ptr (which closed it). testCleanupAfterTimeout now actually exercises the deadline path. - await-client-EOF on Content-Length responses: the mock decrements its connection count only when the CLIENT closes its socket, so activeConnectionCount() reflects client-side FD release rather than the server's own lifecycle — giving testGetSelfReferenceCleanup real teeth. - poll helpers replace fixed sleeps for the cross-thread count checks.
The RPCSub tests crashed/timed out under the instrumented Debug coverage build (where the full HTTP stack is ~50x slower than Release): - Deterministic teardown: wait on JobQueue::getJobCountTotal( jtCLIENT_SUBSCRIBE) == 0 before destroying the sub. sendThread captures a raw `this`; a fixed-sleep grace raced it under slow builds → UAF. - Lower per-test event counts (50 -> 10) so a handful of real HTTP deliveries finish well within the timeout under instrumentation. Queue-cap drop coverage: make_RPCSub gains a defaulted maxQueueSize parameter (production default unchanged at 16384) so a test can set a tiny cap and exercise the drop path by out-pacing delivery — enqueue is microsecond-fast while each delivery is a full round-trip — instead of queueing 16384 events. New testQueueCapDrops confirms some events are delivered and some dropped. Raises RPCSub.cpp patch coverage 62% -> 87% (overall diff 65% -> 87%); the only remaining uncovered lines are the defensive sendThread exception catches.
Two issues with sendThread, both surfaced by an independent review: UAF: the sending job captured a raw `this`. An admin Unsubscribe runs NetworkOPsImp::tryRemoveRpcSub, which erases the sub from mRpcSubMap under mSubLock and destroys it inline — while sendThread may be running on a job-queue worker under the *unrelated* InfoSub::mLock. The blocking io_service.run() (up to the 30s request deadline for a hung endpoint) makes the window large, and a hung endpoint is exactly this fix's scenario. RPCSubImp now derives enable_shared_from_this and the job captures weak_from_this(), re-locking it on entry so the sub stays alive for the batch; once it's gone the job is a no-op. Starvation: sendThread drained the entire backlog (up to 16384 events, 512 batches x up to 30s) in one job, blocking a worker thread the whole time. With a small pool shared with consensus/ledger/RPC, a few hung subscribers could occupy most workers. sendThread now processes ONE batch per job and re-queues if more remain, so the worker is released between batches and other jobs interleave.
RPCCall::fromNetwork hardcoded a 256MB read budget for responses with no Content-Length. Webhook deliveries (method "event") ignore the response body, but each in-flight delivery still pre-allocated a ~256MB streambuf on the EOF path — up to 32 (maxInFlight) x 256MB = 8GB transient, on exactly the no-Content-Length error shape this branch targets. Cap the "event" method at 1MB; genuine RPC replies (CLI) keep the 256MB budget.
The test never set noContentLength, so the mock sent Content-Length and completion happened in handleHeader, not the handleData EOF branch the test targets — it passed regardless of the fix. Force the no-Content- Length path so it covers what it claims.
Minor: the MockHTTPServer destructor closed heldSockets_ from the main thread while the server io_service thread could still be running. The held sockets have no pending async op so it was benign, but move the close to after the thread is joined to avoid the cross-thread access entirely (no lock needed once single-threaded).
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## dev #677 +/- ##
=======================================
+ Coverage 77.3% 77.4% +0.1%
=======================================
Files 837 837
Lines 78575 78605 +30
Branches 11551 11506 -45
=======================================
+ Hits 60748 60832 +84
+ Misses 17817 17763 -54
Partials 10 10
🚀 New features to boost your workflow:
|
RPCSub_test.cpp includes xrpld/core/Job.h and JobQueue.h (for jtCLIENT_SUBSCRIBE and getJobCountTotal), adding a test.net > xrpld.core edge that wasn't regenerated when those includes were added.
With per-batch re-queue, sendThread is only ever started with a non-empty deque (send() enqueues before starting a job; the re-queue fires only when the deque is non-empty), so the dispatched==0 early return was unreachable. Guard the dispatch JLOG/run() with dispatched>0 instead, which falls through to clear mSending if a batch is ever empty. Removes an uncovered branch.
Add a mock mode that sends a header promising more body than it delivers then holds the socket, so the client blocks in handleData and hits its deadline — driving the read-error branch (mShutdown != eof) that no existing test exercised.
Codex review flagged the drop test as timing-dependent: a very fast endpoint could let sendThread drain/restart between send() calls, so the deque might not stay full and no drop would occur. Add a per-reply delay to the mock endpoint so delivery is reliably slower than the microsecond-fast enqueue loop — the deque stays at the cap and drops are deterministic (received is now consistently 2/50).
eaa2d1b to
1518121
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Last updated: 2026-06-15 | branch: subscription-hooks-fix | commit: 1518121 (Fix truncated RPC response handling)
Summary
Fix subscription webhook (
url) delivery permanently stalling when endpointsreturn HTTP errors (500, 404, etc.) or responses without
Content-Length,affecting all subscribers.
Upstream issue: XRPLF/rippled#6341
Upstream PR: XRPLF/rippled#6344
(this branch ports that work and adds further hardening — flagged
[beyond #6344] below, worth folding back upstream).
The stall — two root causes
Bug 1: RPCSub unbounded concurrency
sendThread()passed the app's sharedio_servicetofromNetwork(), whichposts async handlers and returns immediately — no
.run(), no waiting:📍
src/xrpld/net/detail/RPCCall.cpp:1901-1924The entire deque was drained at full speed, firing one async HTTP connection
per event. Under sustained errors, each connection held an FD for 30s. At
100+ events/ledger every ~4s, this quickly exhausted the 1024 FD budget.
Fix: a local
io_serviceper batch (same pattern asrpcClient())with bounded concurrency (32 in-flight) and a configurable queue cap
(16384 events by default), dropping on overflow while advancing
seqsoconsumers can detect the gap:
📍
src/xrpld/net/detail/RPCSub.cpp:77-115sendThread()processes one batch per job and re-queues if more remain,so a slow endpoint can't hold a worker for the whole backlog (see worker
starvation below), and resets
mSendingunder the lock on every exit pathso the queue can never wedge (the #6341 failure mode).
Bug 2: HTTPClient EOF completion leak
When an HTTP response has no
Content-Lengthheader,HTTPClientreads untilEOF. The EOF path in
handleDatalogged "Complete." but never calledinvokeComplete()— so the completion callback never fired, the deadlinetimer was never cancelled, and (on the shared
io_service) theHTTPClientImpwas kept alive by itsshared_from_this()captures. Manyframeworks omit
Content-Lengthon error responses, so failing webhookendpoints hit this constantly.
📍
src/xrpld/net/detail/HTTPClient.cpp:447-478The EOF path now delivers the body and invokes completion. EOF is translated to
success only for EOF-delimited responses with no
Content-Length; if a serverpromises
Content-Lengthand closes early, EOF remains a transport error.RPCCall::onResponsenow rejects nonzero transport errors before parsing apartial body as JSON, so CLI RPC cannot accept a truncated-but-parseable reply.
This also changes
HTTPClient::get()behavior for no-Content-Lengthresponses: a clean EOF-delimited body is now a success rather than a fallback
error.
Hardening [beyond #6344]
Use-after-free on concurrent unsubscribe
The sending job captured a raw
this. An adminUnsubscriberunsNetworkOPsImp::tryRemoveRpcSub, which erases the sub frommRpcSubMapundermSubLockand destroys it inline — whilesendThreadmay be running on ajob-queue worker under the unrelated
InfoSub::mLock. The blockingio_service.run()(up to the 30s request deadline for a hung endpoint) makesthe window large, and a hung endpoint is exactly this fix's scenario.
RPCSubImpnow derivesenable_shared_from_thisand the job capturesweak_from_this(), re-locking it on entry so the sub stays alive for thebatch; once it's gone the job is a no-op:
📍
src/xrpld/net/detail/RPCSub.cpp:151-160Job-queue worker starvation
Draining the entire backlog (up to 16384 events, 512 batches × up to 30s) in
one job blocked a worker thread the whole time; with a small pool shared with
consensus/ledger/RPC, a few hung subscribers could occupy most workers.
sendThreadnow processes one batch per job and re-queues if more remain,releasing the worker between batches.
HTTPClient::get()self-reference leakget()boundshared_from_this()intomBuild(a member), forming athis → mBuild → shared_ptr<this>cycle that never broke — leaking the objectand its socket FD after the request completed.
mBuildis only invoked fromhandleRequest(), which always runs inside ashared_from_this()-holdingasync handler, so a non-owning
thisis safe (pre-existing; also present inrippled).
Bounded webhook response buffer
fromNetworkhardcoded a 256MB read budget for responses with noContent-Length; each in-flight EOF-path delivery pre-allocated a ~256MBstreambuf (up to 32 × 256MB = 8GB transient). Webhook deliveries (method
"event") ignore the response body, so they are now capped at 1MB; genuineRPC replies (CLI) keep the 256MB budget.
Truncated CLI RPC replies rejected
RPCCallImp::onResponseused to parsestrDataeven whenHTTPClientreported a nonzero transport error. A server that declared
Content-Length,sent a shorter but valid JSON body, and closed early could therefore be
reported as a successful CLI RPC response.
HTTPClientnow preserves EOF as anerror for truncated
Content-Lengthreplies, andonResponserejects thaterror before parsing.
Tests & coverage
HTTPClient_test(17 cases): resource cleanup across success, HTTP 500,connection refused, timeout, server-close, concurrent requests,
EOF-without-Content-Length, read-error-mid-body, persistent
io_service,and the
get()self-reference cycle. It also covers EOF-without-Content-Lengthsuccess, truncatedContent-Lengtherror propagation, andboth the
RPCCall::fromNetworkhappy path and truncated-reply rejection. Themock waits for the client to close (EOF) before decrementing, so it
verifies client-side FD release, not just its own lifecycle.
RPCSub_test(4 cases): delivery, delivery continues despite HTTP 500responses (EOF-delimited — the #6341 stall, end to end), sending restarts
after the queue drains, and queue-cap drops. Teardown waits on
JobQueue::getJobCountTotal(jtCLIENT_SUBSCRIBE) == 0so the sub isn'tdestroyed while
sendThreadruns.Test Plan
ripple.net.HTTPClientpasses (17 cases, 0 failures)ripple.net.RPCSubpasses (4 cases, 0 failures)x-testnet create-config --network mainnet --hooks-serverx-testnet hooks-server --error 500:0.5(50% HTTP 500 responses)