fix: subscription webhook delivery stalling on HTTP errors by sublimator · Pull Request #677 · Xahau/xahaud

sublimator · 2026-02-10T00:38:13Z

_{Last updated: 2026-06-15 | branch: subscription-hooks-fix | commit: 1518121 (Fix truncated RPC response handling)}

Summary

Fix subscription webhook (url) delivery permanently stalling when endpoints
return HTTP errors (500, 404, etc.) or responses without Content-Length,
affecting all subscribers.

Upstream issue: XRPLF/rippled#6341
Upstream PR: XRPLF/rippled#6344
(this branch ports that work and adds further hardening — flagged
[beyond #6344] below, worth folding back upstream).

Paths moved from src/ripple/net/impl/ to src/xrpld/net/detail/ when
dev was merged in; snippets below are extracted from the current tree.

The stall — two root causes

Bug 1: RPCSub unbounded concurrency

sendThread() passed the app's shared io_service to fromNetwork(), which
posts async handlers and returns immediately — no .run(), no waiting:

📍 src/xrpld/net/detail/RPCCall.cpp:1901-1924

1901     HTTPClient::request(
1902         bSSL,
1903         io_service,
1904         strIp,
1905         iPort,
1906         std::bind(
1907             &RPCCallImp::onRequest,
1908             strMethod,
1909             jvParams,
1910             headers,
1911             strPath,
1912             std::placeholders::_1,
1913             std::placeholders::_2,
1914             j),
1915         RPC_REPLY_MAX_BYTES,
1916         RPC_NOTIFY,
1917         std::bind(
1918             &RPCCallImp::onResponse,
1919             callbackFuncP,
1920             std::placeholders::_1,
1921             std::placeholders::_2,
1922             std::placeholders::_3,
1923             j),
1924         j);

The entire deque was drained at full speed, firing one async HTTP connection
per event. Under sustained errors, each connection held an FD for 30s. At
100+ events/ledger every ~4s, this quickly exhausted the 1024 FD budget.

Fix: a local io_service per batch (same pattern as rpcClient())
with bounded concurrency (32 in-flight) and a configurable queue cap
(16384 events by default), dropping on overflow while advancing seq so
consumers can detect the gap:

📍 src/xrpld/net/detail/RPCSub.cpp:77-115

  77 void
  78     send(Json::Value const& jvObj, bool broadcast) override
  79     {
  80         std::lock_guard sl(mLock);
  81 
  82         if (mDeque.size() >= maxQueueSize_)
  83         {
  84             // Always advance mSeq so consumers can detect the gap, but
  85             // rate-limit the log: a hopelessly behind endpoint drops on
  86             // every send() and would otherwise flood the log. Warn on
  87             // the first drop of a run and then once per dropLogInterval.
  88             if (mDropped++ % dropLogInterval == 0)
  89             {
  90                 JLOG(j_.warn())
  91                     << "RPCCall::fromNetwork drop: queue full ("
  92                     << mDeque.size() << "), seq=" << mSeq
  93                     << ", endpoint=" << mIp << ", dropped=" << mDropped;
  94             }
  95             ++mSeq;
  96             return;
  97         }
  98 
  99         // Endpoint caught up enough to accept again; reset so the next
 100         // overflow burst logs its first drop immediately.
 101         mDropped = 0;
 102 
 103         auto jm = broadcast ? j_.debug() : j_.info();
 104         JLOG(jm) << "RPCCall::fromNetwork push: " << jvObj;
 105 
 106         mDeque.push_back(std::make_pair(mSeq++, jvObj));
 107 
 108         if (!mSending)
 109         {
 110             // Start a sending thread.
 111             JLOG(j_.info()) << "RPCCall::fromNetwork start";
 112 
 113             startSendingJob();
 114         }
 115     }

sendThread() processes one batch per job and re-queues if more remain,
so a slow endpoint can't hold a worker for the whole backlog (see worker
starvation below), and resets mSending under the lock on every exit path
so the queue can never wedge (the #6341 failure mode).

Bug 2: HTTPClient EOF completion leak

When an HTTP response has no Content-Length header, HTTPClient reads until
EOF. The EOF path in handleData logged "Complete." but never called
invokeComplete() — so the completion callback never fired, the deadline
timer was never cancelled, and (on the shared io_service) the
HTTPClientImp was kept alive by its shared_from_this() captures. Many
frameworks omit Content-Length on error responses, so failing webhook
endpoints hit this constantly.

📍 src/xrpld/net/detail/HTTPClient.cpp:447-478

 447     void
 448     handleData(
 449         const boost::system::error_code& ecResult,
 450         std::size_t bytes_transferred)
 451     {
 452         if (!mShutdown)
 453             mShutdown = ecResult;
 454 
 455         if (mShutdown && mShutdown != boost::asio::error::eof)
 456         {
 457             JLOG(j_.trace()) << "Read error: " << mShutdown.message();
 458 
 459             invokeComplete(mShutdown);
 460             return;
 461         }
 462 
 463         // Either the read completed normally or it ended at EOF. EOF is a
 464         // successful completion for EOF-delimited responses, but it is an
 465         // error when the server promised a Content-Length and closed early.
 466         JLOG(j_.trace()) << "Complete.";
 467 
 468         mResponse.commit(bytes_transferred);
 469         std::string strBody{
 470             {std::istreambuf_iterator<char>(&mResponse)},
 471             std::istreambuf_iterator<char>()};
 472 
 473         auto completeEc = ecResult;
 474         if (completeEc == boost::asio::error::eof && !mReceivedContentLength)
 475             completeEc.clear();
 476 
 477         invokeComplete(completeEc, mStatus, mBody + strBody);
 478     }

The EOF path now delivers the body and invokes completion. EOF is translated to
success only for EOF-delimited responses with no Content-Length; if a server
promises Content-Length and closes early, EOF remains a transport error.
RPCCall::onResponse now rejects nonzero transport errors before parsing a
partial body as JSON, so CLI RPC cannot accept a truncated-but-parseable reply.
This also changes HTTPClient::get() behavior for no-Content-Length
responses: a clean EOF-delimited body is now a success rather than a fallback
error.

Hardening [beyond #6344]

Use-after-free on concurrent unsubscribe

The sending job captured a raw this. An admin Unsubscribe runs
NetworkOPsImp::tryRemoveRpcSub, which erases the sub from mRpcSubMap under
mSubLock and destroys it inline — while sendThread may be running on a
job-queue worker under the unrelated InfoSub::mLock. The blocking
io_service.run() (up to the 30s request deadline for a hung endpoint) makes
the window large, and a hung endpoint is exactly this fix's scenario.
RPCSubImp now derives enable_shared_from_this and the job captures
weak_from_this(), re-locking it on entry so the sub stays alive for the
batch; once it's gone the job is a no-op:

📍 src/xrpld/net/detail/RPCSub.cpp:151-160

 151 void
 152     startSendingJob()
 153     {
 154         std::weak_ptr<RPCSubImp> weak = weak_from_this();
 155         mSending = m_jobQueue.addJob(
 156             jtCLIENT_SUBSCRIBE, "RPCSub::sendThread", [weak]() {
 157                 if (auto self = weak.lock())
 158                     self->sendThread();
 159             });
 160     }

Job-queue worker starvation

Draining the entire backlog (up to 16384 events, 512 batches × up to 30s) in
one job blocked a worker thread the whole time; with a small pool shared with
consensus/ledger/RPC, a few hung subscribers could occupy most workers.
sendThread now processes one batch per job and re-queues if more remain,
releasing the worker between batches.

`HTTPClient::get()` self-reference leak

get() bound shared_from_this() into mBuild (a member), forming a
this → mBuild → shared_ptr<this> cycle that never broke — leaking the object
and its socket FD after the request completed. mBuild is only invoked from
handleRequest(), which always runs inside a shared_from_this()-holding
async handler, so a non-owning this is safe (pre-existing; also present in
rippled).

Bounded webhook response buffer

fromNetwork hardcoded a 256MB read budget for responses with no
Content-Length; each in-flight EOF-path delivery pre-allocated a ~256MB
streambuf (up to 32 × 256MB = 8GB transient). Webhook deliveries (method
"event") ignore the response body, so they are now capped at 1MB; genuine
RPC replies (CLI) keep the 256MB budget.

Truncated CLI RPC replies rejected

RPCCallImp::onResponse used to parse strData even when HTTPClient
reported a nonzero transport error. A server that declared Content-Length,
sent a shorter but valid JSON body, and closed early could therefore be
reported as a successful CLI RPC response. HTTPClient now preserves EOF as an
error for truncated Content-Length replies, and onResponse rejects that
error before parsing.

Tests & coverage

HTTPClient_test (17 cases): resource cleanup across success, HTTP 500,
connection refused, timeout, server-close, concurrent requests,
EOF-without-Content-Length, read-error-mid-body, persistent io_service,
and the get() self-reference cycle. It also covers EOF-without-
Content-Length success, truncated Content-Length error propagation, and
both the RPCCall::fromNetwork happy path and truncated-reply rejection. The
mock waits for the client to close (EOF) before decrementing, so it
verifies client-side FD release, not just its own lifecycle.
RPCSub_test (4 cases): delivery, delivery continues despite HTTP 500
responses (EOF-delimited — the #6341 stall, end to end), sending restarts
after the queue drains, and queue-cap drops. Teardown waits on
JobQueue::getJobCountTotal(jtCLIENT_SUBSCRIBE) == 0 so the sub isn't
destroyed while sendThread runs.
Diff/patch coverage clears the codecov gate (gcc-13).

Test Plan

Build succeeds (Release + Debug/coverage)
ripple.net.HTTPClient passes (17 cases, 0 failures)
ripple.net.RPCSub passes (4 cases, 0 failures)
Connected to Xahau mainnet with x-testnet create-config --network mainnet --hooks-server
Ran x-testnet hooks-server --error 500:0.5 (50% HTTP 500 responses)
Confirmed events continue flowing through errors
Review: consider adding backoff/limit for repeatedly failing endpoints (follow-up)

fromNetwork() is async — it posts handlers to the io_service and returns immediately. The original sendThread() loop fires all queued events as concurrent HTTP connections at once. Under sustained load with a slow/failing endpoint, connections accumulate (each held up to 30s by RPC_NOTIFY timeout), exhausting file descriptors and breaking all network I/O for the entire process. Fix: use a local io_service per batch with .run() to block until the batch completes (same pattern as rpcClient() in RPCCall.cpp). This bounds concurrent connections to maxInFlight (32) per subscriber while still allowing parallel delivery. Also add a queue cap (maxQueueSize = 16384, ~80-160MB) so a hopelessly behind endpoint doesn't grow the deque indefinitely. Consumers detect gaps via the existing seq field. Ref: XRPLF/rippled#6341

When an HTTP response has no Content-Length header, HTTPClient reads until EOF. The EOF path in handleData logged "Complete." but never called invokeComplete(), leaving the socket held open for the full 30s deadline timeout and the completion callback never invoked. This is the likely root cause of webhook delivery permanently stalling after repeated 500 errors — many web frameworks omit Content-Length on error responses, triggering this path. Each leaked socket holds an FD for 30s, eventually exhausting the process FD budget. Includes HTTPClient_test with 12 test cases covering resource cleanup across success, error, timeout, connection-refused, concurrent request, and EOF-without-Content-Length scenarios.

- Advance mSeq when dropping events so consumers can detect gaps via sequence numbers, and log the dropped seq - Use ephemeral port (bind + close) instead of hardcoded 19999 for the connection-refused test to avoid false negatives on busy machines

Add test.net > ripple.basics dependency introduced by the new test file.

sendThread() now uses a local io_service per batch, so the app's io_service passed via make_RPCSub is dead code. Removes it from the header, constructor, factory function, and sole call site.

Resolves conflicts from the src/ripple -> src/xrpld + cmake restructure: - src/xrpld/net/RPCSub.h: take dev's xrpld/ include paths, keep our removal of the now-unused boost/asio/io_service.hpp include - Builds/CMake/RippledCore.cmake: accept dev's deletion; tests are now auto-discovered via GLOB_RECURSE over src/test, so the explicit HTTPClient_test.cpp listing is obsolete - Builds/levelization/results/ordering.txt: regenerated via levelization.py (adds test.net > {xrpl.basics, xrpld.net, test.toplevel}) - src/test/net/HTTPClient_test.cpp: update ripple/ includes to xrpl//xrpld/

Backports the robustness changes that landed during review of the upstream rippled PR (XRPLF/rippled#6344) but never made it into this branch, plus the applicable Copilot suggestions. RPCSub::sendThread: - Wrap the whole dispatch loop in try/catch so mSending is reset under the lock on EVERY exit path (drain, dispatch throw, run() throw). Previously a throw from the lock-guarded dispatch block escaped sendThread() without clearing mSending, re-introducing the original stall (issue #6341) where send() never starts another job. - Bail out (reset mSending + return) on an io_service.run() exception instead of silently looping. RPCSub::send: - Rate-limit the queue-full drop warning (once per dropLogInterval) so a persistently behind endpoint can't flood the log; still advance mSeq on every drop so consumers detect the gap. HTTPClient::handleData: - Collapse the identical EOF/non-EOF completion branches into one and early-return on read error (per maintainer review). - Report success (not eof) to invokeComplete on the EOF-delimited path so callers don't treat a complete response as a failure.

Revert the eof->success translation from the previous commit. The upstream rippled #6344 deliberately keeps forwarding ecResult as-is; translating eof to success would report a truncated Content-Length response (server closes before sending the promised N body bytes) as a successful read and suppress HTTPClient::get() multi-site fallback. Keeps the structural simplification (collapsed branches, early return).

get() bound shared_from_this() into mBuild (a member), forming a this -> mBuild -> shared_ptr<this> cycle that never broke after the request completed — leaking the HTTPClientImp and its socket FD. mBuild is only ever invoked from handleRequest(), which always runs inside an async handler already holding a shared_from_this(), so a non-owning this is safe and lets the object be destroyed once the request finishes. Pre-existing (also present upstream in rippled); only the fetch path uses get(), not the webhook delivery path.

RPCSub_test (new): the webhook delivery flow control + EOF handling had zero automated coverage. Three cases drive events through a real local HTTP endpoint: delivery, delivery continues despite HTTP 500 responses (the #6341 stall, exercised via EOF-delimited error replies), and sending restarts after the queue drains (mSending reset). HTTPClient_test mock fixes: - heldSockets_: 'sendResponse=false' now genuinely holds the connection open instead of dropping the only socket shared_ptr (which closed it). testCleanupAfterTimeout now actually exercises the deadline path. - await-client-EOF on Content-Length responses: the mock decrements its connection count only when the CLIENT closes its socket, so activeConnectionCount() reflects client-side FD release rather than the server's own lifecycle — giving testGetSelfReferenceCleanup real teeth. - poll helpers replace fixed sleeps for the cross-thread count checks.

The RPCSub tests crashed/timed out under the instrumented Debug coverage build (where the full HTTP stack is ~50x slower than Release): - Deterministic teardown: wait on JobQueue::getJobCountTotal( jtCLIENT_SUBSCRIBE) == 0 before destroying the sub. sendThread captures a raw `this`; a fixed-sleep grace raced it under slow builds → UAF. - Lower per-test event counts (50 -> 10) so a handful of real HTTP deliveries finish well within the timeout under instrumentation. Queue-cap drop coverage: make_RPCSub gains a defaulted maxQueueSize parameter (production default unchanged at 16384) so a test can set a tiny cap and exercise the drop path by out-pacing delivery — enqueue is microsecond-fast while each delivery is a full round-trip — instead of queueing 16384 events. New testQueueCapDrops confirms some events are delivered and some dropped. Raises RPCSub.cpp patch coverage 62% -> 87% (overall diff 65% -> 87%); the only remaining uncovered lines are the defensive sendThread exception catches.

Two issues with sendThread, both surfaced by an independent review: UAF: the sending job captured a raw `this`. An admin Unsubscribe runs NetworkOPsImp::tryRemoveRpcSub, which erases the sub from mRpcSubMap under mSubLock and destroys it inline — while sendThread may be running on a job-queue worker under the *unrelated* InfoSub::mLock. The blocking io_service.run() (up to the 30s request deadline for a hung endpoint) makes the window large, and a hung endpoint is exactly this fix's scenario. RPCSubImp now derives enable_shared_from_this and the job captures weak_from_this(), re-locking it on entry so the sub stays alive for the batch; once it's gone the job is a no-op. Starvation: sendThread drained the entire backlog (up to 16384 events, 512 batches x up to 30s) in one job, blocking a worker thread the whole time. With a small pool shared with consensus/ledger/RPC, a few hung subscribers could occupy most workers. sendThread now processes ONE batch per job and re-queues if more remain, so the worker is released between batches and other jobs interleave.

RPCCall::fromNetwork hardcoded a 256MB read budget for responses with no Content-Length. Webhook deliveries (method "event") ignore the response body, but each in-flight delivery still pre-allocated a ~256MB streambuf on the EOF path — up to 32 (maxInFlight) x 256MB = 8GB transient, on exactly the no-Content-Length error shape this branch targets. Cap the "event" method at 1MB; genuine RPC replies (CLI) keep the 256MB budget.

The test never set noContentLength, so the mock sent Content-Length and completion happened in handleHeader, not the handleData EOF branch the test targets — it passed regardless of the fix. Force the no-Content- Length path so it covers what it claims.

Minor: the MockHTTPServer destructor closed heldSockets_ from the main thread while the server io_service thread could still be running. The held sockets have no pending async op so it was benign, but move the close to after the thread is joined to avoid the cross-thread access entirely (no lock needed once single-threaded).

codecov · 2026-06-10T11:16:01Z

Codecov Report

❌ Patch coverage is 83.05085% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.4%. Comparing base (c55420b) to head (1518121).

Files with missing lines	Patch %	Lines
src/xrpld/net/detail/RPCSub.cpp	76.7%	10 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             dev    #677     +/-   ##
=======================================
+ Coverage   77.3%   77.4%   +0.1%     
=======================================
  Files        837     837             
  Lines      78575   78605     +30     
  Branches   11551   11506     -45     
=======================================
+ Hits       60748   60832     +84     
+ Misses     17817   17763     -54     
  Partials      10      10

Files with missing lines	Coverage Δ
src/xrpld/net/detail/HTTPClient.cpp	`84.7% <100.0%> (+16.5%)`	⬆️
src/xrpld/net/detail/RPCCall.cpp	`86.7% <100.0%> (+0.1%)`	⬆️
src/xrpld/rpc/handlers/Subscribe.cpp	`90.9% <ø> (-<0.1%)`	⬇️
src/xrpld/net/detail/RPCSub.cpp	`82.7% <76.7%> (+38.7%)`	⬆️

... and 12 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

RPCSub_test.cpp includes xrpld/core/Job.h and JobQueue.h (for jtCLIENT_SUBSCRIBE and getJobCountTotal), adding a test.net > xrpld.core edge that wasn't regenerated when those includes were added.

With per-batch re-queue, sendThread is only ever started with a non-empty deque (send() enqueues before starting a job; the re-queue fires only when the deque is non-empty), so the dispatched==0 early return was unreachable. Guard the dispatch JLOG/run() with dispatched>0 instead, which falls through to clear mSending if a batch is ever empty. Removes an uncovered branch.

Add a mock mode that sends a header promising more body than it delivers then holds the socket, so the client blocks in handleData and hits its deadline — driving the read-error branch (mShutdown != eof) that no existing test exercised.

Codex review flagged the drop test as timing-dependent: a very fast endpoint could let sendThread drain/restart between send() calls, so the deque might not stay full and no drop would occur. Add a per-reply delay to the mock endpoint so delivery is reliably slower than the microsecond-fast enqueue loop — the deque stays at the cap and drops are deterministic (received is now consistently 2/50).

sublimator added 4 commits February 9, 2026 18:31

chore: update levelization ordering for HTTPClient_test

a6dfb40

Add test.net > ripple.basics dependency introduced by the new test file.

sublimator marked this pull request as ready for review February 10, 2026 01:32

refactor: remove unused io_service parameter from RPCSub

7241fed

sendThread() now uses a local io_service per batch, so the app's io_service passed via make_RPCSub is dead code. Removes it from the header, constructor, factory function, and sole call site.

RichardAH added the has conflicts label Apr 29, 2026

sublimator added 10 commits June 9, 2026 14:53

sublimator removed the has conflicts label Jun 11, 2026

sublimator added 5 commits June 11, 2026 13:43

chore: update levelization ordering for RPCSub_test xrpld.core deps

db7ecfb

RPCSub_test.cpp includes xrpld/core/Job.h and JobQueue.h (for jtCLIENT_SUBSCRIBE and getJobCountTotal), adding a test.net > xrpld.core edge that wasn't regenerated when those includes were added.

Fix truncated RPC response handling

1518121

sublimator force-pushed the subscription-hooks-fix branch from eaa2d1b to 1518121 Compare June 15, 2026 06:29

Merge remote-tracking branch 'origin/dev' into subscription-hooks-fix

93910f6

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: subscription webhook delivery stalling on HTTP errors#677

fix: subscription webhook delivery stalling on HTTP errors#677
sublimator wants to merge 21 commits into
devfrom
subscription-hooks-fix

sublimator commented Feb 10, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jun 10, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

sublimator commented Feb 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

The stall — two root causes

Bug 1: RPCSub unbounded concurrency

Bug 2: HTTPClient EOF completion leak

Hardening [beyond #6344]

Use-after-free on concurrent unsubscribe

Job-queue worker starvation

HTTPClient::get() self-reference leak

Bounded webhook response buffer

Truncated CLI RPC replies rejected

Tests & coverage

Test Plan

Uh oh!

codecov Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

sublimator commented Feb 10, 2026 •

edited

Loading

`HTTPClient::get()` self-reference leak

codecov Bot commented Jun 10, 2026 •

edited

Loading