Skip to content

fix: subscription webhook delivery stalling on HTTP errors#677

Open
sublimator wants to merge 21 commits into
devfrom
subscription-hooks-fix
Open

fix: subscription webhook delivery stalling on HTTP errors#677
sublimator wants to merge 21 commits into
devfrom
subscription-hooks-fix

Conversation

@sublimator

@sublimator sublimator commented Feb 10, 2026

Copy link
Copy Markdown
Collaborator

Last updated: 2026-06-15 | branch: subscription-hooks-fix | commit: 1518121 (Fix truncated RPC response handling)


Summary

Fix subscription webhook (url) delivery permanently stalling when endpoints
return HTTP errors (500, 404, etc.) or responses without Content-Length,
affecting all subscribers.

Upstream issue: XRPLF/rippled#6341
Upstream PR: XRPLF/rippled#6344
(this branch ports that work and adds further hardening — flagged
[beyond #6344] below, worth folding back upstream).

Paths moved from src/ripple/net/impl/ to src/xrpld/net/detail/ when
dev was merged in; snippets below are extracted from the current tree.

The stall — two root causes

Bug 1: RPCSub unbounded concurrency

sendThread() passed the app's shared io_service to fromNetwork(), which
posts async handlers and returns immediately — no .run(), no waiting:

📍 src/xrpld/net/detail/RPCCall.cpp:1901-1924

1901     HTTPClient::request(
1902         bSSL,
1903         io_service,
1904         strIp,
1905         iPort,
1906         std::bind(
1907             &RPCCallImp::onRequest,
1908             strMethod,
1909             jvParams,
1910             headers,
1911             strPath,
1912             std::placeholders::_1,
1913             std::placeholders::_2,
1914             j),
1915         RPC_REPLY_MAX_BYTES,
1916         RPC_NOTIFY,
1917         std::bind(
1918             &RPCCallImp::onResponse,
1919             callbackFuncP,
1920             std::placeholders::_1,
1921             std::placeholders::_2,
1922             std::placeholders::_3,
1923             j),
1924         j);

The entire deque was drained at full speed, firing one async HTTP connection
per event. Under sustained errors, each connection held an FD for 30s. At
100+ events/ledger every ~4s, this quickly exhausted the 1024 FD budget.

Fix: a local io_service per batch (same pattern as rpcClient())
with bounded concurrency (32 in-flight) and a configurable queue cap
(16384 events by default), dropping on overflow while advancing seq so
consumers can detect the gap:

📍 src/xrpld/net/detail/RPCSub.cpp:77-115

  77 void
  78     send(Json::Value const& jvObj, bool broadcast) override
  79     {
  80         std::lock_guard sl(mLock);
  81 
  82         if (mDeque.size() >= maxQueueSize_)
  83         {
  84             // Always advance mSeq so consumers can detect the gap, but
  85             // rate-limit the log: a hopelessly behind endpoint drops on
  86             // every send() and would otherwise flood the log. Warn on
  87             // the first drop of a run and then once per dropLogInterval.
  88             if (mDropped++ % dropLogInterval == 0)
  89             {
  90                 JLOG(j_.warn())
  91                     << "RPCCall::fromNetwork drop: queue full ("
  92                     << mDeque.size() << "), seq=" << mSeq
  93                     << ", endpoint=" << mIp << ", dropped=" << mDropped;
  94             }
  95             ++mSeq;
  96             return;
  97         }
  98 
  99         // Endpoint caught up enough to accept again; reset so the next
 100         // overflow burst logs its first drop immediately.
 101         mDropped = 0;
 102 
 103         auto jm = broadcast ? j_.debug() : j_.info();
 104         JLOG(jm) << "RPCCall::fromNetwork push: " << jvObj;
 105 
 106         mDeque.push_back(std::make_pair(mSeq++, jvObj));
 107 
 108         if (!mSending)
 109         {
 110             // Start a sending thread.
 111             JLOG(j_.info()) << "RPCCall::fromNetwork start";
 112 
 113             startSendingJob();
 114         }
 115     }

sendThread() processes one batch per job and re-queues if more remain,
so a slow endpoint can't hold a worker for the whole backlog (see worker
starvation
below), and resets mSending under the lock on every exit path
so the queue can never wedge (the #6341 failure mode).

Bug 2: HTTPClient EOF completion leak

When an HTTP response has no Content-Length header, HTTPClient reads until
EOF. The EOF path in handleData logged "Complete." but never called
invokeComplete()
— so the completion callback never fired, the deadline
timer was never cancelled, and (on the shared io_service) the
HTTPClientImp was kept alive by its shared_from_this() captures. Many
frameworks omit Content-Length on error responses, so failing webhook
endpoints hit this constantly.

📍 src/xrpld/net/detail/HTTPClient.cpp:447-478

 447     void
 448     handleData(
 449         const boost::system::error_code& ecResult,
 450         std::size_t bytes_transferred)
 451     {
 452         if (!mShutdown)
 453             mShutdown = ecResult;
 454 
 455         if (mShutdown && mShutdown != boost::asio::error::eof)
 456         {
 457             JLOG(j_.trace()) << "Read error: " << mShutdown.message();
 458 
 459             invokeComplete(mShutdown);
 460             return;
 461         }
 462 
 463         // Either the read completed normally or it ended at EOF. EOF is a
 464         // successful completion for EOF-delimited responses, but it is an
 465         // error when the server promised a Content-Length and closed early.
 466         JLOG(j_.trace()) << "Complete.";
 467 
 468         mResponse.commit(bytes_transferred);
 469         std::string strBody{
 470             {std::istreambuf_iterator<char>(&mResponse)},
 471             std::istreambuf_iterator<char>()};
 472 
 473         auto completeEc = ecResult;
 474         if (completeEc == boost::asio::error::eof && !mReceivedContentLength)
 475             completeEc.clear();
 476 
 477         invokeComplete(completeEc, mStatus, mBody + strBody);
 478     }

The EOF path now delivers the body and invokes completion. EOF is translated to
success only for EOF-delimited responses with no Content-Length; if a server
promises Content-Length and closes early, EOF remains a transport error.
RPCCall::onResponse now rejects nonzero transport errors before parsing a
partial body as JSON, so CLI RPC cannot accept a truncated-but-parseable reply.
This also changes HTTPClient::get() behavior for no-Content-Length
responses: a clean EOF-delimited body is now a success rather than a fallback
error.

Hardening [beyond #6344]

Use-after-free on concurrent unsubscribe

The sending job captured a raw this. An admin Unsubscribe runs
NetworkOPsImp::tryRemoveRpcSub, which erases the sub from mRpcSubMap under
mSubLock and destroys it inline — while sendThread may be running on a
job-queue worker under the unrelated InfoSub::mLock. The blocking
io_service.run() (up to the 30s request deadline for a hung endpoint) makes
the window large, and a hung endpoint is exactly this fix's scenario.
RPCSubImp now derives enable_shared_from_this and the job captures
weak_from_this(), re-locking it on entry so the sub stays alive for the
batch; once it's gone the job is a no-op:

📍 src/xrpld/net/detail/RPCSub.cpp:151-160

 151 void
 152     startSendingJob()
 153     {
 154         std::weak_ptr<RPCSubImp> weak = weak_from_this();
 155         mSending = m_jobQueue.addJob(
 156             jtCLIENT_SUBSCRIBE, "RPCSub::sendThread", [weak]() {
 157                 if (auto self = weak.lock())
 158                     self->sendThread();
 159             });
 160     }

Job-queue worker starvation

Draining the entire backlog (up to 16384 events, 512 batches × up to 30s) in
one job blocked a worker thread the whole time; with a small pool shared with
consensus/ledger/RPC, a few hung subscribers could occupy most workers.
sendThread now processes one batch per job and re-queues if more remain,
releasing the worker between batches.

HTTPClient::get() self-reference leak

get() bound shared_from_this() into mBuild (a member), forming a
this → mBuild → shared_ptr<this> cycle that never broke — leaking the object
and its socket FD after the request completed. mBuild is only invoked from
handleRequest(), which always runs inside a shared_from_this()-holding
async handler, so a non-owning this is safe (pre-existing; also present in
rippled).

Bounded webhook response buffer

fromNetwork hardcoded a 256MB read budget for responses with no
Content-Length; each in-flight EOF-path delivery pre-allocated a ~256MB
streambuf (up to 32 × 256MB = 8GB transient). Webhook deliveries (method
"event") ignore the response body, so they are now capped at 1MB; genuine
RPC replies (CLI) keep the 256MB budget.

Truncated CLI RPC replies rejected

RPCCallImp::onResponse used to parse strData even when HTTPClient
reported a nonzero transport error. A server that declared Content-Length,
sent a shorter but valid JSON body, and closed early could therefore be
reported as a successful CLI RPC response. HTTPClient now preserves EOF as an
error for truncated Content-Length replies, and onResponse rejects that
error before parsing.

Tests & coverage

  • HTTPClient_test (17 cases): resource cleanup across success, HTTP 500,
    connection refused, timeout, server-close, concurrent requests,
    EOF-without-Content-Length, read-error-mid-body, persistent io_service,
    and the get() self-reference cycle. It also covers EOF-without-
    Content-Length success, truncated Content-Length error propagation, and
    both the RPCCall::fromNetwork happy path and truncated-reply rejection. The
    mock waits for the client to close (EOF) before decrementing, so it
    verifies client-side FD release, not just its own lifecycle.
  • RPCSub_test (4 cases): delivery, delivery continues despite HTTP 500
    responses (EOF-delimited — the #6341 stall, end to end), sending restarts
    after the queue drains, and queue-cap drops. Teardown waits on
    JobQueue::getJobCountTotal(jtCLIENT_SUBSCRIBE) == 0 so the sub isn't
    destroyed while sendThread runs.
  • Diff/patch coverage clears the codecov gate (gcc-13).

Test Plan

  • Build succeeds (Release + Debug/coverage)
  • ripple.net.HTTPClient passes (17 cases, 0 failures)
  • ripple.net.RPCSub passes (4 cases, 0 failures)
  • Connected to Xahau mainnet with x-testnet create-config --network mainnet --hooks-server
  • Ran x-testnet hooks-server --error 500:0.5 (50% HTTP 500 responses)
  • Confirmed events continue flowing through errors
  • Review: consider adding backoff/limit for repeatedly failing endpoints (follow-up)

fromNetwork() is async — it posts handlers to the io_service and
returns immediately. The original sendThread() loop fires all queued
events as concurrent HTTP connections at once. Under sustained load
with a slow/failing endpoint, connections accumulate (each held up to
30s by RPC_NOTIFY timeout), exhausting file descriptors and breaking
all network I/O for the entire process.

Fix: use a local io_service per batch with .run() to block until the
batch completes (same pattern as rpcClient() in RPCCall.cpp). This
bounds concurrent connections to maxInFlight (32) per subscriber while
still allowing parallel delivery.

Also add a queue cap (maxQueueSize = 16384, ~80-160MB) so a hopelessly
behind endpoint doesn't grow the deque indefinitely. Consumers detect
gaps via the existing seq field.

Ref: XRPLF/rippled#6341
When an HTTP response has no Content-Length header, HTTPClient reads
until EOF. The EOF path in handleData logged "Complete." but never
called invokeComplete(), leaving the socket held open for the full
30s deadline timeout and the completion callback never invoked.

This is the likely root cause of webhook delivery permanently stalling
after repeated 500 errors — many web frameworks omit Content-Length on
error responses, triggering this path. Each leaked socket holds an FD
for 30s, eventually exhausting the process FD budget.

Includes HTTPClient_test with 12 test cases covering resource cleanup
across success, error, timeout, connection-refused, concurrent request,
and EOF-without-Content-Length scenarios.
- Advance mSeq when dropping events so consumers can detect gaps via
  sequence numbers, and log the dropped seq
- Use ephemeral port (bind + close) instead of hardcoded 19999 for the
  connection-refused test to avoid false negatives on busy machines
Add test.net > ripple.basics dependency introduced by the new test file.
@sublimator sublimator marked this pull request as ready for review February 10, 2026 01:32
sendThread() now uses a local io_service per batch, so the app's
io_service passed via make_RPCSub is dead code. Removes it from the
header, constructor, factory function, and sole call site.
sublimator added 10 commits June 9, 2026 14:53
Resolves conflicts from the src/ripple -> src/xrpld + cmake restructure:
- src/xrpld/net/RPCSub.h: take dev's xrpld/ include paths, keep our
  removal of the now-unused boost/asio/io_service.hpp include
- Builds/CMake/RippledCore.cmake: accept dev's deletion; tests are now
  auto-discovered via GLOB_RECURSE over src/test, so the explicit
  HTTPClient_test.cpp listing is obsolete
- Builds/levelization/results/ordering.txt: regenerated via
  levelization.py (adds test.net > {xrpl.basics, xrpld.net, test.toplevel})
- src/test/net/HTTPClient_test.cpp: update ripple/ includes to xrpl//xrpld/
Backports the robustness changes that landed during review of the
upstream rippled PR (XRPLF/rippled#6344) but never made it into this
branch, plus the applicable Copilot suggestions.

RPCSub::sendThread:
- Wrap the whole dispatch loop in try/catch so mSending is reset under
  the lock on EVERY exit path (drain, dispatch throw, run() throw).
  Previously a throw from the lock-guarded dispatch block escaped
  sendThread() without clearing mSending, re-introducing the original
  stall (issue #6341) where send() never starts another job.
- Bail out (reset mSending + return) on an io_service.run() exception
  instead of silently looping.

RPCSub::send:
- Rate-limit the queue-full drop warning (once per dropLogInterval) so
  a persistently behind endpoint can't flood the log; still advance
  mSeq on every drop so consumers detect the gap.

HTTPClient::handleData:
- Collapse the identical EOF/non-EOF completion branches into one and
  early-return on read error (per maintainer review).
- Report success (not eof) to invokeComplete on the EOF-delimited path
  so callers don't treat a complete response as a failure.
Revert the eof->success translation from the previous commit. The
upstream rippled #6344 deliberately keeps forwarding ecResult as-is;
translating eof to success would report a truncated Content-Length
response (server closes before sending the promised N body bytes) as a
successful read and suppress HTTPClient::get() multi-site fallback.
Keeps the structural simplification (collapsed branches, early return).
get() bound shared_from_this() into mBuild (a member), forming a
this -> mBuild -> shared_ptr<this> cycle that never broke after the
request completed — leaking the HTTPClientImp and its socket FD. mBuild
is only ever invoked from handleRequest(), which always runs inside an
async handler already holding a shared_from_this(), so a non-owning
this is safe and lets the object be destroyed once the request finishes.

Pre-existing (also present upstream in rippled); only the fetch path
uses get(), not the webhook delivery path.
RPCSub_test (new): the webhook delivery flow control + EOF handling had
zero automated coverage. Three cases drive events through a real local
HTTP endpoint: delivery, delivery continues despite HTTP 500 responses
(the #6341 stall, exercised via EOF-delimited error replies), and
sending restarts after the queue drains (mSending reset).

HTTPClient_test mock fixes:
- heldSockets_: 'sendResponse=false' now genuinely holds the connection
  open instead of dropping the only socket shared_ptr (which closed it).
  testCleanupAfterTimeout now actually exercises the deadline path.
- await-client-EOF on Content-Length responses: the mock decrements its
  connection count only when the CLIENT closes its socket, so
  activeConnectionCount() reflects client-side FD release rather than the
  server's own lifecycle — giving testGetSelfReferenceCleanup real teeth.
- poll helpers replace fixed sleeps for the cross-thread count checks.
The RPCSub tests crashed/timed out under the instrumented Debug coverage
build (where the full HTTP stack is ~50x slower than Release):

- Deterministic teardown: wait on JobQueue::getJobCountTotal(
  jtCLIENT_SUBSCRIBE) == 0 before destroying the sub. sendThread captures
  a raw `this`; a fixed-sleep grace raced it under slow builds → UAF.
- Lower per-test event counts (50 -> 10) so a handful of real HTTP
  deliveries finish well within the timeout under instrumentation.

Queue-cap drop coverage: make_RPCSub gains a defaulted maxQueueSize
parameter (production default unchanged at 16384) so a test can set a
tiny cap and exercise the drop path by out-pacing delivery — enqueue is
microsecond-fast while each delivery is a full round-trip — instead of
queueing 16384 events. New testQueueCapDrops confirms some events are
delivered and some dropped.

Raises RPCSub.cpp patch coverage 62% -> 87% (overall diff 65% -> 87%);
the only remaining uncovered lines are the defensive sendThread
exception catches.
Two issues with sendThread, both surfaced by an independent review:

UAF: the sending job captured a raw `this`. An admin Unsubscribe runs
NetworkOPsImp::tryRemoveRpcSub, which erases the sub from mRpcSubMap
under mSubLock and destroys it inline — while sendThread may be running
on a job-queue worker under the *unrelated* InfoSub::mLock. The blocking
io_service.run() (up to the 30s request deadline for a hung endpoint)
makes the window large, and a hung endpoint is exactly this fix's
scenario. RPCSubImp now derives enable_shared_from_this and the job
captures weak_from_this(), re-locking it on entry so the sub stays alive
for the batch; once it's gone the job is a no-op.

Starvation: sendThread drained the entire backlog (up to 16384 events,
512 batches x up to 30s) in one job, blocking a worker thread the whole
time. With a small pool shared with consensus/ledger/RPC, a few hung
subscribers could occupy most workers. sendThread now processes ONE
batch per job and re-queues if more remain, so the worker is released
between batches and other jobs interleave.
RPCCall::fromNetwork hardcoded a 256MB read budget for responses with no
Content-Length. Webhook deliveries (method "event") ignore the response
body, but each in-flight delivery still pre-allocated a ~256MB streambuf
on the EOF path — up to 32 (maxInFlight) x 256MB = 8GB transient, on
exactly the no-Content-Length error shape this branch targets. Cap the
"event" method at 1MB; genuine RPC replies (CLI) keep the 256MB budget.
The test never set noContentLength, so the mock sent Content-Length and
completion happened in handleHeader, not the handleData EOF branch the
test targets — it passed regardless of the fix. Force the no-Content-
Length path so it covers what it claims.
Minor: the MockHTTPServer destructor closed heldSockets_ from the main
thread while the server io_service thread could still be running. The
held sockets have no pending async op so it was benign, but move the
close to after the thread is joined to avoid the cross-thread access
entirely (no lock needed once single-threaded).
@codecov

codecov Bot commented Jun 10, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 83.05085% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.4%. Comparing base (c55420b) to head (1518121).

Files with missing lines Patch % Lines
src/xrpld/net/detail/RPCSub.cpp 76.7% 10 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##             dev    #677     +/-   ##
=======================================
+ Coverage   77.3%   77.4%   +0.1%     
=======================================
  Files        837     837             
  Lines      78575   78605     +30     
  Branches   11551   11506     -45     
=======================================
+ Hits       60748   60832     +84     
+ Misses     17817   17763     -54     
  Partials      10      10             
Files with missing lines Coverage Δ
src/xrpld/net/detail/HTTPClient.cpp 84.7% <100.0%> (+16.5%) ⬆️
src/xrpld/net/detail/RPCCall.cpp 86.7% <100.0%> (+0.1%) ⬆️
src/xrpld/rpc/handlers/Subscribe.cpp 90.9% <ø> (-<0.1%) ⬇️
src/xrpld/net/detail/RPCSub.cpp 82.7% <76.7%> (+38.7%) ⬆️

... and 12 files with indirect coverage changes

Impacted file tree graph

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

RPCSub_test.cpp includes xrpld/core/Job.h and JobQueue.h (for
jtCLIENT_SUBSCRIBE and getJobCountTotal), adding a test.net > xrpld.core
edge that wasn't regenerated when those includes were added.
With per-batch re-queue, sendThread is only ever started with a
non-empty deque (send() enqueues before starting a job; the re-queue
fires only when the deque is non-empty), so the dispatched==0 early
return was unreachable. Guard the dispatch JLOG/run() with dispatched>0
instead, which falls through to clear mSending if a batch is ever empty.
Removes an uncovered branch.
Add a mock mode that sends a header promising more body than it
delivers then holds the socket, so the client blocks in handleData and
hits its deadline — driving the read-error branch (mShutdown != eof)
that no existing test exercised.
Codex review flagged the drop test as timing-dependent: a very fast
endpoint could let sendThread drain/restart between send() calls, so the
deque might not stay full and no drop would occur. Add a per-reply delay
to the mock endpoint so delivery is reliably slower than the
microsecond-fast enqueue loop — the deque stays at the cap and drops are
deterministic (received is now consistently 2/50).
@sublimator sublimator force-pushed the subscription-hooks-fix branch from eaa2d1b to 1518121 Compare June 15, 2026 06:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants