Fix CLOSE_WAIT socket leak in monitor's close_mysql() when pvio is NULL#5854
Fix CLOSE_WAIT socket leak in monitor's close_mysql() when pvio is NULL#5854atifceylan wants to merge 2 commits into
Conversation
📝 WalkthroughWalkthroughThis PR closes and invalidates socket file descriptors in two close paths when ChangesClose-path FD leak prevention
Estimated code review effort: 2 (Simple) | ~10 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces changes to prevent CLOSE_WAIT leaks in both lib/MySQL_Monitor.cpp and lib/mysql_connection.cpp by directly closing the file descriptor when the pvio structure has already been cleared. However, a critical bug was identified in lib/MySQL_Monitor.cpp where the else if condition lacks the !my->net.pvio check. This omission can trigger a double-close of the file descriptor when SSL is enabled, potentially causing silent file descriptor corruption in multi-threaded environments. It is recommended to update the condition to match the correct implementation in lib/mysql_connection.cpp.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| #endif | ||
| fd+=wb; // dummy, to make compiler happy | ||
| fd-=wb; // dummy, to make compiler happy | ||
| } else if (my->net.fd > 0) { |
There was a problem hiding this comment.
In lib/MySQL_Monitor.cpp, the else if condition lacks the !my->net.pvio check. If my->net.pvio is valid but my->options.use_ssl is true, the if condition if (my->net.pvio && !my->options.use_ssl) evaluates to false, and the code falls into this else if block.
This causes the file descriptor to be closed directly via close(my->net.fd) while my->net.pvio is still active. Subsequently, mysql_close_no_command(my) is called, which triggers end_server(my) -> vio_delete(my->net.pvio). This leads to a double-close of the file descriptor, which is a critical bug in multi-threaded programs as it can silently close a newly opened file descriptor belonging to another thread.
Please update the condition to match the correct implementation in lib/mysql_connection.cpp by adding && !my->net.pvio.
} else if (my->net.fd > 0 && !my->net.pvio) {There was a problem hiding this comment.
Changed fd > 0 to fd != INVALID_SOCKET in both paths (covers fd 0 edge case, per CodeRabbit)
|
Thank you! |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@lib/MySQL_Monitor.cpp`:
- Around line 208-211: The condition that checks socket validity uses
"my->net.fd > 0" which misses descriptor 0; change the branch to compare against
INVALID_SOCKET (e.g., "if (my->net.fd != INVALID_SOCKET)") so the fallback path
that closes the FD runs for FD 0 as well, then call close(my->net.fd) and set
my->net.fd = INVALID_SOCKET (keeping the existing comment about pvio already
cleared).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: cfadfdb1-b57b-414c-a858-4eaf5d4c4b76
📒 Files selected for processing (2)
lib/MySQL_Monitor.cpplib/mysql_connection.cpp
📜 Review details
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.{cpp,h,hpp}: Feature tiers are controlled via flags:PROXYSQL31=1for v3.1.x features (FFTO, TSDB),PROXYSQL40=1for v4.0.x features (plugin loader).PROXYSQL40=1implies bothPROXYSQL31=1andPROXYSQLFFTO=1andPROXYSQLTSDB=1. Use conditional compilation with#ifdef PROXYSQL31,#ifdef PROXYSQL40,#ifdef PROXYSQLFFTO,#ifdef PROXYSQLTSDB,#ifdef PROXYSQLCLICKHOUSE.
Class names usePascalCasewith protocol prefixes:MySQL_,PgSQL_, orProxySQL_(e.g.,MySQL_Protocol,PgSQL_Session).
Member variables usesnake_case.
Constants and macros useUPPER_SNAKE_CASE.
Use C++17; conditional compilation for feature tiers via#ifdef PROXYSQL31,#ifdef PROXYSQL40,#ifdef PROXYSQLFFTO,#ifdef PROXYSQLTSDB,#ifdef PROXYSQLCLICKHOUSE.
Use RAII for resource management; use jemalloc for memory allocation.
Use pthread mutexes for synchronization; usestd::atomic<>for counters.
Files:
lib/mysql_connection.cpplib/MySQL_Monitor.cpp
{lib,src}/**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (CLAUDE.md)
GenAI/MCP/RAG/LLM features live entirely in
plugins/genai/and load as a.soat runtime viadlopen. Do not guard withPROXYSQLGENAIin core code — that flag no longer guards any core code as of Step 7 of the GenAI plugin carve-out.
Files:
lib/mysql_connection.cpplib/MySQL_Monitor.cpp
🔇 Additional comments (1)
lib/mysql_connection.cpp (1)
3022-3026: LGTM!
2beb829 to
4c445dc
Compare
Thanks! I'm working on the regression test. Taking |
When a backend server is removed (e.g. an AWS Aurora read replica is deleted) the remote end sends a TCP FIN. The MariaDB connector processes this FIN during the next async I/O operation and clears net.pvio before ProxySQL gets a chance to call close_mysql(). close_mysql() only sends COM_QUIT (which triggers end_server() -> close(fd) inside mysql_close_no_command()) when net.pvio != NULL. With pvio already NULL, mysql_close_no_command() skips end_server() entirely, so the kernel-level TCP socket is never closed. The socket remains in CLOSE_WAIT indefinitely. The monitor threads (ping, read_only, connect) retry every mysql-monitor_ping_interval ms. Each failed retry creates a new connection attempt; when that attempt also fails with pvio=NULL the same path is taken, leaving one more CLOSE_WAIT socket behind. Over time the count grows steadily until proxysql is restarted, at which point all file descriptors are released. Fix: in close_mysql(), add an else-if branch that detects the pvio=NULL / net.fd>0 state and explicitly calls shutdown(SHUT_RDWR) + close(fd) before delegating to mysql_close_no_command(). This mirrors what end_server() would have done and guarantees the kernel socket is released regardless of whether the MariaDB connector has already cleared pvio. Reproducer: 1. Configure ProxySQL with AWS Aurora MySQL (writer + 2 read replicas). 2. Monitor socket count: watch -n5 'grep "^TCP:" /proc/net/sockstat' 3. Delete one read replica from the Aurora console. 4. Observe TCP socket count growing by ~1 per monitor_ping_interval. 5. proxysql restart -> count returns to baseline. After this fix the count stays flat after replica deletion. Related issues: sysown#2908, sysown#3022
4c445dc to
1c49c8d
Compare
Adds reg_test_5854_close_wait_leak-t.cpp covering the pvio=NULL / net.fd>0 state that the MariaDB connector reaches when a backend sends a TCP FIN during async I/O. - test_unfixed_path_leaks_fd: mysql_close_no_command() skips end_server() when pvio=NULL, so the fd is never released (fd count stays above baseline) — demonstrates the bug. - test_fixed_path_closes_fd: the explicit close(fd) added in close_mysql() releases the fd (count returns to baseline) — demonstrates the fix. - test_no_close_wait_growth (Linux only): repeated pvio=NULL closes do not accumulate CLOSE_WAIT sockets. Connects directly over TCP (MYSQL_OPT_PROTOCOL) and reads /proc/self/fd and /proc/net/tcp for measurement; fd/CLOSE_WAIT checks are guarded so the test skips cleanly on non-Linux hosts. Related issues: sysown#2908, sysown#3022
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
test/tap/tests/reg_test_5854_close_wait_leak-t.cpp (1)
1-232: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick winRegister this TAP test in
test/tap/groups/groups.json.
reg_test_5854_close_wait_leak-t.cppis not listed, so it won’t be picked up by the TAP test groups.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/tap/tests/reg_test_5854_close_wait_leak-t.cpp` around lines 1 - 232, The new TAP regression test is implemented in reg_test_5854_close_wait_leak-t.cpp but is not included in the TAP group registry, so it will not run. Add this test to the appropriate entry in groups.json so the TAP harness discovers it, using the reg_test_5854_close_wait_leak-t.cpp filename as the identifier to locate it.Source: Coding guidelines
🧹 Nitpick comments (2)
test/tap/tests/reg_test_5854_close_wait_leak-t.cpp (2)
186-213: 🩺 Stability & Availability | 🔵 Trivial | 💤 Low valueFixed
usleep(100ms)per iteration is a flakiness risk on loaded CI hosts.The reclaim delay is a magic constant with only a
+1slack for CLOSE_WAIT growth. On a busy/throttled CI runner, kernel socket teardown could still be pending after 100ms, causing sporadic false failures.♻️ Possible refactor: poll with a timeout instead of a fixed sleep
- mysql_close_no_command(my); - usleep(100000); // 100 ms — let kernel reclaim + mysql_close_no_command(my); } + // Poll for CLOSE_WAIT count to settle instead of a single fixed sleep + for (int waited = 0; waited < 2000; waited += 100) { + if (count_close_wait() <= cw_before + 1) break; + usleep(100000); + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/tap/tests/reg_test_5854_close_wait_leak-t.cpp` around lines 186 - 213, The fixed 100ms usleep in the close-wait leak test is too brittle for loaded CI runners. Update the loop in reg_test_5854_close_wait_leak-t.cpp to poll count_close_wait() with a bounded timeout until the CLOSE_WAIT count stabilizes or a limit is reached, and use that result before asserting. Keep the existing mysql_close_no_command and make_pvio_null_conn flow, but replace the magic sleep with a retry/wait helper so the test in the fixed-path block is resilient to slow kernel teardown.
139-177: 📐 Maintainability & Code Quality | 🔵 Trivial | 🏗️ Heavy liftTest reimplements the fix rather than exercising the actual
close_mysql()code paths.Tests 2 and 3 manually duplicate the
else-if (pvio==NULL && fd>0) { close(fd); fd=-1; }logic (lines 164-167, 198-201) instead of invoking the actualclose_mysql()inlib/MySQL_Monitor.cpporMySQL_Connection::close_mysql()inlib/mysql_connection.cpp. This validates the general strategy but doesn't guard against regressions if those specific implementations diverge (e.g., wrong field checked, ordering, an accidental omission) from what's duplicated here.Given
close_mysql()in MySQL_Monitor.cpp isstaticandMySQL_Connection::close_mysql()needs a full session/monitor context, directly invoking them from a standalone TAP test may not be straightforward — but consider whether either function could be made testable (e.g., non-static, or a minimalMySQL_Connectionfixture) in a follow-up so the regression test exercises the real code instead of a re-implementation.Also applies to: 180-214
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/tap/tests/reg_test_5854_close_wait_leak-t.cpp` around lines 139 - 177, The test currently duplicates the fixed close logic instead of exercising the real implementation, so update the TAP coverage to call the actual close path used by the product code. In test_fixed_path_closes_fd and the corresponding leak test, prefer invoking close_mysql() from MySQL_Monitor.cpp or MySQL_Connection::close_mysql() from lib/mysql_connection.cpp through a minimal testable fixture/context rather than manually closing fd and clearing net.fd. If those symbols are not directly accessible, make the relevant close path testable in a follow-up so this regression test verifies the real code path instead of a reimplementation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@test/tap/tests/reg_test_5854_close_wait_leak-t.cpp`:
- Around line 1-232: The new TAP regression test is implemented in
reg_test_5854_close_wait_leak-t.cpp but is not included in the TAP group
registry, so it will not run. Add this test to the appropriate entry in
groups.json so the TAP harness discovers it, using the
reg_test_5854_close_wait_leak-t.cpp filename as the identifier to locate it.
---
Nitpick comments:
In `@test/tap/tests/reg_test_5854_close_wait_leak-t.cpp`:
- Around line 186-213: The fixed 100ms usleep in the close-wait leak test is too
brittle for loaded CI runners. Update the loop in
reg_test_5854_close_wait_leak-t.cpp to poll count_close_wait() with a bounded
timeout until the CLOSE_WAIT count stabilizes or a limit is reached, and use
that result before asserting. Keep the existing mysql_close_no_command and
make_pvio_null_conn flow, but replace the magic sleep with a retry/wait helper
so the test in the fixed-path block is resilient to slow kernel teardown.
- Around line 139-177: The test currently duplicates the fixed close logic
instead of exercising the real implementation, so update the TAP coverage to
call the actual close path used by the product code. In
test_fixed_path_closes_fd and the corresponding leak test, prefer invoking
close_mysql() from MySQL_Monitor.cpp or MySQL_Connection::close_mysql() from
lib/mysql_connection.cpp through a minimal testable fixture/context rather than
manually closing fd and clearing net.fd. If those symbols are not directly
accessible, make the relevant close path testable in a follow-up so this
regression test verifies the real code path instead of a reimplementation.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a1144bcb-9048-4374-844a-560392af2ab4
📒 Files selected for processing (1)
test/tap/tests/reg_test_5854_close_wait_leak-t.cpp
📜 Review details
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.{cpp,h,hpp}: Feature tiers are controlled via flags:PROXYSQL31=1for v3.1.x features (FFTO, TSDB),PROXYSQL40=1for v4.0.x features (plugin loader).PROXYSQL40=1implies bothPROXYSQL31=1andPROXYSQLFFTO=1andPROXYSQLTSDB=1. Use conditional compilation with#ifdef PROXYSQL31,#ifdef PROXYSQL40,#ifdef PROXYSQLFFTO,#ifdef PROXYSQLTSDB,#ifdef PROXYSQLCLICKHOUSE.
Class names usePascalCasewith protocol prefixes:MySQL_,PgSQL_, orProxySQL_(e.g.,MySQL_Protocol,PgSQL_Session).
Member variables usesnake_case.
Constants and macros useUPPER_SNAKE_CASE.
Use C++17; conditional compilation for feature tiers via#ifdef PROXYSQL31,#ifdef PROXYSQL40,#ifdef PROXYSQLFFTO,#ifdef PROXYSQLTSDB,#ifdef PROXYSQLCLICKHOUSE.
Use RAII for resource management; use jemalloc for memory allocation.
Use pthread mutexes for synchronization; usestd::atomic<>for counters.
Files:
test/tap/tests/reg_test_5854_close_wait_leak-t.cpp
test/tap/tests/**/*.cpp
📄 CodeRabbit inference engine (CLAUDE.md)
Test files follow naming pattern
test_*.cppor*-t.cppintest/tap/tests/. Test binaries are built via pattern rulemake <testname>-twhich compiles<testname>-t.cppinto<testname>-t. Register new tests ingroups.json.
Files:
test/tap/tests/reg_test_5854_close_wait_leak-t.cpp
🧠 Learnings (1)
📚 Learning: 2026-01-20T09:34:19.124Z
Learnt from: yuji-hatakeyama
Repo: sysown/proxysql PR: 5307
File: test/tap/tests/reg_test_5306-show_warnings_with_comment-t.cpp:39-48
Timestamp: 2026-01-20T09:34:19.124Z
Learning: In ProxySQL's TAP test suite, resource leaks (e.g., not calling mysql_close() on early return paths) are commonly tolerated because test processes are short-lived and OS frees resources on exit. This pattern applies to all C++ test files under test/tap/tests. When reviewing, recognize this as a project-wide test convention and focus on test correctness and isolation rather than insisting on fixing such leaks in these test files.
Applied to files:
test/tap/tests/reg_test_5854_close_wait_leak-t.cpp
🔇 Additional comments (2)
test/tap/tests/reg_test_5854_close_wait_leak-t.cpp (2)
34-47: LGTM!Also applies to: 52-68
128-128: 🎯 Functional Correctness
mysql_close_no_commandis available to this TAP test via<mysql.h>.
No change needed.
Regression testTest: The test reproduces the
fd/CLOSE_WAIT measurement uses |
|



When a backend server is removed (e.g. an AWS Aurora read replica is deleted) the remote end sends a TCP FIN. The MariaDB connector processes this FIN during the next async I/O operation and clears net.pvio before ProxySQL gets a chance to call close_mysql().
close_mysql() only sends COM_QUIT (which triggers end_server() -> close(fd) inside mysql_close_no_command()) when net.pvio != NULL. With pvio already NULL, mysql_close_no_command() skips end_server() entirely, so the kernel-level TCP socket is never closed. The socket remains in CLOSE_WAIT indefinitely.
The monitor threads (ping, read_only, connect) retry every mysql-monitor_ping_interval ms. Each failed retry creates a new connection attempt; when that attempt also fails with pvio=NULL the same path is taken, leaving one more CLOSE_WAIT socket behind. Over time the count grows steadily until proxysql is restarted, at which point all file descriptors are released.
Fix: in close_mysql(), add an else-if branch that detects the pvio=NULL / net.fd>0 state and explicitly calls shutdown(SHUT_RDWR) + close(fd) before delegating to mysql_close_no_command(). This mirrors what end_server() would have done and guarantees the kernel socket is released regardless of whether the MariaDB connector has already cleared pvio.
Reproducer:
After this fix the count stays flat after replica deletion.
Changes in v2
TAP regression test is in progress.
Related issues: #2908, #3022
Summary by CodeRabbit