Skip to content

fix: close XML file on parse-error paths to avoid FILE* leak#2247

Open
EylonKrause wants to merge 1 commit into
NVIDIA:masterfrom
EylonKrause:fix/xml-fd-leak
Open

fix: close XML file on parse-error paths to avoid FILE* leak#2247
EylonKrause wants to merge 1 commit into
NVIDIA:masterfrom
EylonKrause:fix/xml-fd-leak

Conversation

@EylonKrause

Copy link
Copy Markdown

Description

ncclTopoGetXmlFromFile, ncclTopoGetXmlGraphFromFile, and ncclTopoDumpXmlToFile (src/graph/xml.cc) only call fclose() on the success path. Because NCCLCHECK(...) returns immediately on a non-success result, an error from xmlLoadSub() / ncclTopoDumpXmlRec() returns from the function before the fclose(), leaking the FILE* (and its underlying file descriptor).

This triggers whenever NCCL_TOPO_FILE or NCCL_GRAPH_FILE points at a file that fails to parse — e.g. a wrong version attribute, an unterminated/mismatched tag, or an over-long value/name. Each affected ncclCommInitRank then leaks one descriptor.

Fix: capture the result, fclose() unconditionally, then propagate via NCCLCHECK — mirroring the exit:-label cleanup discipline already used elsewhere in this file (e.g. ncclTopoGetXmlFromSys).

Related Issues

None.

Changes & Impact

  • src/graph/xml.cc: three call sites (two read paths + the dump path). The only behavioral change is that the file handle is released on the error path. No API change; non-breaking.

Performance Impact

None — this is error-path-only resource cleanup.

Testing

  • Builds clean with make src.build (no new warnings).
  • Before/after repro with a malformed topology file (<system version="2"></system> via NCCL_TOPO_FILE), looping ncclCommInitRank and counting open fds pointing at that file:
    • Before: open fds to the topo file grow 1, 2, 3, 4, 5, 6 across six failed inits (leak).
    • After: stays at 0 on every iteration (closed).

ncclTopoGetXmlFromFile, ncclTopoGetXmlGraphFromFile and
ncclTopoDumpXmlToFile only called fclose() on the success path; any
error from xmlLoadSub/ncclTopoDumpXmlRec returned via NCCLCHECK before
the fclose, leaking the FILE* (and its fd) on every malformed
NCCL_TOPO_FILE/NCCL_GRAPH_FILE. Close the file first, then propagate.

Signed-off-by: EylonKrause <eylon1909@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant