Pull request overview
This PR addresses issues with session closing logic in the lmdeploy session manager and includes additional CMake build configuration improvements. The main focus is on fixing how sessions are closed to prevent premature cancellation and ensure proper resource cleanup when the main program exits.
Changes:
- Enhanced the Session.close() method to wait for async operations to complete with a 5-second timeout
- Added early exit optimization for sessions that haven't processed any requests yet
- Improved error logging in async_close() to use logger.exception()
- Wrapped test executables in BUILD_TEST conditionals in two CMakeLists.txt files
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| lmdeploy/serve/managers/session_manager.py | Main fix: Added synchronization to close() method to wait for async operations, early return for unused sessions, and improved error logging |
| src/turbomind/kernels/CMakeLists.txt | Wrapped test_quantization executable in BUILD_TEST conditional |
| src/turbomind/comm/gloo/CMakeLists.txt | Wrapped test_ipc_comm executable in BUILD_TEST conditional |
Comments suppressed due to low confidence (1)
lmdeploy/serve/managers/session_manager.py:128
- The abort() method doesn't wait for the async operation to complete, unlike the new close() implementation. This inconsistency could lead to race conditions where abort() returns before the async_abort() coroutine finishes executing. While the async_abort() method has a comment indicating "DO NOT reset the session here because it might be used by other components," the lack of synchronization in abort() could still cause issues if the caller expects the abort operation to complete before proceeding.
Consider applying similar synchronization logic to abort() as was added to close(), or document why abort() doesn't need to wait for completion while close() does.
    def abort(self):
        """Abort the session in sync mode."""
        self._run(self.async_abort())
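The synchronization suggested above could look roughly like the following. This is a minimal sketch, not the lmdeploy implementation: `BackgroundLoop`, `run_and_wait`, and `DemoSession` are hypothetical names, and the 5-second default only mirrors the timeout mentioned in the overview. It assumes the coroutine runs on an event loop owned by a separate thread.

```python
import asyncio
import threading
from concurrent.futures import TimeoutError as FutureTimeoutError


class BackgroundLoop:
    """Hypothetical helper: owns an event loop running in a daemon thread."""

    def __init__(self):
        self.loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self.loop.run_forever, daemon=True)
        self._thread.start()

    def run_and_wait(self, coro, timeout=5.0):
        """Schedule coro on the loop and block until it finishes or times out."""
        future = asyncio.run_coroutine_threadsafe(coro, self.loop)
        try:
            return future.result(timeout=timeout)
        except FutureTimeoutError:
            # Give up waiting and ask the loop to cancel the pending task.
            future.cancel()
            return None

    def stop(self):
        """Stop the loop, wait for the thread to exit, then release the loop."""
        self.loop.call_soon_threadsafe(self.loop.stop)
        self._thread.join()
        self.loop.close()


class DemoSession:
    """Hypothetical session whose abort() waits for async_abort() to finish."""

    def __init__(self, runner):
        self._runner = runner
        self.aborted = False

    async def async_abort(self):
        await asyncio.sleep(0.01)  # stands in for the real engine call
        self.aborted = True

    def abort(self):
        """Abort in sync mode, returning only after the coroutine completes."""
        self._runner.run_and_wait(self.async_abort())
```

With this shape, a caller that reads `session.aborted` right after `session.abort()` sees the updated state, which is the guarantee the comment asks for. Whether abort() should block at all is a design decision for the maintainers.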
src/turbomind/kernels/CMakeLists.txt

    if (BUILD_TEST)
        add_executable(test_quantization test_quantization.cc gemm/test/test_utils.cu)
        target_link_libraries(test_quantization PRIVATE quantization_kernels core)
    endif ()
The changes to this CMakeLists.txt file (wrapping test executable in BUILD_TEST conditional) are unrelated to fixing session closing logic mentioned in the PR title. While the changes themselves are correct and follow the established pattern in the codebase (see src/turbomind/core/CMakeLists.txt:24, src/turbomind/comm/CMakeLists.txt:28), they appear to be a separate refactoring that should ideally be in a different PR or the PR description should mention both types of changes.
Consider updating the PR description to explain that this PR includes both a fix for session closing logic and a refactoring to conditionally build test executables.
src/turbomind/comm/gloo/CMakeLists.txt

    if (BUILD_TEST)
        add_executable(test_ipc_comm test_ipc_comm.cc)
        target_link_libraries(test_ipc_comm PRIVATE gloo_comm Threads::Threads)
    endif ()
The changes to this CMakeLists.txt file (wrapping test executable in BUILD_TEST conditional) are unrelated to fixing session closing logic mentioned in the PR title. While the changes themselves are correct and follow the established pattern in the codebase (see src/turbomind/core/CMakeLists.txt:24, src/turbomind/comm/CMakeLists.txt:28), they appear to be a separate refactoring that should ideally be in a different PR or the PR description should mention both types of changes.
Consider updating the PR description to explain that this PR includes both a fix for session closing logic and a refactoring to conditionally build test executables.
lmdeploy/serve/managers/session_manager.py

    """End the session."""
    logger.info(f'[session] Ending session {self.session_id}')
    if self._handle is None and self.step == 0:
        logger.info(f'[session] Closing session {self.session_id} before first request')
The early return in async_close() bypasses the reset() call at line 124, which can lead to incomplete session cleanup. When a session is closed before the first request, the session state (prompt, response, history, etc.) will not be reset, and importantly, _session_mgr will not be set to None. This could cause resource leaks if the session object is retained after closing.
Consider calling self.reset() before the early return to ensure consistent cleanup behavior regardless of when the session is closed.
Suggested change:

        logger.info(f'[session] Closing session {self.session_id} before first request')
        self.reset()
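To illustrate the suggestion, here is a minimal sketch of an async_close() whose early-return path still performs cleanup. It assumes a simplified session: `DemoSession` and `FakeHandle` are hypothetical stand-ins, and only the attribute names `_handle`, `step`, `_session_mgr`, and the methods `reset()` and `async_end()` come from the review text. The real reset() may clear more state than shown here.

```python
import asyncio


class FakeHandle:
    """Hypothetical stand-in for the engine handle held by a session."""

    def __init__(self):
        self.ended = []

    async def async_end(self, session_id):
        self.ended.append(session_id)


class DemoSession:
    """Hypothetical, simplified session used only to show the cleanup order."""

    def __init__(self, session_id, session_mgr=None):
        self.session_id = session_id
        self.step = 0
        self.prompt = None
        self._handle = None
        self._session_mgr = session_mgr

    def reset(self):
        """Clear per-session state and drop the manager reference."""
        self.step = 0
        self.prompt = None
        self._handle = None
        self._session_mgr = None

    async def async_close(self):
        if self._handle is None and self.step == 0:
            # Closed before the first request: nothing to end on the engine
            # side, but the session state is still reset before returning.
            self.reset()
            return
        handle, self._handle = self._handle, None
        try:
            if handle is not None:
                await handle.async_end(self.session_id)
        finally:
            # Reset on every path so cleanup does not depend on timing.
            self.reset()
```

Both paths end in reset(), so `_session_mgr` is cleared whether or not the session ever served a request.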
lmdeploy/serve/managers/session_manager.py

          await handle.async_end(self.session_id)
      except (Exception, asyncio.CancelledError, GeneratorExit) as e:
    -     logger.error(f'[async_end] exception caught: {e}')
    +     logger.exception(f'[async_end] exception caught: {type(e).__name__}: {e!r}')
The logger.exception() call automatically includes the exception type, message, and traceback. Adding the exception details manually in the message string creates redundant information in the logs. The pattern used elsewhere in the codebase (e.g., lmdeploy/metrics/metrics_processor.py:81, lmdeploy/pytorch/engine/engine.py:467) is to provide a descriptive message and let logger.exception() handle the exception details.
Consider simplifying to: logger.exception('[async_end] exception caught') to follow the codebase convention and avoid redundancy.
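A small standalone demonstration of why the shorter message is enough, using only the standard `logging` module. The logger name `demo.async_end` and the raised `ValueError` are illustrative assumptions, not taken from lmdeploy.

```python
import io
import logging

# Capture log output in memory so it can be inspected.
stream = io.StringIO()
logger = logging.getLogger('demo.async_end')  # hypothetical logger name
logger.setLevel(logging.ERROR)
logger.propagate = False
logger.addHandler(logging.StreamHandler(stream))

try:
    raise ValueError('engine unavailable')  # illustrative failure
except Exception:
    # logger.exception logs at ERROR level and appends the active traceback,
    # which already ends with the exception type and message.
    logger.exception('[async_end] exception caught')

output = stream.getvalue()
```

The captured `output` contains the descriptive message followed by the traceback, whose last line names the exception type and its message, so repeating them in the format string adds nothing.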