fix: run tokenizer.encode in executor to avoid blocking event loop#1258
Conversation
tokenizer.encode is CPU-bound and blocks the event loop when called synchronously, stalling all concurrent I/O (streaming, new requests). Move it to run_in_executor so the event loop stays responsive.

Test results (Qwen3.5-35B-A3B, 22k tokens):

[sync] encode: 41.7ms | heartbeat ticks: 0 (loop fully blocked)
[executor] encode: 40.2ms | heartbeat ticks: 36 | max gap: 2.1ms
Code Review
This pull request updates the HTTP server manager to execute tokenizer encoding within a thread pool executor, preventing synchronous operations from blocking the asyncio event loop. It also introduces a new test script to verify the non-blocking behavior. Feedback includes a recommendation to use asyncio.get_running_loop() for better consistency and a suggestion to avoid hardcoded local paths in the test suite to ensure portability.
```diff
     prompt_ids = await asyncio.get_event_loop().run_in_executor(
         None,
         lambda: self.tokenizer.encode(
             prompt, multimodal_params, add_special_tokens=sampling_params.add_special_tokens
         ),
     )
 else:
-    prompt_ids = self.tokenizer.encode(prompt, add_special_tokens=sampling_params.add_special_tokens)
+    prompt_ids = await asyncio.get_event_loop().run_in_executor(
+        None,
+        lambda: self.tokenizer.encode(prompt, add_special_tokens=sampling_params.add_special_tokens),
+    )
```
Using asyncio.get_running_loop() is preferred over asyncio.get_event_loop() inside coroutines. get_running_loop() is more explicit and ensures that an error is raised if no loop is currently running, which helps in catching bugs during development. Additionally, it avoids potential deprecation warnings in newer Python versions.
Suggested change:

```diff
-prompt_ids = await asyncio.get_event_loop().run_in_executor(
-    None,
-    lambda: self.tokenizer.encode(
-        prompt, multimodal_params, add_special_tokens=sampling_params.add_special_tokens
-    ),
-)
-else:
-    prompt_ids = self.tokenizer.encode(prompt, add_special_tokens=sampling_params.add_special_tokens)
-    prompt_ids = await asyncio.get_event_loop().run_in_executor(
-        None,
-        lambda: self.tokenizer.encode(prompt, add_special_tokens=sampling_params.add_special_tokens),
-    )
+prompt_ids = await asyncio.get_running_loop().run_in_executor(
+    None,
+    lambda: self.tokenizer.encode(
+        prompt, multimodal_params, add_special_tokens=sampling_params.add_special_tokens
+    ),
+)
+else:
+    prompt_ids = await asyncio.get_running_loop().run_in_executor(
+        None,
+        lambda: self.tokenizer.encode(prompt, add_special_tokens=sampling_params.add_special_tokens),
+    )
```
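The behavioral difference the reviewer describes can be seen outside a coroutine; a minimal standalone sketch (not part of the PR, `len` stands in for any offloaded CPU-bound call):

```python
import asyncio

# With no loop running, get_running_loop() raises immediately,
# surfacing misuse instead of silently fetching or creating a loop.
try:
    asyncio.get_running_loop()
    raised = False
except RuntimeError:
    raised = True
assert raised


async def main():
    # Inside a coroutine, get_running_loop() returns the active loop,
    # which run_in_executor can use to offload CPU-bound work.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, len, "some prompt")


assert asyncio.run(main()) == 11
```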
```python
import statistics
from transformers import AutoTokenizer


MODEL_DIR = "/nvme/models/Qwen3.5-35B-A3B"
```
The MODEL_DIR is hardcoded to a specific local path (/nvme/models/Qwen3.5-35B-A3B). This makes the test non-portable and likely to fail in CI environments or on other developers' machines. Consider using a small, publicly available model identifier (e.g., "Qwen/Qwen2.5-0.5B") or allowing the path to be set via an environment variable.
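A portable version along the reviewer's suggestion might look like the following sketch; the environment-variable name is illustrative, not taken from the PR, and the fallback is the small public model the reviewer mentions:

```python
import os

# Resolve the tokenizer path from the environment when provided,
# otherwise fall back to a small public checkpoint so the test
# also runs in CI and on other developers' machines.
# The variable name TEST_MODEL_DIR is illustrative, not from the PR.
MODEL_DIR = os.environ.get("TEST_MODEL_DIR", "Qwen/Qwen2.5-0.5B")
```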
Summary
- tokenizer.encode in httpserver/manager.py::_encode() is CPU-bound and was called synchronously on the asyncio event loop, blocking all concurrent I/O (streaming output, new request acceptance) during tokenization
- Moved it to run_in_executor so the event loop stays responsive under concurrent long-text requests

Test results (Qwen3.5-35B-A3B, 22k tokens)
A heartbeat coroutine ticks every 1ms to detect event loop blocking:
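The mechanism can be reproduced standalone; the following is a minimal sketch, not the PR's test script — a busy-loop stands in for tokenizer.encode, and the 1ms tick mirrors the heartbeat described above:

```python
import asyncio
import time


def cpu_bound_work():
    # Stand-in for tokenizer.encode: ~50ms of pure CPU on the calling thread.
    end = time.perf_counter() + 0.05
    while time.perf_counter() < end:
        pass


async def heartbeat(ticks, stop):
    # Tick every 1ms; while the event loop is blocked, no ticks accumulate.
    while not stop.is_set():
        ticks.append(time.perf_counter())
        await asyncio.sleep(0.001)


async def measure(use_executor):
    ticks, stop = [], asyncio.Event()
    hb = asyncio.create_task(heartbeat(ticks, stop))
    if use_executor:
        # Offload to a thread: the loop keeps driving the heartbeat.
        await asyncio.get_running_loop().run_in_executor(None, cpu_bound_work)
    else:
        # Run inline: the loop is blocked for the whole duration,
        # so the heartbeat task never gets a chance to run.
        cpu_bound_work()
    stop.set()
    await hb
    return len(ticks)


sync_ticks = asyncio.run(measure(False))
executor_ticks = asyncio.run(measure(True))
print(f"sync ticks: {sync_ticks}, executor ticks: {executor_ticks}")
```

In the synchronous case the heartbeat records zero ticks, matching the "loop fully blocked" result above; with the executor the heartbeat keeps firing throughout.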
Test plan
Run python test/test_tokenizer_blocking.py — verifies the event loop is not blocked during tokenization