
Conversation


@codeflash-ai codeflash-ai bot commented Oct 17, 2025

📄 19% (0.19x) speedup for ContextManager.compress_messages in src/utils/context_manager.py

⏱️ Runtime : 12.5 milliseconds → 10.5 milliseconds (best of 109 runs)

📝 Explanation and details

The optimized code achieves a 19% speedup through several key performance optimizations:

1. Eliminated redundant token counting in count_tokens()

  • Replaced explicit loop with sum() generator expression and local variable caching
  • Cached self._count_message_tokens as a local variable to avoid repeated attribute lookups in the hot loop
  • Profiling shows a 40% reduction in time (34.9ms → 20.9ms) for this frequently called method (see the sketch below)
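
A minimal sketch of this pattern, using a simplified stand-in class; the token heuristic below is a placeholder assumption, not the project's actual estimator:

```python
# Illustrative stand-in for the optimization; not the real ContextManager.
class ContextManagerSketch:
    def _count_message_tokens(self, message) -> int:
        # Placeholder estimate: the real estimator also weights message type and kwargs.
        return max(1, len(getattr(message, "content", "") or ""))

    def count_tokens(self, messages) -> int:
        # Bind the bound method once so the hot loop avoids repeated attribute lookups,
        # and let sum() over a generator replace the explicit accumulation loop.
        count_message_tokens = self._count_message_tokens
        return sum(count_message_tokens(message) for message in messages)
```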

2. Avoided duplicate computation in compress_messages()

  • Added token_count = self.count_tokens(messages) to compute once and reuse
  • Previously, self.count_tokens(messages) was called twice: once in is_over_limit() and again in the logging statement
  • This eliminates an expensive recomputation of token counts for the same message list (see the sketch below)
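
Sketched below, assuming the manager exposes token_limit, count_tokens(), and _compress_messages(); compress_messages_once is a hypothetical free function used only to show the "count once, reuse" shape, not the PR's method:

```python
import logging

logger = logging.getLogger(__name__)

def compress_messages_once(manager, state):
    # Hypothetical illustration of computing the token count a single time and reusing it.
    messages = state.get("messages") if isinstance(state, dict) else None
    if not messages or manager.token_limit is None:
        return state
    token_count = manager.count_tokens(messages)  # computed exactly once
    if token_count <= manager.token_limit:
        return state
    # The cached token_count is reused in the log message instead of recounting.
    logger.debug("Compressing %s tokens over the %s-token limit", token_count, manager.token_limit)
    state["messages"] = manager._compress_messages(messages)
    return state
```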

3. Micro-optimizations in _compress_messages()

  • Cached self._count_message_tokens and self._truncate_message_content as local variables
  • Replaced expensive list slicing messages[len(prefix_messages):] with direct indexing using prefix_count
  • Optimized suffix message building with append() followed by a single reverse() instead of repeated list concatenation [item] + list (see the sketch below)
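
A rough illustration of the suffix-building change; build_suffix and its parameters are assumptions made for this sketch and do not mirror the real _compress_messages signature:

```python
def build_suffix(messages, prefix_count, budget, count_message_tokens):
    # Hypothetical helper: walk the tail by index (no messages[prefix_count:] copy) and
    # build the kept suffix with append() plus one reverse() instead of [item] + list.
    suffix = []
    for i in range(len(messages) - 1, prefix_count - 1, -1):
        cost = count_message_tokens(messages[i])
        if cost > budget:
            break
        budget -= cost
        suffix.append(messages[i])  # O(1) append
    suffix.reverse()                # a single reverse restores chronological order
    return suffix
```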

4. Performance characteristics by test case:

  • Small inputs (empty/single messages): 30-32% slower due to optimization overhead
  • Medium inputs (multiple messages): 5-10% slower for simple cases
  • Large-scale inputs (1000+ messages): 50-52% faster where optimizations shine
  • Complex compression scenarios: 0.3-5% faster with reduced redundant operations

The optimizations are most effective for large message lists where token counting dominates runtime, making this ideal for production scenarios with extensive conversation histories.

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 69 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 6 Passed |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import pytest
from langchain_core.messages import (AIMessage, BaseMessage, HumanMessage,
                                     SystemMessage, ToolMessage)
from src.utils.context_manager import ContextManager

# function to test
# (Assume the ContextManager class and compress_messages method are already defined above.)

# Helper function to create a state dict for testing
def make_state(messages):
    return {"messages": messages}

# Helper function to extract message contents for easier assertion
def get_contents(messages):
    return [m.content for m in messages]

# -------------------
# Basic Test Cases
# -------------------

def test_no_compression_needed():
    """Messages fit within token limit, should not be compressed or truncated."""
    cm = ContextManager(token_limit=100)
    msgs = [
        HumanMessage(content="Hello!"),
        AIMessage(content="Hi there!"),
        HumanMessage(content="How are you?"),
    ]
    state = make_state(msgs)
    codeflash_output = cm.compress_messages(state.copy()); result = codeflash_output # 8.10μs -> 8.79μs (7.81% slower)

def test_compression_simple_truncation():
    """Messages exceed token limit, last message should be truncated."""
    cm = ContextManager(token_limit=15)
    msgs = [
        HumanMessage(content="0123456789"),    # 10 tokens
        AIMessage(content="abcdefghij"),       # 10 tokens (will be truncated)
    ]
    state = make_state(msgs)
    codeflash_output = cm.compress_messages(state.copy()); result = codeflash_output # 6.51μs -> 7.07μs (7.86% slower)

def test_preserve_prefix_message_count():
    """Prefix messages should be preserved as much as possible, even if token limit is low."""
    cm = ContextManager(token_limit=12, preserve_prefix_message_count=2)
    msgs = [
        SystemMessage(content="sysmsg"),       # 6 tokens
        HumanMessage(content="humanmsg"),      # 8 tokens (will be truncated)
        AIMessage(content="aimsg"),            # 5 tokens, should be dropped
    ]
    state = make_state(msgs)
    codeflash_output = cm.compress_messages(state.copy()); result = codeflash_output # 7.59μs -> 8.37μs (9.21% slower)


def test_empty_messages():
    """Empty message list should return unchanged."""
    cm = ContextManager(token_limit=10)
    msgs = []
    state = make_state(msgs)
    codeflash_output = cm.compress_messages(state.copy()); result = codeflash_output # 1.48μs -> 2.21μs (32.8% slower)

# -------------------
# Edge Test Cases
# -------------------

def test_token_limit_none_returns_original():
    """If token_limit is None, should return original state."""
    cm = ContextManager(token_limit=None)
    msgs = [HumanMessage(content="abc")]
    state = make_state(msgs)
    codeflash_output = cm.compress_messages(state.copy()); result = codeflash_output # 470μs -> 468μs (0.398% faster)

def test_state_missing_messages_key():
    """If state dict missing 'messages' key, should return original state."""
    cm = ContextManager(token_limit=10)
    state = {"not_messages": []}
    codeflash_output = cm.compress_messages(state.copy()); result = codeflash_output # 393μs -> 397μs (1.04% slower)

def test_message_exact_token_fit():
    """Messages exactly fit token limit, should not be compressed."""
    cm = ContextManager(token_limit=10)
    msg = HumanMessage(content="abcdefghij")  # 10 tokens
    state = make_state([msg])
    codeflash_output = cm.compress_messages(state.copy()); result = codeflash_output # 5.89μs -> 6.41μs (8.05% slower)

def test_message_exceeds_token_limit_by_one():
    """Message exceeds token limit by one, should be truncated by one character."""
    cm = ContextManager(token_limit=9)
    msg = HumanMessage(content="abcdefghij")  # 10 tokens
    state = make_state([msg])
    codeflash_output = cm.compress_messages(state.copy()); result = codeflash_output # 4.78μs -> 5.10μs (6.28% slower)

def test_preserve_prefix_message_count_exceeds_messages():
    """Preserve prefix count greater than number of messages should not error."""
    cm = ContextManager(token_limit=100, preserve_prefix_message_count=10)
    msgs = [HumanMessage(content="msg1"), AIMessage(content="msg2")]
    state = make_state(msgs)
    codeflash_output = cm.compress_messages(state.copy()); result = codeflash_output # 6.61μs -> 7.13μs (7.25% slower)

def test_truncate_message_content_preserves_other_fields():
    """Truncated message should preserve all other attributes except content."""
    cm = ContextManager(token_limit=5)
    msg = AIMessage(content="abcdefghij", additional_kwargs={"foo": "bar"})
    state = make_state([msg])
    codeflash_output = cm.compress_messages(state.copy()); result = codeflash_output # 7.38μs -> 7.99μs (7.62% slower)


def test_message_with_additional_kwargs_tool_calls():
    """Message with additional_kwargs including 'tool_calls' should add extra tokens."""
    cm = ContextManager(token_limit=60)
    msg = AIMessage(content="short", additional_kwargs={"tool_calls": "call"})
    state = make_state([msg])
    # Should not be truncated, as token estimation includes extra tokens
    codeflash_output = cm.compress_messages(state.copy()); result = codeflash_output # 8.79μs -> 9.94μs (11.6% slower)

def test_message_with_large_additional_kwargs():
    """Message with large additional_kwargs should be dropped if token limit is small."""
    cm = ContextManager(token_limit=10)
    msg = AIMessage(content="short", additional_kwargs={"long": "x" * 100})
    state = make_state([msg])
    codeflash_output = cm.compress_messages(state.copy()); result = codeflash_output # 544μs -> 539μs (0.972% faster)

# -------------------
# Large Scale Test Cases
# -------------------

def test_many_messages_compression():
    """Test with many messages, only last messages should be preserved up to token limit."""
    cm = ContextManager(token_limit=100)
    # Each message is 10 tokens, so only 10 messages should fit
    msgs = [HumanMessage(content=f"msg{i:02d}abcdefgh") for i in range(20)]
    state = make_state(msgs)
    codeflash_output = cm.compress_messages(state.copy()); result = codeflash_output # 28.8μs -> 29.1μs (1.16% slower)

def test_large_messages_with_prefix_preservation():
    """Test large messages with prefix preservation, only prefix and last messages should be kept."""
    cm = ContextManager(token_limit=100, preserve_prefix_message_count=3)
    msgs = [SystemMessage(content="sysmsg" * 5), HumanMessage(content="humanmsg" * 5)] + \
           [AIMessage(content=f"aimsg{i}" * 5) for i in range(18)]
    state = make_state(msgs)
    codeflash_output = cm.compress_messages(state.copy()); result = codeflash_output # 625μs -> 594μs (5.36% faster)
    # Total messages should not exceed token limit
    total_tokens = sum([cm._count_message_tokens(m) for m in result["messages"]])

def test_compression_performance_large_scale():
    """Performance: compress_messages should run quickly on 1000 messages."""
    import time
    cm = ContextManager(token_limit=500)
    msgs = [HumanMessage(content="x" * 5) for _ in range(1000)]
    state = make_state(msgs)
    start = time.time()
    codeflash_output = cm.compress_messages(state.copy()); result = codeflash_output # 2.87ms -> 1.89ms (52.1% faster)
    end = time.time()


def test_large_messages_with_truncation():
    """Large messages that require truncation to fit token limit."""
    cm = ContextManager(token_limit=50)
    msgs = [HumanMessage(content="A" * 20), AIMessage(content="B" * 40), SystemMessage(content="C" * 30)]
    state = make_state(msgs)
    codeflash_output = cm.compress_messages(state.copy()); result = codeflash_output # 12.0μs -> 13.1μs (8.79% slower)
    # Should keep as much as possible, possibly truncating last message
    total_content_length = sum(len(m.content) for m in result["messages"])
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import pytest  # used for our unit tests
from langchain_core.messages import (AIMessage, BaseMessage, HumanMessage,
                                     SystemMessage, ToolMessage)
from src.utils.context_manager import ContextManager

# Function to test is defined above: ContextManager.compress_messages

# Helper to create a state dict
def make_state(messages):
    return {"messages": messages}

# Helper to extract message contents for easier comparison
def get_contents(messages):
    return [m.content for m in messages]

# Basic Test Cases

def test_no_token_limit_returns_original():
    """If token_limit is None, should return original state unmodified."""
    cm = ContextManager(token_limit=None)
    orig_state = make_state([HumanMessage(content="hello", type="human")])
    codeflash_output = cm.compress_messages(orig_state.copy()); result = codeflash_output # 443μs -> 442μs (0.201% faster)

def test_messages_under_limit_are_unchanged():
    """If messages are under the token limit, no compression should occur."""
    cm = ContextManager(token_limit=100)
    msgs = [
        SystemMessage(content="sys", type="system"),
        HumanMessage(content="hi", type="human"),
        AIMessage(content="hello", type="ai"),
    ]
    orig_state = make_state(msgs)
    codeflash_output = cm.compress_messages(orig_state.copy()); result = codeflash_output # 9.06μs -> 9.90μs (8.49% slower)

def test_messages_exactly_at_limit():
    """Messages exactly at the token limit should not be compressed."""
    cm = ContextManager(token_limit=6)
    msgs = [
        HumanMessage(content="abc", type="human"),  # 3 tokens (content) + 2 (type) = 5
        AIMessage(content="d", type="ai"),          # 1 (content) + 2 (type) = 3 * 1.2 = 3.6 -> int(3.6)=3
    ]
    # Total: 5 + 3 = 8 > 6, so should be compressed
    orig_state = make_state(msgs)
    codeflash_output = cm.compress_messages(orig_state.copy()); result = codeflash_output # 6.10μs -> 6.38μs (4.42% slower)

def test_preserve_prefix_message_count_preserves_head():
    """Should preserve the specified number of prefix messages, even if it means truncating the last preserved one."""
    cm = ContextManager(token_limit=10, preserve_prefix_message_count=2)
    msgs = [
        SystemMessage(content="sys", type="system"),      # 3+6=9*1.1=9.9->9
        HumanMessage(content="abcdef", type="human"),     # 6+5=11
        AIMessage(content="hello world", type="ai"),      # 11+2=13*1.2=15.6->15
    ]
    orig_state = make_state(msgs)
    codeflash_output = cm.compress_messages(orig_state.copy()); result = codeflash_output # 7.42μs -> 8.40μs (11.7% slower)

def test_messages_are_compressed_from_tail():
    """Should compress messages from the tail, preserving prefix if specified."""
    cm = ContextManager(token_limit=10)
    msgs = [
        HumanMessage(content="first", type="human"),      # 5+5=10
        AIMessage(content="second", type="ai"),           # 6+2=8*1.2=9.6->9
        SystemMessage(content="third", type="system"),    # 5+6=11*1.1=12.1->12
    ]
    orig_state = make_state(msgs)
    codeflash_output = cm.compress_messages(orig_state.copy()); result = codeflash_output # 7.05μs -> 7.83μs (9.94% slower)
    if len(result["messages"]) == 2:
        pass
    else:
        pass

def test_truncate_message_content_preserves_other_fields():
    """When truncating, should only modify content, not type or additional_kwargs."""
    cm = ContextManager(token_limit=2)
    msg = HumanMessage(content="abcdef", type="human", additional_kwargs={"foo": "bar"})
    truncated = cm._truncate_message_content(msg, 2)

# Edge Test Cases

def test_empty_messages_list():
    """Empty messages list should return unchanged."""
    cm = ContextManager(token_limit=10)
    orig_state = make_state([])
    codeflash_output = cm.compress_messages(orig_state.copy()); result = codeflash_output # 1.20μs -> 1.76μs (31.7% slower)

def test_missing_messages_key_in_state():
    """If state dict is missing 'messages', should return original state."""
    cm = ContextManager(token_limit=10)
    orig_state = {"foo": "bar"}
    codeflash_output = cm.compress_messages(orig_state.copy()); result = codeflash_output # 437μs -> 438μs (0.379% slower)

def test_non_dict_state():
    """If state is not a dict, should return it unchanged."""
    cm = ContextManager(token_limit=10)
    orig_state = ["not", "a", "dict"]
    codeflash_output = cm.compress_messages(orig_state); result = codeflash_output # 395μs -> 394μs (0.274% faster)

def test_message_with_empty_content():
    """Messages with empty content should still count at least 1 token."""
    cm = ContextManager(token_limit=1)
    msg = HumanMessage(content="", type="human")
    orig_state = make_state([msg])
    codeflash_output = cm.compress_messages(orig_state.copy()); result = codeflash_output # 4.86μs -> 5.22μs (6.84% slower)


def test_message_with_additional_kwargs_and_tool_calls():
    """Messages with additional_kwargs and tool_calls should increase token count."""
    cm = ContextManager(token_limit=60)
    msg = AIMessage(content="short", type="ai", additional_kwargs={"tool_calls": [{"foo": "bar"}]})
    orig_state = make_state([msg])
    # Should count content + type + extra_str + 50 for tool_calls, then *1.2 for AIMessage
    codeflash_output = cm.compress_messages(orig_state.copy()); result = codeflash_output # 10.3μs -> 11.3μs (9.19% slower)

def test_message_with_long_additional_kwargs():
    """Messages with large additional_kwargs should be truncated if over limit."""
    cm = ContextManager(token_limit=10)
    msg = AIMessage(content="short", type="ai", additional_kwargs={"foo": "x"*100})
    orig_state = make_state([msg])
    codeflash_output = cm.compress_messages(orig_state.copy()); result = codeflash_output # 544μs -> 542μs (0.312% faster)

def test_preserve_prefix_message_count_greater_than_messages():
    """Preserve count greater than messages should not error and preserve all."""
    cm = ContextManager(token_limit=100, preserve_prefix_message_count=10)
    msgs = [
        HumanMessage(content="a", type="human"),
        AIMessage(content="b", type="ai"),
    ]
    orig_state = make_state(msgs)
    codeflash_output = cm.compress_messages(orig_state.copy()); result = codeflash_output # 7.11μs -> 7.67μs (7.29% slower)

def test_token_limit_zero():
    """Zero token limit should return empty message list."""
    cm = ContextManager(token_limit=0)
    msgs = [
        HumanMessage(content="a", type="human"),
        AIMessage(content="b", type="ai"),
    ]
    orig_state = make_state(msgs)
    codeflash_output = cm.compress_messages(orig_state.copy()); result = codeflash_output # 484μs -> 485μs (0.039% slower)

def test_token_limit_one():
    """Token limit one should return only one token's worth of content from last message."""
    cm = ContextManager(token_limit=1)
    msgs = [
        HumanMessage(content="abc", type="human"),
        AIMessage(content="def", type="ai"),
    ]
    orig_state = make_state(msgs)
    codeflash_output = cm.compress_messages(orig_state.copy()); result = codeflash_output # 460μs -> 459μs (0.133% faster)

# Large Scale Test Cases

def test_large_number_of_short_messages():
    """Test compressing a large number of short messages."""
    cm = ContextManager(token_limit=500)
    msgs = [HumanMessage(content=str(i), type="human") for i in range(300)]  # Each content ~1-3 tokens
    orig_state = make_state(msgs)
    codeflash_output = cm.compress_messages(orig_state.copy()); result = codeflash_output # 242μs -> 243μs (0.153% slower)
    # Should fit as many as possible from the tail
    total_tokens = 0
    for m in reversed(msgs):
        t = cm._count_message_tokens(m)
        if total_tokens + t > 500:
            break
        total_tokens += t
    # Should keep last N messages that fit in 500 tokens
    expected = msgs[-(total_tokens // 2):] if total_tokens else []

def test_large_message_content_truncation():
    """Test that a single very large message is truncated to fit token limit."""
    long_content = "x" * 1000
    cm = ContextManager(token_limit=100)
    msg = HumanMessage(content=long_content, type="human")
    orig_state = make_state([msg])
    codeflash_output = cm.compress_messages(orig_state.copy()); result = codeflash_output # 616μs -> 587μs (4.92% faster)


def test_performance_with_many_messages():
    """Ensure function does not crash or hang with near-1000 messages."""
    cm = ContextManager(token_limit=500)
    msgs = [HumanMessage(content="hello", type="human") for _ in range(999)]
    orig_state = make_state(msgs)
    codeflash_output = cm.compress_messages(orig_state.copy()); result = codeflash_output # 2.88ms -> 1.91ms (50.6% faster)

def test_preserve_prefix_and_large_tail():
    """Test preserve_prefix_message_count with large tail and token limit."""
    cm = ContextManager(token_limit=50, preserve_prefix_message_count=5)
    msgs = [HumanMessage(content="prefix", type="human") for _ in range(5)] + \
           [AIMessage(content="tail", type="ai") for _ in range(20)]
    orig_state = make_state(msgs)
    codeflash_output = cm.compress_messages(orig_state.copy()); result = codeflash_output # 28.8μs -> 29.3μs (1.67% slower)
    # Should keep all prefix messages (if they fit), and as many tail messages as possible
    prefix = result["messages"][:5]
    tail = result["messages"][5:]
    for m in prefix:
        pass
    for m in tail:
        pass
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from src.utils.context_manager import ContextManager

def test_ContextManager_compress_messages():
    ContextManager.compress_messages(ContextManager(-1, preserve_prefix_message_count=1), {'messages': ''})

def test_ContextManager_compress_messages_2():
    ContextManager.compress_messages(ContextManager(0, preserve_prefix_message_count=0), {})

def test_ContextManager_compress_messages_3():
    ContextManager.compress_messages(ContextManager(0, preserve_prefix_message_count=0), {'messages': ''})
🔎 Concolic Coverage Tests and Runtime
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| codeflash_concolic_0_gkn0tr/tmp4i8q2xt7/test_concolic_coverage.py::test_ContextManager_compress_messages | 494μs | 495μs | -0.320% ⚠️ |
| codeflash_concolic_0_gkn0tr/tmp4i8q2xt7/test_concolic_coverage.py::test_ContextManager_compress_messages_2 | 392μs | 394μs | -0.465% ⚠️ |
| codeflash_concolic_0_gkn0tr/tmp4i8q2xt7/test_concolic_coverage.py::test_ContextManager_compress_messages_3 | 1.35μs | 1.82μs | -25.4% ⚠️ |

To edit these changes, run `git checkout codeflash/optimize-ContextManager.compress_messages-mguzugc1` and push.

Codeflash

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 October 17, 2025 15:18
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Oct 17, 2025