
Conversation


@codeflash-ai codeflash-ai bot commented Oct 17, 2025

📄 19% (0.19x) speedup for ContextManager.compress_messages in src/utils/context_manager.py

⏱️ Runtime : 12.5 milliseconds → 10.5 milliseconds (best of 109 runs)

📝 Explanation and details

The optimized code achieves a 19% speedup through several key performance optimizations:

1. Eliminated redundant token counting in count_tokens()

  • Replaced explicit loop with sum() generator expression and local variable caching
  • Cached self._count_message_tokens as a local variable to avoid repeated attribute lookups in the hot loop
  • Profiling shows a 40% reduction in time (34.9ms → 20.9ms) for this frequently called method (see the sketch below)
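
A minimal sketch of this pattern, using a simplified stand-in class; the token heuristic below is a placeholder assumption, not the project's actual estimator:

```python
# Illustrative stand-in for the optimization; not the real ContextManager.
class ContextManagerSketch:
    def _count_message_tokens(self, message) -> int:
        # Placeholder estimate: the real estimator also weights message type and kwargs.
        return max(1, len(getattr(message, "content", "") or ""))

    def count_tokens(self, messages) -> int:
        # Bind the bound method once so the hot loop avoids repeated attribute lookups,
        # and let sum() over a generator replace the explicit accumulation loop.
        count_message_tokens = self._count_message_tokens
        return sum(count_message_tokens(message) for message in messages)
```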

2. Avoided duplicate computation in compress_messages()

  • Added token_count = self.count_tokens(messages) to compute once and reuse
  • Previously, self.count_tokens(messages) was called twice: once in is_over_limit() and again in the logging statement
  • This eliminates an expensive recomputation of token counts for the same message list (see the sketch below)
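
Sketched below, assuming the manager exposes token_limit, count_tokens(), and _compress_messages(); compress_messages_once is a hypothetical free function used only to show the "count once, reuse" shape, not the PR's method:

```python
import logging

logger = logging.getLogger(__name__)

def compress_messages_once(manager, state):
    # Hypothetical illustration of computing the token count a single time and reusing it.
    messages = state.get("messages") if isinstance(state, dict) else None
    if not messages or manager.token_limit is None:
        return state
    token_count = manager.count_tokens(messages)  # computed exactly once
    if token_count <= manager.token_limit:
        return state
    # The cached token_count is reused in the log message instead of recounting.
    logger.debug("Compressing %s tokens over the %s-token limit", token_count, manager.token_limit)
    state["messages"] = manager._compress_messages(messages)
    return state
```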

3. Micro-optimizations in _compress_messages()

  • Cached self._count_message_tokens and self._truncate_message_content as local variables
  • Replaced expensive list slicing messages[len(prefix_messages):] with direct indexing using prefix_count
  • Optimized suffix message building with append() followed by a single reverse() instead of repeated list concatenation [item] + list (see the sketch below)
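
A rough illustration of the suffix-building change; build_suffix and its parameters are assumptions made for this sketch and do not mirror the real _compress_messages signature:

```python
def build_suffix(messages, prefix_count, budget, count_message_tokens):
    # Hypothetical helper: walk the tail by index (no messages[prefix_count:] copy) and
    # build the kept suffix with append() plus one reverse() instead of [item] + list.
    suffix = []
    for i in range(len(messages) - 1, prefix_count - 1, -1):
        cost = count_message_tokens(messages[i])
        if cost > budget:
            break
        budget -= cost
        suffix.append(messages[i])  # O(1) append
    suffix.reverse()                # a single reverse restores chronological order
    return suffix
```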

4. Performance characteristics by test case:

  • Small inputs (empty/single messages): 30-32% slower due to optimization overhead
  • Medium inputs (multiple messages): 5-10% slower for simple cases
  • Large-scale inputs (1000+ messages): 50-52% faster where optimizations shine
  • Complex compression scenarios: 0.3-5% faster with reduced redundant operations

The optimizations are most effective for large message lists where token counting dominates runtime, making this ideal for production scenarios with extensive conversation histories.

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 69 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 6 Passed |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import pytest
from langchain_core.messages import (AIMessage, BaseMessage, HumanMessage,
                                     SystemMessage, ToolMessage)
from src.utils.context_manager import ContextManager

# function to test
# (Assume the ContextManager class and compress_messages method are already defined above.)

# Helper function to create a state dict for testing
def make_state(messages):
    return {"messages": messages}

# Helper function to extract message contents for easier assertion
def get_contents(messages):
    return [m.content for m in messages]

# -------------------
# Basic Test Cases
# -------------------

def test_no_compression_needed():
    """Messages fit within token limit, should not be compressed or truncated."""
    cm = ContextManager(token_limit=100)
    msgs = [
        HumanMessage(content="Hello!"),
        AIMessage(content="Hi there!"),
        HumanMessage(content="How are you?"),
    ]
    state = make_state(msgs)
    codeflash_output = cm.compress_messages(state.copy()); result = codeflash_output # 8.10μs -> 8.79μs (7.81% slower)

def test_compression_simple_truncation():
    """Messages exceed token limit, last message should be truncated."""
    cm = ContextManager(token_limit=15)
    msgs = [
        HumanMessage(content="0123456789"),    # 10 tokens
        AIMessage(content="abcdefghij"),       # 10 tokens (will be truncated)
    ]
    state = make_state(msgs)
    codeflash_output = cm.compress_messages(state.copy()); result = codeflash_output # 6.51μs -> 7.07μs (7.86% slower)

def test_preserve_prefix_message_count():
    """Prefix messages should be preserved as much as possible, even if token limit is low."""
    cm = ContextManager(token_limit=12, preserve_prefix_message_count=2)
    msgs = [
        SystemMessage(content="sysmsg"),       # 6 tokens
        HumanMessage(content="humanmsg"),      # 8 tokens (will be truncated)
        AIMessage(content="aimsg"),            # 5 tokens, should be dropped
    ]
    state = make_state(msgs)
    codeflash_output = cm.compress_messages(state.copy()); result = codeflash_output # 7.59μs -> 8.37μs (9.21% slower)


def test_empty_messages():
    """Empty message list should return unchanged."""
    cm = ContextManager(token_limit=10)
    msgs = []
    state = make_state(msgs)
    codeflash_output = cm.compress_messages(state.copy()); result = codeflash_output # 1.48μs -> 2.21μs (32.8% slower)

# -------------------
# Edge Test Cases
# -------------------

def test_token_limit_none_returns_original():
    """If token_limit is None, should return original state."""
    cm = ContextManager(token_limit=None)
    msgs = [HumanMessage(content="abc")]
    state = make_state(msgs)
    codeflash_output = cm.compress_messages(state.copy()); result = codeflash_output # 470μs -> 468μs (0.398% faster)

def test_state_missing_messages_key():
    """If state dict missing 'messages' key, should return original state."""
    cm = ContextManager(token_limit=10)
    state = {"not_messages": []}
    codeflash_output = cm.compress_messages(state.copy()); result = codeflash_output # 393μs -> 397μs (1.04% slower)

def test_message_exact_token_fit():
    """Messages exactly fit token limit, should not be compressed."""
    cm = ContextManager(token_limit=10)
    msg = HumanMessage(content="abcdefghij")  # 10 tokens
    state = make_state([msg])
    codeflash_output = cm.compress_messages(state.copy()); result = codeflash_output # 5.89μs -> 6.41μs (8.05% slower)

def test_message_exceeds_token_limit_by_one():
    """Message exceeds token limit by one, should be truncated by one character."""
    cm = ContextManager(token_limit=9)
    msg = HumanMessage(content="abcdefghij")  # 10 tokens
    state = make_state([msg])
    codeflash_output = cm.compress_messages(state.copy()); result = codeflash_output # 4.78μs -> 5.10μs (6.28% slower)

def test_preserve_prefix_message_count_exceeds_messages():
    """Preserve prefix count greater than number of messages should not error."""
    cm = ContextManager(token_limit=100, preserve_prefix_message_count=10)
    msgs = [HumanMessage(content="msg1"), AIMessage(content="msg2")]
    state = make_state(msgs)
    codeflash_output = cm.compress_messages(state.copy()); result = codeflash_output # 6.61μs -> 7.13μs (7.25% slower)

def test_truncate_message_content_preserves_other_fields():
    """Truncated message should preserve all other attributes except content."""
    cm = ContextManager(token_limit=5)
    msg = AIMessage(content="abcdefghij", additional_kwargs={"foo": "bar"})
    state = make_state([msg])
    codeflash_output = cm.compress_messages(state.copy()); result = codeflash_output # 7.38μs -> 7.99μs (7.62% slower)


def test_message_with_additional_kwargs_tool_calls():
    """Message with additional_kwargs including 'tool_calls' should add extra tokens."""
    cm = ContextManager(token_limit=60)
    msg = AIMessage(content="short", additional_kwargs={"tool_calls": "call"})
    state = make_state([msg])
    # Should not be truncated, as token estimation includes extra tokens
    codeflash_output = cm.compress_messages(state.copy()); result = codeflash_output # 8.79μs -> 9.94μs (11.6% slower)

def test_message_with_large_additional_kwargs():
    """Message with large additional_kwargs should be dropped if token limit is small."""
    cm = ContextManager(token_limit=10)
    msg = AIMessage(content="short", additional_kwargs={"long": "x" * 100})
    state = make_state([msg])
    codeflash_output = cm.compress_messages(state.copy()); result = codeflash_output # 544μs -> 539μs (0.972% faster)

# -------------------
# Large Scale Test Cases
# -------------------

def test_many_messages_compression():
    """Test with many messages, only last messages should be preserved up to token limit."""
    cm = ContextManager(token_limit=100)
    # Each message is 10 tokens, so only 10 messages should fit
    msgs = [HumanMessage(content=f"msg{i:02d}abcdefgh") for i in range(20)]
    state = make_state(msgs)
    codeflash_output = cm.compress_messages(state.copy()); result = codeflash_output # 28.8μs -> 29.1μs (1.16% slower)

def test_large_messages_with_prefix_preservation():
    """Test large messages with prefix preservation, only prefix and last messages should be kept."""
    cm = ContextManager(token_limit=100, preserve_prefix_message_count=3)
    msgs = [SystemMessage(content="sysmsg" * 5), HumanMessage(content="humanmsg" * 5)] + \
           [AIMessage(content=f"aimsg{i}" * 5) for i in range(18)]
    state = make_state(msgs)
    codeflash_output = cm.compress_messages(state.copy()); result = codeflash_output # 625μs -> 594μs (5.36% faster)
    # Total messages should not exceed token limit
    total_tokens = sum([cm._count_message_tokens(m) for m in result["messages"]])

def test_compression_performance_large_scale():
    """Performance: compress_messages should run quickly on 1000 messages."""
    import time
    cm = ContextManager(token_limit=500)
    msgs = [HumanMessage(content="x" * 5) for _ in range(1000)]
    state = make_state(msgs)
    start = time.time()
    codeflash_output = cm.compress_messages(state.copy()); result = codeflash_output # 2.87ms -> 1.89ms (52.1% faster)
    end = time.time()


def test_large_messages_with_truncation():
    """Large messages that require truncation to fit token limit."""
    cm = ContextManager(token_limit=50)
    msgs = [HumanMessage(content="A" * 20), AIMessage(content="B" * 40), SystemMessage(content="C" * 30)]
    state = make_state(msgs)
    codeflash_output = cm.compress_messages(state.copy()); result = codeflash_output # 12.0μs -> 13.1μs (8.79% slower)
    # Should keep as much as possible, possibly truncating last message
    total_content_length = sum(len(m.content) for m in result["messages"])
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import pytest  # used for our unit tests
from langchain_core.messages import (AIMessage, BaseMessage, HumanMessage,
                                     SystemMessage, ToolMessage)
from src.utils.context_manager import ContextManager

# Function to test is defined above: ContextManager.compress_messages

# Helper to create a state dict
def make_state(messages):
    return {"messages": messages}

# Helper to extract message contents for easier comparison
def get_contents(messages):
    return [m.content for m in messages]

# Basic Test Cases

def test_no_token_limit_returns_original():
    """If token_limit is None, should return original state unmodified."""
    cm = ContextManager(token_limit=None)
    orig_state = make_state([HumanMessage(content="hello", type="human")])
    codeflash_output = cm.compress_messages(orig_state.copy()); result = codeflash_output # 443μs -> 442μs (0.201% faster)

def test_messages_under_limit_are_unchanged():
    """If messages are under the token limit, no compression should occur."""
    cm = ContextManager(token_limit=100)
    msgs = [
        SystemMessage(content="sys", type="system"),
        HumanMessage(content="hi", type="human"),
        AIMessage(content="hello", type="ai"),
    ]
    orig_state = make_state(msgs)
    codeflash_output = cm.compress_messages(orig_state.copy()); result = codeflash_output # 9.06μs -> 9.90μs (8.49% slower)

def test_messages_exactly_at_limit():
    """Messages exactly at the token limit should not be compressed."""
    cm = ContextManager(token_limit=6)
    msgs = [
        HumanMessage(content="abc", type="human"),  # 3 tokens (content) + 2 (type) = 5
        AIMessage(content="d", type="ai"),          # 1 (content) + 2 (type) = 3 * 1.2 = 3.6 -> int(3.6)=3
    ]
    # Total: 5 + 3 = 8 > 6, so should be compressed
    orig_state = make_state(msgs)
    codeflash_output = cm.compress_messages(orig_state.copy()); result = codeflash_output # 6.10μs -> 6.38μs (4.42% slower)

def test_preserve_prefix_message_count_preserves_head():
    """Should preserve the specified number of prefix messages, even if it means truncating the last preserved one."""
    cm = ContextManager(token_limit=10, preserve_prefix_message_count=2)
    msgs = [
        SystemMessage(content="sys", type="system"),      # 3+6=9*1.1=9.9->9
        HumanMessage(content="abcdef", type="human"),     # 6+5=11
        AIMessage(content="hello world", type="ai"),      # 11+2=13*1.2=15.6->15
    ]
    orig_state = make_state(msgs)
    codeflash_output = cm.compress_messages(orig_state.copy()); result = codeflash_output # 7.42μs -> 8.40μs (11.7% slower)

def test_messages_are_compressed_from_tail():
    """Should compress messages from the tail, preserving prefix if specified."""
    cm = ContextManager(token_limit=10)
    msgs = [
        HumanMessage(content="first", type="human"),      # 5+5=10
        AIMessage(content="second", type="ai"),           # 6+2=8*1.2=9.6->9
        SystemMessage(content="third", type="system"),    # 5+6=11*1.1=12.1->12
    ]
    orig_state = make_state(msgs)
    codeflash_output = cm.compress_messages(orig_state.copy()); result = codeflash_output # 7.05μs -> 7.83μs (9.94% slower)
    if len(result["messages"]) == 2:
        pass
    else:
        pass

def test_truncate_message_content_preserves_other_fields():
    """When truncating, should only modify content, not type or additional_kwargs."""
    cm = ContextManager(token_limit=2)
    msg = HumanMessage(content="abcdef", type="human", additional_kwargs={"foo": "bar"})
    truncated = cm._truncate_message_content(msg, 2)

# Edge Test Cases

def test_empty_messages_list():
    """Empty messages list should return unchanged."""
    cm = ContextManager(token_limit=10)
    orig_state = make_state([])
    codeflash_output = cm.compress_messages(orig_state.copy()); result = codeflash_output # 1.20μs -> 1.76μs (31.7% slower)

def test_missing_messages_key_in_state():
    """If state dict is missing 'messages', should return original state."""
    cm = ContextManager(token_limit=10)
    orig_state = {"foo": "bar"}
    codeflash_output = cm.compress_messages(orig_state.copy()); result = codeflash_output # 437μs -> 438μs (0.379% slower)

def test_non_dict_state():
    """If state is not a dict, should return it unchanged."""
    cm = ContextManager(token_limit=10)
    orig_state = ["not", "a", "dict"]
    codeflash_output = cm.compress_messages(orig_state); result = codeflash_output # 395μs -> 394μs (0.274% faster)

def test_message_with_empty_content():
    """Messages with empty content should still count at least 1 token."""
    cm = ContextManager(token_limit=1)
    msg = HumanMessage(content="", type="human")
    orig_state = make_state([msg])
    codeflash_output = cm.compress_messages(orig_state.copy()); result = codeflash_output # 4.86μs -> 5.22μs (6.84% slower)


def test_message_with_additional_kwargs_and_tool_calls():
    """Messages with additional_kwargs and tool_calls should increase token count."""
    cm = ContextManager(token_limit=60)
    msg = AIMessage(content="short", type="ai", additional_kwargs={"tool_calls": [{"foo": "bar"}]})
    orig_state = make_state([msg])
    # Should count content + type + extra_str + 50 for tool_calls, then *1.2 for AIMessage
    codeflash_output = cm.compress_messages(orig_state.copy()); result = codeflash_output # 10.3μs -> 11.3μs (9.19% slower)

def test_message_with_long_additional_kwargs():
    """Messages with large additional_kwargs should be truncated if over limit."""
    cm = ContextManager(token_limit=10)
    msg = AIMessage(content="short", type="ai", additional_kwargs={"foo": "x"*100})
    orig_state = make_state([msg])
    codeflash_output = cm.compress_messages(orig_state.copy()); result = codeflash_output # 544μs -> 542μs (0.312% faster)

def test_preserve_prefix_message_count_greater_than_messages():
    """Preserve count greater than messages should not error and preserve all."""
    cm = ContextManager(token_limit=100, preserve_prefix_message_count=10)
    msgs = [
        HumanMessage(content="a", type="human"),
        AIMessage(content="b", type="ai"),
    ]
    orig_state = make_state(msgs)
    codeflash_output = cm.compress_messages(orig_state.copy()); result = codeflash_output # 7.11μs -> 7.67μs (7.29% slower)

def test_token_limit_zero():
    """Zero token limit should return empty message list."""
    cm = ContextManager(token_limit=0)
    msgs = [
        HumanMessage(content="a", type="human"),
        AIMessage(content="b", type="ai"),
    ]
    orig_state = make_state(msgs)
    codeflash_output = cm.compress_messages(orig_state.copy()); result = codeflash_output # 484μs -> 485μs (0.039% slower)

def test_token_limit_one():
    """Token limit one should return only one token's worth of content from last message."""
    cm = ContextManager(token_limit=1)
    msgs = [
        HumanMessage(content="abc", type="human"),
        AIMessage(content="def", type="ai"),
    ]
    orig_state = make_state(msgs)
    codeflash_output = cm.compress_messages(orig_state.copy()); result = codeflash_output # 460μs -> 459μs (0.133% faster)

# Large Scale Test Cases

def test_large_number_of_short_messages():
    """Test compressing a large number of short messages."""
    cm = ContextManager(token_limit=500)
    msgs = [HumanMessage(content=str(i), type="human") for i in range(300)]  # Each content ~1-3 tokens
    orig_state = make_state(msgs)
    codeflash_output = cm.compress_messages(orig_state.copy()); result = codeflash_output # 242μs -> 243μs (0.153% slower)
    # Should fit as many as possible from the tail
    total_tokens = 0
    for m in reversed(msgs):
        t = cm._count_message_tokens(m)
        if total_tokens + t > 500:
            break
        total_tokens += t
    # Should keep last N messages that fit in 500 tokens
    expected = msgs[-(total_tokens // 2):] if total_tokens else []

def test_large_message_content_truncation():
    """Test that a single very large message is truncated to fit token limit."""
    long_content = "x" * 1000
    cm = ContextManager(token_limit=100)
    msg = HumanMessage(content=long_content, type="human")
    orig_state = make_state([msg])
    codeflash_output = cm.compress_messages(orig_state.copy()); result = codeflash_output # 616μs -> 587μs (4.92% faster)


def test_performance_with_many_messages():
    """Ensure function does not crash or hang with near-1000 messages."""
    cm = ContextManager(token_limit=500)
    msgs = [HumanMessage(content="hello", type="human") for _ in range(999)]
    orig_state = make_state(msgs)
    codeflash_output = cm.compress_messages(orig_state.copy()); result = codeflash_output # 2.88ms -> 1.91ms (50.6% faster)

def test_preserve_prefix_and_large_tail():
    """Test preserve_prefix_message_count with large tail and token limit."""
    cm = ContextManager(token_limit=50, preserve_prefix_message_count=5)
    msgs = [HumanMessage(content="prefix", type="human") for _ in range(5)] + \
           [AIMessage(content="tail", type="ai") for _ in range(20)]
    orig_state = make_state(msgs)
    codeflash_output = cm.compress_messages(orig_state.copy()); result = codeflash_output # 28.8μs -> 29.3μs (1.67% slower)
    # Should keep all prefix messages (if they fit), and as many tail messages as possible
    prefix = result["messages"][:5]
    tail = result["messages"][5:]
    for m in prefix:
        pass
    for m in tail:
        pass
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from src.utils.context_manager import ContextManager

def test_ContextManager_compress_messages():
    ContextManager.compress_messages(ContextManager(-1, preserve_prefix_message_count=1), {'messages': ''})

def test_ContextManager_compress_messages_2():
    ContextManager.compress_messages(ContextManager(0, preserve_prefix_message_count=0), {})

def test_ContextManager_compress_messages_3():
    ContextManager.compress_messages(ContextManager(0, preserve_prefix_message_count=0), {'messages': ''})
🔎 Concolic Coverage Tests and Runtime
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| codeflash_concolic_0_gkn0tr/tmp4i8q2xt7/test_concolic_coverage.py::test_ContextManager_compress_messages | 494μs | 495μs | -0.320% ⚠️ |
| codeflash_concolic_0_gkn0tr/tmp4i8q2xt7/test_concolic_coverage.py::test_ContextManager_compress_messages_2 | 392μs | 394μs | -0.465% ⚠️ |
| codeflash_concolic_0_gkn0tr/tmp4i8q2xt7/test_concolic_coverage.py::test_ContextManager_compress_messages_3 | 1.35μs | 1.82μs | -25.4% ⚠️ |

To edit these changes, run `git checkout codeflash/optimize-ContextManager.compress_messages-mguzugc1` and push.

Codeflash

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 October 17, 2025 15:18
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Oct 17, 2025