
Conversation

@mohammedahmed18
Contributor

@mohammedahmed18 mohammedahmed18 commented Nov 27, 2025

PR Type

Enhancement, Tests


Description

  • Add AI-driven code repair flow

  • Return rich test diff metadata

  • Parse pytest failures from stdout

  • Deduplicate candidate evaluation


Diagram Walkthrough

flowchart LR
  OPT["FunctionOptimizer"] -- "compare results" --> EQ["compare_test_results"]
  EQ -- "mismatch + diffs" --> REPAIR["AiServiceClient /code_repair"]
  REPAIR -- "new candidate" --> OPT
  PARSER["parse_test_failures_from_stdout"] -- "test_failures map" --> TR["TestResults"]

File Walkthrough

Relevant files

Enhancement (5 files)

aiservice.py: Add code repair request/response handling (+51/-1)
models.py: Introduce TestDiff and code repair request models (+69/-0)
function_optimizer.py: Integrate repair loop and candidate deduplication (+157/-35)
equivalence.py: Return match flag with detailed TestDiffs (+56/-26)
parse_test_output.py: Parse pytest failures into TestResults (+60/-0)

Tests (5 files)

test_codeflash_capture.py: Adapt tests to new compare API and add E2E repair scenario (+305/-6)
test_comparator.py: Update assertions for tuple return from comparator (+14/-7)
test_instrument_all_and_run.py: Replace boolean compares with (match, diffs) usage (+16/-8)
test_instrumentation_run_results_aiservice.py: Adjust to new comparator API and expectations (+6/-4)
test_pickle_patcher.py: Migrate to match/diffs comparator semantics (+4/-4)

@github-actions

github-actions bot commented Nov 27, 2025

PR Reviewer Guide 🔍

(Review updated until commit 79387c3)

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue

In compare_test_results, when original_test_result or cdd_test_result is None the function immediately returns (False, []). This short-circuit discards any diffs accumulated so far and may hide useful context, and it can leave the result inconsistent when some tests have already been processed. Consider appending a DID_PASS/TIMED_OUT diff or continuing to gather diffs before returning (a sketch follows the excerpt below).

# If helper function instance_state verification is not present, that's ok. continue
if (
    original_test_result.verification_type
    and original_test_result.verification_type == VerificationType.INIT_STATE_HELPER
    and cdd_test_result is None
):
    continue
if original_test_result is None or cdd_test_result is None:
    return False, []
did_all_timeout = did_all_timeout and original_test_result.timed_out
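
One way to address this (a minimal sketch, not the PR's implementation; the TestDiff fields are taken from their usage elsewhere in this review and other required fields may be omitted) is to record a DID_PASS diff for the one-sided invocation before returning, so the caller still sees which test diverged:

if original_test_result is None or cdd_test_result is None:
    # sketch only: record the one-sided result instead of silently returning (False, [])
    present = original_test_result or cdd_test_result
    test_diffs.append(
        TestDiff(
            scope=TestDiffScope.DID_PASS,
            original_value=str(original_test_result.did_pass) if original_test_result else "missing",
            candidate_value=str(cdd_test_result.did_pass) if cdd_test_result else "missing",
            test_src_code=present.id.get_src_code(present.file_name) if present else "",
            original_pass=bool(original_test_result and original_test_result.did_pass),
            candidate_pass=bool(cdd_test_result and cdd_test_result.did_pass),
        )
    )
    return False, test_diffs
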
Robustness

parse_test_failures_from_stdout relies on regex parsing of pytest stdout blocks, which can vary across pytest versions and plugins. The end-of-failures detection and header matching may be brittle; add guards for different formats or fall back to structured pytest output such as --junitxml or pytest-reportlog where available (a sketch follows the excerpt below).

def parse_test_failures_from_stdout(test_results: TestResults, stdout: str) -> TestResults:
    """Extract individual pytest test failures from stdout grouped by test case qualified name, and add them to the test results."""
    lines = stdout.splitlines()
    start = end = None

    for i, line in enumerate(lines):
        if FAILURES_HEADER_RE.search(line.strip()):
            start = i
            break

    if start is None:
        return test_results

    for j in range(start + 1, len(lines)):
        stripped = lines[j].strip()
        if "short test summary info" in stripped:
            end = j
            break
        # any new === section === block
        if stripped.startswith("=") and stripped.count("=") > 3:
            end = j
            break

    # If no clear "end", just grab the rest of the string
    if end is None:
        end = len(lines)

    failure_block = lines[start:end]

    failures: dict[str, str] = {}
    current_name = None
    current_lines: list[str] = []

    for line in failure_block:
        m = TEST_HEADER_RE.match(line.strip())
        if m:
            if current_name is not None:
                failures[current_name] = "".join(current_lines)

            current_name = m.group(1)
            current_lines = []
        elif current_name:
            current_lines.append(line + "\n")

    if current_name:
        failures[current_name] = "".join(current_lines)

    test_results.test_failures = failures
    return test_results
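
If stdout parsing proves too brittle, one possible fallback (a hypothetical sketch, not part of this PR) is to read failures from pytest's built-in --junitxml report, which is structured and stable across plugins; the helper name and the qualified-name construction below are assumptions:

import xml.etree.ElementTree as ET
from pathlib import Path


def parse_test_failures_from_junitxml(junitxml_path: Path) -> dict[str, str]:
    """Hypothetical fallback: collect failure text per test from a pytest --junitxml report."""
    failures: dict[str, str] = {}
    root = ET.parse(junitxml_path).getroot()
    for testcase in root.iter("testcase"):
        failure = testcase.find("failure")
        if failure is None:
            failure = testcase.find("error")
        if failure is None:
            continue
        # classname + name approximates the qualified test name used as the map key
        qualified_name = ".".join(p for p in (testcase.get("classname"), testcase.get("name")) if p)
        failures[qualified_name] = (failure.get("message") or "") + "\n" + (failure.text or "")
    return failures
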
State Consistency

ast_code_to_id is a mutable instance attribute used across candidate runs and recursive code-repair calls; ensure it is correctly reset on each determine_best_candidate cycle and stays consistent after code_repair recursion, to avoid stale mappings or mismatched optimization_ids (a defensive sketch follows the excerpt below).

logger.info(
    f"Determining best optimization candidate (out of {len(candidates)}) for "
    f"{self.function_to_optimize.qualified_name}…"
)
console.rule()

future_all_refinements: list[concurrent.futures.Future] = []
self.ast_code_to_id.clear()
valid_optimizations = []
optimizations_post = {}  # we need to overwrite some opt candidates' code strings as they are no longer evaluated, instead their shorter/longer versions might be evaluated

# Start a new thread for AI service request
ai_service_client = self.aiservice_client if exp_type == "EXP0" else self.local_aiservice_client
future_line_profile_results = self.executor.submit(
    ai_service_client.optimize_python_code_line_profiler,
    source_code=code_context.read_writable_code.markdown,
    dependency_code=code_context.read_only_context_code,
    trace_id=self.function_trace_id[:-4] + exp_type if self.experiment_id else self.function_trace_id,
    line_profiler_results=original_code_baseline.line_profile_results["str_out"],
    num_candidates=N_CANDIDATES_LP_EFFECTIVE,
    experiment_metadata=ExperimentMetadata(
        id=self.experiment_id, group="control" if exp_type == "EXP0" else "experiment"
    )
    if self.experiment_id
    else None,
)

# Initialize candidate processor
processor = CandidateProcessor(candidates, future_line_profile_results, future_all_refinements)
candidate_index = 0

# Process candidates using queue-based approach
while not processor.is_done():
    candidate = processor.get_next_candidate()
    if candidate is None:
        logger.debug("everything done, exiting")
        break

    try:
        candidate_index += 1
        get_run_tmp_file(Path(f"test_return_values_{candidate_index}.bin")).unlink(missing_ok=True)
        get_run_tmp_file(Path(f"test_return_values_{candidate_index}.sqlite")).unlink(missing_ok=True)
        logger.info(f"h3|Optimization candidate {candidate_index}/{processor.candidate_len}:")
        code_print(
            candidate.source_code.flat,
            file_name=f"candidate_{candidate_index}.py",
            lsp_message_id=LSPMessageId.CANDIDATE.value,
        )
        # map ast normalized code to diff len, unnormalized code
        # map opt id to the shortest unnormalized code
        try:
            did_update = self.replace_function_and_helpers_with_optimized_code(
                code_context=code_context,
                optimized_code=candidate.source_code,
                original_helper_code=original_helper_code,
            )
            if not did_update:
                logger.warning(
                    "force_lsp|No functions were replaced in the optimized code. Skipping optimization candidate."
                )
                console.rule()
                continue
        except (ValueError, SyntaxError, cst.ParserSyntaxError, AttributeError) as e:
            logger.error(e)
            self.write_code_and_helpers(
                self.function_to_optimize_source_code, original_helper_code, self.function_to_optimize.file_path
            )
            continue
        # check if this code has been evaluated before by checking the ast normalized code string
        normalized_code = normalize_code(candidate.source_code.flat.strip())
        if self.was_candidate_tested_before(normalized_code):
            self.update_results_for_duplicate_candidate(
                candidate=candidate,
                code_context=code_context,
                normalized_code=normalized_code,
                speedup_ratios=speedup_ratios,
                is_correct=is_correct,
                optimized_runtimes=optimized_runtimes,
                optimized_line_profiler_results=optimized_line_profiler_results,
                optimizations_post=optimizations_post,
            )
            continue
        self.ast_code_to_id[normalized_code] = {
            "optimization_id": candidate.optimization_id,
            "shorter_source_code": candidate.source_code,
            "diff_len": diff_length(candidate.source_code.flat, code_context.read_writable_code.flat),
        }
        run_results, new_candidate = self.run_optimized_candidate(
            optimization_candidate_index=candidate_index,
            baseline_results=original_code_baseline,
            original_helper_code=original_helper_code,
            file_path_to_helper_classes=file_path_to_helper_classes,
            code_context=code_context,
            candidate=candidate,
            exp_type=exp_type,
        )
        if candidate.optimization_id != new_candidate.optimization_id:
            # override the candidate if the optimization_id has changed, this may happen if the candidate was modified by the code-repair
            candidate = new_candidate

        console.rule()
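
A defensive pattern for the recursion concern above (a hypothetical sketch, not code from this PR; attempt_code_repair is an invented name standing in for the repair call): snapshot the mapping before the repair attempt and restore it if the repaired candidate is rejected, so entries added for a failed repair cannot shadow later candidates.

import copy

# Hypothetical guard: keep ast_code_to_id consistent across a recursive repair attempt.
snapshot = copy.deepcopy(self.ast_code_to_id)
repaired_candidate = self.attempt_code_repair(candidate, diffs)  # invented helper name
if repaired_candidate is None:
    # the repair produced nothing usable; drop any mappings added during the attempt
    self.ast_code_to_id = snapshot
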

@github-actions

github-actions bot commented Nov 27, 2025

PR Code Suggestions ✨

Latest suggestions up to 79387c3
Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Create independent diff records

Avoid reusing a single TestDiff instance for multiple mismatch scopes; it causes
mixed data if multiple fields differ. Create and append a fresh TestDiff per
mismatch, ensuring each recorded diff is independent and accurate.

codeflash/verification/equivalence.py [65-114]

 test_src_code = original_test_result.id.get_src_code(original_test_result.file_name)
-test_diff = TestDiff(
-    scope=TestDiffScope.RETURN_VALUE,
-    original_value=f"{original_test_result.return_value!r}",
-    candidate_value=f"{cdd_test_result.return_value!r}",
-    test_src_code=test_src_code,
-    candidate_pytest_error=cdd_pytest_error,
-    original_pass=original_test_result.did_pass,
-    candidate_pass=cdd_test_result.did_pass,
-    original_pytest_error=original_pytest_error,
-)
+
+# Return value diff
 if not comparator(original_test_result.return_value, cdd_test_result.return_value, superset_obj=superset_obj):
-    test_diff.scope = TestDiffScope.RETURN_VALUE
-    test_diffs.append(test_diff)
+    test_diffs.append(
+        TestDiff(
+            scope=TestDiffScope.RETURN_VALUE,
+            original_value=f"{original_test_result.return_value!r}",
+            candidate_value=f"{cdd_test_result.return_value!r}",
+            test_src_code=test_src_code,
+            candidate_pytest_error=cdd_pytest_error,
+            original_pass=original_test_result.did_pass,
+            candidate_pass=cdd_test_result.did_pass,
+            original_pytest_error=original_pytest_error,
+        )
+    )
 
-...
+# Stdout diff
 if (original_test_result.stdout and cdd_test_result.stdout) and not comparator(
     original_test_result.stdout, cdd_test_result.stdout
 ):
-    test_diff.scope = TestDiffScope.STDOUT
-    test_diff.original_value = str(original_test_result.stdout)
-    test_diff.candidate_value = str(cdd_test_result.stdout)
-    test_diffs.append(test_diff)
+    test_diffs.append(
+        TestDiff(
+            scope=TestDiffScope.STDOUT,
+            original_value=str(original_test_result.stdout),
+            candidate_value=str(cdd_test_result.stdout),
+            test_src_code=test_src_code,
+            candidate_pytest_error=cdd_pytest_error,
+            original_pass=original_test_result.did_pass,
+            candidate_pass=cdd_test_result.did_pass,
+            original_pytest_error=original_pytest_error,
+        )
+    )
 
+# Did-pass diff
 if original_test_result.test_type in {
     TestType.EXISTING_UNIT_TEST,
     TestType.CONCOLIC_COVERAGE_TEST,
     TestType.GENERATED_REGRESSION,
     TestType.REPLAY_TEST,
 } and (cdd_test_result.did_pass != original_test_result.did_pass):
-    test_diff.scope = TestDiffScope.DID_PASS
-    test_diff.original_value = str(original_test_result.did_pass)
-    test_diff.candidate_value = str(cdd_test_result.did_pass)
-    test_diffs.append(test_diff)
+    test_diffs.append(
+        TestDiff(
+            scope=TestDiffScope.DID_PASS,
+            original_value=str(original_test_result.did_pass),
+            candidate_value=str(cdd_test_result.did_pass),
+            test_src_code=test_src_code,
+            candidate_pytest_error=cdd_pytest_error,
+            original_pass=original_test_result.did_pass,
+            candidate_pass=cdd_test_result.did_pass,
+            original_pytest_error=original_pytest_error,
+        )
+    )
Suggestion importance[1-10]: 8


Why: The current code mutates and reuses a single TestDiff object across multiple scopes, risking mixed data; creating separate instances per mismatch is correct and significantly improves accuracy of reported diffs.

Impact: Medium
Prevent None access in comparisons

Guard against original_test_result being None before accessing its fields. When a
test id exists only on one side, the function currently short-circuits, but these
lines can still execute earlier and raise AttributeError. Add a defensive check to
only compute failure lookups when both results exist.

codeflash/verification/equivalence.py [31-42]

 candidate_test_failures = candidate_results.test_failures
 original_test_failures = original_results.test_failures
-cdd_pytest_error = (
-    candidate_test_failures.get(original_test_result.id.test_fn_qualified_name(), "")
-    if candidate_test_failures
-    else ""
-)
-original_pytest_error = (
-    original_test_failures.get(original_test_result.id.test_fn_qualified_name(), "")
-    if original_test_failures
-    else ""
-)
 
+cdd_pytest_error = ""
+original_pytest_error = ""
+
+if original_test_result is not None:
+    test_name = original_test_result.id.test_fn_qualified_name()
+    if candidate_test_failures:
+        cdd_pytest_error = candidate_test_failures.get(test_name, "")
+    if original_test_failures:
+        original_pytest_error = original_test_failures.get(test_name, "")
+
Suggestion importance[1-10]: 7


Why: This guards against accessing original_test_result.id when original_test_result could be None, which would raise an AttributeError. The change is accurate and low-risk, improving robustness without altering logic.

Impact: Medium
General
Fix mismatch ratio calculation

Use the count of compared test invocations, not the length of the TestResults
container, to compute the unmatched percentage. len(candidate_behavior_results) may
not reflect the number of tests compared and can be zero-dividing or misleading;
base the denominator on the number of unique ids or on len(diffs) + matched_count.

codeflash/optimization/function_optimizer.py [1869-1876]

-result_unmatched_perc = len(diffs) / len(candidate_behavior_results)
+total_compared = len(candidate_behavior_results.get_all_unique_invocation_loop_ids())
+result_unmatched_perc = (len(diffs) / total_compared) if total_compared else 1.0
Suggestion importance[1-10]: 6


Why: Using the count of unique invocation ids avoids misleading denominators and division by zero; the proposal is reasonable and improves correctness, though its impact is moderate within the broader flow.

Impact: Low

Previous suggestions

Suggestions up to commit 5830a70
Category | Suggestion | Impact
Possible issue
Safely access optional mapping

Guard access to test_failures since it can be None and avoid AttributeError. Also
handle missing keys safely to keep comparison robust when no failures were parsed.

codeflash/verification/equivalence.py [43]

-candidate_pytest_error = candidate_results.test_failures.get(original_test_result.id.test_function_name)
+candidate_pytest_error = None
+if getattr(candidate_results, "test_failures", None):
+    candidate_pytest_error = candidate_results.test_failures.get(original_test_result.id.test_function_name)
Suggestion importance[1-10]: 8


Why: test_failures is declared Optional in TestResults, so direct .get can raise if None; guarding prevents an AttributeError and aligns with new parsing logic.

Impact: Medium
Ensure recursion limit restoration

Preserve the recursion limit restoration even on early returns to avoid leaving the
process with a higher limit. Move recursion limit increase before any early return
or ensure restoration in all paths.

codeflash/verification/equivalence.py [30-35]

 if len(original_results) == 0 or len(candidate_results) == 0:
-    return False, []  # empty test results are not equal
+    return False, []
+original_recursion_limit = sys.getrecursionlimit()
+try:
+    if original_recursion_limit < INCREASED_RECURSION_LIMIT:
+        sys.setrecursionlimit(INCREASED_RECURSION_LIMIT)
+    # ... rest of the function body unchanged ...
+finally:
+    sys.setrecursionlimit(original_recursion_limit)
Suggestion importance[1-10]: 6


Why: Early return before saving/restoring the recursion limit can skip restoration if that logic ever moves; wrapping with try/finally improves robustness though current early return happens before any change.

Impact: Low
General
Use logger instead of print

Replace print with the existing logger to keep consistent output handling and avoid
noisy stdout in library code. Log the exception with traceback for better
diagnostics.

codeflash/verification/equivalence.py [76-87]

 try:
-    print(
-        f"File Name: {original_test_result.file_name}\n"
-        f"Test Type: {original_test_result.test_type}\n"
-        f"Verification Type: {original_test_result.verification_type}\n"
-        f"Invocation ID: {original_test_result.id}\n"
-        f"Original return value: {original_test_result.return_value}\n"
-        f"Candidate return value: {cdd_test_result.return_value}\n"
+    logger.debug(
+        "File Name: %s\nTest Type: %s\nVerification Type: %s\nInvocation ID: %s\nOriginal return value: %r\nCandidate return value: %r",
+        original_test_result.file_name,
+        original_test_result.test_type,
+        original_test_result.verification_type,
+        original_test_result.id,
+        original_test_result.return_value,
+        cdd_test_result.return_value,
     )
-except Exception as e:
-    logger.error(e)
+except Exception:
+    logger.exception("Failed to log return value comparison details")
 break
Suggestion importance[1-10]: 7


Why: Replacing print with logger.debug/exception keeps output consistent and avoids noisy stdout; the improved code accurately mirrors the existing block’s intent with better diagnostics.

Impact: Medium

@mohammedahmed18 mohammedahmed18 marked this pull request as draft November 27, 2025 14:27
Comment on lines 338 to 343
x = prev[index1]
y = prev[index1 + 1]
z = curr[index1]
min_xy = min(x, y)
min_xyz = min(z, min_xy)
curr[index1 + 1] = 1 + min_xyz

⚡️Codeflash found 73% (0.73x) speedup for levenshtein_distance in codeflash/discovery/functions_to_optimize.py

⏱️ Runtime: 2.04 seconds → 1.18 seconds (best of 8 runs)

📝 Explanation and details

The optimized version achieves a 73% speedup by eliminating Python's built-in min() function calls and replacing them with direct comparisons. This is a targeted micro-optimization that addresses one of the most expensive operations in the Levenshtein distance algorithm.

Key optimization:

  • Replaced min() calls with direct comparisons: The original code used min(x, y) and min(z, min_xy) which create temporary tuples and invoke Python's generic minimum function. The optimized version uses nested if statements to find the minimum value directly, avoiding function call overhead and tuple creation.

Why this provides a speedup:

  • The min() function in Python has significant overhead for small numbers of arguments, especially when called millions of times in nested loops
  • Direct comparisons (if x < y) are primitive operations that execute much faster than function calls
  • Eliminates temporary tuple creation that min() uses internally
  • Reduces the call stack depth in the inner loop

Performance impact by test case type:

  • Identical/similar strings: 55-65% faster - benefits from reduced overhead in character matching paths
  • Completely different strings: 109-121% faster - maximizes benefit since every character comparison triggers the min() replacement logic
  • Large strings with many differences: 83-93% faster - compounds the per-operation savings across many iterations
  • Small strings: 15-50% faster - still benefits but overhead reduction is less pronounced

The optimization is particularly effective for the Levenshtein algorithm because the min() operation occurs in the innermost loop that executes O(n×m) times, making even small per-call improvements significant when multiplied across all iterations.

Correctness verification report:

⚙️ Existing Unit Tests: 🔘 None Found
🌀 Generated Regression Tests: 148 Passed
⏪ Replay Tests: 🔘 None Found
🔎 Concolic Coverage Tests: 🔘 None Found
📊 Tests Coverage: 96.6%
🌀 Generated Regression Tests and Runtime
from __future__ import annotations

# imports
import pytest  # used for our unit tests

from codeflash.discovery.functions_to_optimize import levenshtein_distance

# unit tests

# 1. Basic Test Cases


def test_identical_strings():
    # Levenshtein distance between identical strings should be 0
    codeflash_output = levenshtein_distance("kitten", "kitten")  # 15.8μs -> 9.90μs (60.0% faster)
    codeflash_output = levenshtein_distance("", "")  # 480ns -> 441ns (8.84% faster)
    codeflash_output = levenshtein_distance("a", "a")  # 2.09μs -> 1.95μs (7.22% faster)


def test_single_insertion():
    # Inserting one character
    codeflash_output = levenshtein_distance("kitten", "kitte")  # 13.0μs -> 8.69μs (49.4% faster)
    codeflash_output = levenshtein_distance("kitte", "kitten")  # 10.1μs -> 6.12μs (65.6% faster)
    codeflash_output = levenshtein_distance("", "a")  # 421ns -> 421ns (0.000% faster)
    codeflash_output = levenshtein_distance("a", "")  # 360ns -> 361ns (0.277% slower)


def test_single_deletion():
    # Deleting one character
    codeflash_output = levenshtein_distance("kitten", "kittn")  # 12.6μs -> 8.52μs (48.5% faster)
    codeflash_output = levenshtein_distance("kittn", "kitten")  # 10.0μs -> 5.96μs (67.9% faster)


def test_single_substitution():
    # Substituting one character
    codeflash_output = levenshtein_distance("kitten", "sitten")  # 14.8μs -> 9.26μs (60.2% faster)
    codeflash_output = levenshtein_distance("kitten", "kitteb")  # 11.7μs -> 6.81μs (72.4% faster)
    codeflash_output = levenshtein_distance("a", "b")  # 2.22μs -> 1.89μs (17.5% faster)


def test_multiple_operations():
    # Multiple edits required
    codeflash_output = levenshtein_distance("kitten", "sitting")  # 16.4μs -> 10.3μs (58.8% faster)
    codeflash_output = levenshtein_distance("flaw", "lawn")  # 6.70μs -> 4.47μs (50.0% faster)


def test_empty_and_nonempty():
    # One string empty, one non-empty
    codeflash_output = levenshtein_distance("", "abc")  # 751ns -> 751ns (0.000% faster)
    codeflash_output = levenshtein_distance("abc", "")  # 431ns -> 451ns (4.43% slower)


# 2. Edge Test Cases


def test_both_empty():
    # Both strings are empty
    codeflash_output = levenshtein_distance("", "")  # 781ns -> 761ns (2.63% faster)


def test_one_char_vs_empty():
    # One string is a single character, other is empty
    codeflash_output = levenshtein_distance("a", "")  # 771ns -> 781ns (1.28% slower)
    codeflash_output = levenshtein_distance("", "z")  # 431ns -> 441ns (2.27% slower)


def test_case_sensitivity():
    # Case should matter
    codeflash_output = levenshtein_distance("abc", "Abc")  # 7.70μs -> 5.87μs (31.1% faster)
    codeflash_output = levenshtein_distance("ABC", "abc")  # 5.14μs -> 3.73μs (37.9% faster)


def test_unicode_characters():
    # Unicode characters
    codeflash_output = levenshtein_distance("café", "cafe")  # 9.39μs -> 6.81μs (37.8% faster)
    codeflash_output = levenshtein_distance("naïve", "naive")  # 9.85μs -> 5.75μs (71.3% faster)
    codeflash_output = levenshtein_distance("你好", "你")  # 3.12μs -> 2.81μs (10.7% faster)
    codeflash_output = levenshtein_distance("你好", "您好")  # 3.10μs -> 2.71μs (14.5% faster)


def test_completely_different_strings():
    # No characters in common
    codeflash_output = levenshtein_distance("abc", "xyz")  # 7.45μs -> 5.61μs (32.9% faster)
    codeflash_output = levenshtein_distance("123", "abc")  # 5.14μs -> 3.46μs (48.7% faster)


def test_prefix_and_suffix():
    # One string is a prefix or suffix of the other
    codeflash_output = levenshtein_distance("abc", "abcd")  # 7.88μs -> 6.11μs (29.0% faster)
    codeflash_output = levenshtein_distance("abcd", "abc")  # 5.18μs -> 3.78μs (37.1% faster)
    codeflash_output = levenshtein_distance("abc", "zabc")  # 5.23μs -> 3.41μs (53.6% faster)
    codeflash_output = levenshtein_distance("abc", "abcz")  # 4.87μs -> 3.19μs (52.8% faster)


def test_repeated_characters():
    # Strings with repeated characters
    codeflash_output = levenshtein_distance("aaa", "aaaa")  # 4.89μs -> 4.79μs (2.11% faster)
    codeflash_output = levenshtein_distance("aaaa", "aaa")  # 2.92μs -> 3.06μs (4.89% slower)
    codeflash_output = levenshtein_distance("aaa", "bbb")  # 5.54μs -> 3.56μs (55.7% faster)


def test_numbers_and_symbols():
    # Strings with digits and symbols
    codeflash_output = levenshtein_distance("1234", "1243")  # 8.68μs -> 6.73μs (28.9% faster)
    codeflash_output = levenshtein_distance("!@#$", "!@#")  # 5.76μs -> 4.13μs (39.6% faster)
    codeflash_output = levenshtein_distance("!@#$", "$#@!")  # 6.25μs -> 4.45μs (40.5% faster)


def test_long_identical_strings():
    # Long identical strings (edge, but also performance)
    s = "a" * 100
    codeflash_output = levenshtein_distance(s, s)  # 519μs -> 535μs (2.86% slower)


def test_long_strings_one_difference():
    # Long strings with one difference at the end
    s1 = "a" * 999 + "b"
    s2 = "a" * 1000
    codeflash_output = levenshtein_distance(s1, s2)  # 60.1ms -> 59.3ms (1.27% faster)
    codeflash_output = levenshtein_distance(s2, s1)  # 60.3ms -> 59.7ms (1.11% faster)


def test_long_strings_completely_different():
    # Long completely different strings
    s1 = "a" * 500
    s2 = "b" * 500
    codeflash_output = levenshtein_distance(s1, s2)  # 67.1ms -> 30.4ms (121% faster)


# 3. Large Scale Test Cases


def test_large_equal_strings():
    # Large identical strings
    s = "abcde" * 200  # length 1000
    codeflash_output = levenshtein_distance(s, s)  # 242ms -> 114ms (111% faster)


def test_large_one_insertion():
    # Large string with one insertion
    s1 = "a" * 500 + "b" + "a" * 499  # length 1000
    s2 = "a" * 1000
    codeflash_output = levenshtein_distance(s1, s2)  # 58.2ms -> 56.2ms (3.59% faster)


def test_large_one_substitution():
    # Large string with one substitution in the middle
    s1 = "a" * 499 + "b" + "a" * 500
    s2 = "a" * 1000
    codeflash_output = levenshtein_distance(s1, s2)  # 57.9ms -> 57.2ms (1.16% faster)


def test_large_completely_different():
    # Large strings, all substitutions
    s1 = "a" * 1000
    s2 = "b" * 1000
    codeflash_output = levenshtein_distance(s1, s2)  # 274ms -> 129ms (112% faster)


def test_large_half_and_half():
    # Half the string is the same, half is different
    s1 = "a" * 500 + "b" * 500
    s2 = "a" * 1000
    codeflash_output = levenshtein_distance(s1, s2)  # 171ms -> 93.5ms (83.5% faster)


def test_large_with_unicode():
    # Large string with unicode characters
    s1 = "你" * 500 + "好" * 500
    s2 = "你" * 1000
    codeflash_output = levenshtein_distance(s1, s2)  # 174ms -> 96.3ms (81.0% faster)


# 4. Additional Robustness Cases


@pytest.mark.parametrize(
    "s1,s2,expected",
    [
        ("", "", 0),
        ("", "abc", 3),
        ("abc", "", 3),
        ("abc", "abc", 0),
        ("abc", "ab", 1),
        ("a", "b", 1),
        ("", "a", 1),
        ("a", "", 1),
        ("kitten", "sitting", 3),
        ("flaw", "lawn", 2),
        ("intention", "execution", 5),
        ("distance", "difference", 5),
        ("abcdef", "azced", 3),
        ("short", "ports", 3),
    ],
)
def test_various_cases(s1, s2, expected):
    # Parametrized test for various scenarios
    codeflash_output = levenshtein_distance(s1, s2)  # 130μs -> 85.5μs (52.5% faster)


# 5. Commutativity property (Levenshtein distance is symmetric)
def test_commutativity():
    pairs = [
        ("kitten", "sitting"),
        ("flaw", "lawn"),
        ("abc", "xyz"),
        ("", "abc"),
        ("a" * 500, "b" * 500),
        ("abcde" * 100, "edcba" * 100),
    ]
    for s1, s2 in pairs:
        codeflash_output = levenshtein_distance(s1, s2)
        d1 = codeflash_output  # 126ms -> 58.6ms (116% faster)
        codeflash_output = levenshtein_distance(s2, s1)
        d2 = codeflash_output  # 126ms -> 58.8ms (115% faster)


# 6. Triangle inequality property
def test_triangle_inequality():
    # For Levenshtein distance, d(x,z) <= d(x,y) + d(y,z)
    triples = [("kitten", "sitting", "sittin"), ("abc", "abd", "ab"), ("a" * 100, "a" * 99 + "b", "a" * 99 + "c")]
    for x, y, z in triples:
        codeflash_output = levenshtein_distance(x, z)
        d_xz = codeflash_output  # 557μs -> 537μs (3.89% faster)
        codeflash_output = levenshtein_distance(x, y)
        d_xy = codeflash_output  # 553μs -> 532μs (3.98% faster)
        codeflash_output = levenshtein_distance(y, z)
        d_yz = codeflash_output


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from __future__ import annotations

# imports
import pytest  # used for our unit tests

from codeflash.discovery.functions_to_optimize import levenshtein_distance

# unit tests


# 1. Basic Test Cases
def test_identical_strings():
    # Identical strings should have distance 0
    codeflash_output = levenshtein_distance("kitten", "kitten")  # 14.4μs -> 9.29μs (55.1% faster)
    codeflash_output = levenshtein_distance("", "")  # 611ns -> 521ns (17.3% faster)
    codeflash_output = levenshtein_distance("a", "a")  # 2.03μs -> 1.98μs (2.52% faster)


def test_single_insertion():
    # One insertion required
    codeflash_output = levenshtein_distance("kitten", "kittena")  # 16.1μs -> 9.74μs (65.7% faster)
    codeflash_output = levenshtein_distance("abc", "abcd")  # 5.73μs -> 3.86μs (48.6% faster)


def test_single_deletion():
    # One deletion required
    codeflash_output = levenshtein_distance("kitten", "kittn")  # 12.9μs -> 8.69μs (49.0% faster)
    codeflash_output = levenshtein_distance("abcd", "abc")  # 5.71μs -> 4.03μs (41.8% faster)


def test_single_substitution():
    # One substitution required
    codeflash_output = levenshtein_distance("kitten", "kittan")  # 14.5μs -> 9.22μs (57.3% faster)
    codeflash_output = levenshtein_distance("abc", "adc")  # 4.67μs -> 3.47μs (34.7% faster)


def test_multiple_operations():
    # Multiple operations needed
    codeflash_output = levenshtein_distance("kitten", "sitting")  # 16.6μs -> 10.1μs (65.1% faster)
    codeflash_output = levenshtein_distance("flaw", "lawn")  # 6.70μs -> 4.50μs (49.0% faster)
    codeflash_output = levenshtein_distance("gumbo", "gambol")  # 10.7μs -> 6.22μs (72.6% faster)


def test_case_sensitivity():
    # Should be case-sensitive
    codeflash_output = levenshtein_distance("a", "A")  # 4.12μs -> 3.55μs (16.1% faster)
    codeflash_output = levenshtein_distance("Python", "python")  # 13.1μs -> 7.71μs (69.8% faster)


def test_completely_different_strings():
    # All characters different
    codeflash_output = levenshtein_distance("abc", "xyz")  # 7.57μs -> 5.60μs (35.2% faster)
    codeflash_output = levenshtein_distance("aaa", "bbb")  # 4.95μs -> 3.26μs (52.0% faster)


# 2. Edge Test Cases


def test_empty_strings():
    # One or both strings empty
    codeflash_output = levenshtein_distance("", "abc")  # 822ns -> 751ns (9.45% faster)
    codeflash_output = levenshtein_distance("abc", "")  # 441ns -> 460ns (4.13% slower)
    codeflash_output = levenshtein_distance("", "")  # 290ns -> 321ns (9.66% slower)


def test_one_character_strings():
    # Single character to/from empty or another char
    codeflash_output = levenshtein_distance("a", "")  # 742ns -> 771ns (3.76% slower)
    codeflash_output = levenshtein_distance("", "a")  # 431ns -> 411ns (4.87% faster)
    codeflash_output = levenshtein_distance("a", "b")  # 3.80μs -> 3.29μs (15.5% faster)


def test_unicode_strings():
    # Unicode and multi-byte characters
    codeflash_output = levenshtein_distance("café", "cafe")  # 9.28μs -> 6.86μs (35.2% faster)
    codeflash_output = levenshtein_distance("你好", "你们好")  # 4.51μs -> 3.69μs (22.3% faster)
    codeflash_output = levenshtein_distance("🙂", "🙃")  # 2.33μs -> 2.08μs (12.0% faster)
    codeflash_output = levenshtein_distance("a🙂b", "a🙃b")  # 4.81μs -> 3.54μs (36.0% faster)


def test_whitespace_and_special_chars():
    # Strings with whitespace and special characters
    codeflash_output = levenshtein_distance("a b", "ab")  # 6.26μs -> 5.17μs (21.1% faster)
    codeflash_output = levenshtein_distance("a_b", "a-b")  # 5.12μs -> 3.48μs (47.3% faster)
    codeflash_output = levenshtein_distance("hello!", "hello")  # 10.1μs -> 5.99μs (68.2% faster)


def test_long_repeated_chars():
    # Strings with repeated characters
    codeflash_output = levenshtein_distance("aaaaa", "aaaa")  # 5.47μs -> 5.39μs (1.48% faster)
    codeflash_output = levenshtein_distance("aaaaa", "bbbbb")  # 10.9μs -> 6.39μs (71.0% faster)


def test_palindromes_and_reverses():
    # Palindrome and reversed strings
    codeflash_output = levenshtein_distance("abcde", "edcba")  # 11.9μs -> 7.68μs (54.8% faster)


def test_large_difference_in_length():
    # One string much longer than the other
    codeflash_output = levenshtein_distance("a", "a" * 100)  # 25.4μs -> 25.7μs (1.09% slower)
    codeflash_output = levenshtein_distance("b" * 100, "b")  # 23.3μs -> 23.4μs (0.474% slower)


def test_strings_with_numbers():
    # Strings with numbers
    codeflash_output = levenshtein_distance("abc123", "abc124")  # 14.5μs -> 9.02μs (60.9% faster)
    codeflash_output = levenshtein_distance("12345", "54321")  # 9.13μs -> 5.82μs (56.8% faster)


# 3. Large Scale Test Cases


def test_large_identical_strings():
    # Large identical strings should have distance 0
    s = "a" * 500
    codeflash_output = levenshtein_distance(s, s)  # 13.9ms -> 13.5ms (2.37% faster)


def test_large_one_insertion():
    # Large string with one insertion
    s1 = "a" * 499
    s2 = "a" * 250 + "b" + "a" * 249
    codeflash_output = levenshtein_distance(s1, s2)  # 13.8ms -> 13.6ms (1.61% faster)


def test_large_one_deletion():
    # Large string with one deletion
    s1 = "a" * 500
    s2 = "a" * 499
    codeflash_output = levenshtein_distance(s1, s2)  # 13.7ms -> 13.5ms (1.69% faster)


def test_large_one_substitution():
    # Large string with one substitution in the middle
    s1 = "a" * 250 + "b" + "a" * 249
    s2 = "a" * 500
    codeflash_output = levenshtein_distance(s1, s2)  # 13.9ms -> 13.5ms (2.27% faster)


def test_large_completely_different():
    # Large strings, all characters different
    s1 = "a" * 500
    s2 = "b" * 500
    codeflash_output = levenshtein_distance(s1, s2)  # 67.2ms -> 30.7ms (119% faster)


def test_large_partial_overlap():
    # Large strings with partial overlap
    s1 = "a" * 250 + "b" * 250
    s2 = "a" * 200 + "b" * 300
    # 50 a's replaced with b's
    codeflash_output = levenshtein_distance(s1, s2)  # 41.7ms -> 21.7ms (92.6% faster)


def test_large_strings_with_unicode():
    # Large strings with unicode characters
    s1 = "é" * 500
    s2 = "e" * 500
    codeflash_output = levenshtein_distance(s1, s2)  # 67.2ms -> 30.4ms (121% faster)


def test_large_strings_with_alternating_chars():
    # Alternating characters
    s1 = "ab" * 250
    s2 = "ba" * 250
    # Each position is different except for the middle if even length
    codeflash_output = levenshtein_distance(s1, s2)  # 41.5ms -> 21.5ms (92.9% faster)


# 4. Additional Edge Cases


def test_nonequivalent_lengths_and_content():
    # Both length and content differ
    codeflash_output = levenshtein_distance("abcdefg", "xyz")  # 12.9μs -> 8.40μs (53.8% faster)


def test_substring():
    # One string is a substring of the other
    codeflash_output = levenshtein_distance("abcdef", "abc")  # 9.93μs -> 7.42μs (33.7% faster)
    codeflash_output = levenshtein_distance("abc", "abcdef")  # 7.66μs -> 4.98μs (53.7% faster)


def test_strings_with_tabs_and_newlines():
    # Special whitespace characters
    codeflash_output = levenshtein_distance("abc\tdef", "abcdef")  # 16.8μs -> 10.3μs (62.8% faster)
    codeflash_output = levenshtein_distance("abc\ndef", "abcdef")  # 13.7μs -> 7.80μs (76.0% faster)


def test_zero_length_and_long_string():
    # One empty, one long
    codeflash_output = levenshtein_distance("", "a" * 999)  # 912ns -> 811ns (12.5% faster)
    codeflash_output = levenshtein_distance("b" * 999, "")  # 631ns -> 541ns (16.6% faster)


# 5. Determinism and Symmetry


@pytest.mark.parametrize(
    "s1,s2",
    [
        ("kitten", "sitting"),
        ("flaw", "lawn"),
        ("", "abc"),
        ("abc", ""),
        ("abc", "cba"),
        ("abc", "abc"),
        ("", ""),
        ("a", "b"),
        ("abc123", "abc124"),
        ("a" * 500, "a" * 500),
    ],
)
def test_symmetry(s1, s2):
    # Levenshtein distance is symmetric
    codeflash_output = levenshtein_distance(s1, s2)  # 13.8ms -> 13.5ms (1.90% faster)


# 6. Type robustness


def test_non_string_inputs():
    # Should raise TypeError if input is not string
    with pytest.raises(TypeError):
        levenshtein_distance(123, "abc")
    with pytest.raises(TypeError):
        levenshtein_distance("abc", None)
    with pytest.raises(TypeError):
        levenshtein_distance(["a", "b"], "ab")
    with pytest.raises(TypeError):
        levenshtein_distance("ab", ["a", "b"])


# 7. Stress test: Large but feasible within constraints


def test_large_strings_max_size():
    # Both strings at the upper limit (1000 chars)
    s1 = "a" * 1000
    s2 = "b" * 1000
    codeflash_output = levenshtein_distance(s1, s2)  # 272ms -> 130ms (109% faster)


def test_large_strings_one_char_difference():
    # 999 identical, 1 different
    s1 = "a" * 999 + "b"
    s2 = "a" * 1000
    codeflash_output = levenshtein_distance(s1, s2)  # 58.4ms -> 57.5ms (1.56% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To test or edit this optimization locally, run: git merge codeflash/optimize-pr945-2025-11-27T14.39.26

Suggested change
x = prev[index1]
y = prev[index1 + 1]
z = curr[index1]
min_xy = min(x, y)
min_xyz = min(z, min_xy)
curr[index1 + 1] = 1 + min_xyz
# Avoid min() function call overhead by using direct comparisons
x = prev[index1]
y = prev[index1 + 1]
z = curr[index1]
if x < y:
    if x < z:
        curr[index1 + 1] = 1 + x
    else:
        curr[index1 + 1] = 1 + z
elif y < z:
    curr[index1 + 1] = 1 + y
else:
    curr[index1 + 1] = 1 + z


The optimized code achieves a **15% speedup** through several targeted micro-optimizations that reduce computational overhead in the parsing loop:

**Key Optimizations:**

1. **Single-pass boundary search**: Instead of checking both conditions (`start_line != -1 and end_line != -1`) on every iteration, the optimized version uses `None` values and breaks immediately when both markers are found, eliminating redundant condition checks.

2. **Fast-path string matching**: Before calling the expensive `.startswith("_______")` method, it first checks if `line[0] == "_"`, avoiding the method call for most lines that don't start with underscores.

3. **Method lookup optimization**: Pulls `current_failure_lines.append` into a local variable to avoid repeated attribute lookups in the hot loop where failure lines are processed.

4. **Memory-efficient list management**: Uses `current_failure_lines.clear()` instead of creating new list objects (`current_failure_lines = []`), reducing object allocation pressure.

**Performance Impact:**
The optimizations show the most significant gains in large-scale scenarios:
- **Large failure sets**: 14.2% faster with 500 failures, 14.0% faster with 999 failures  
- **Large output**: 29.2% faster for single failures with 1000 lines of output
- **Complex scenarios**: 22.3% faster with 50 cases having 10 lines each

**Hot Path Context:**
Based on the function reference, `parse_test_failures_from_stdout` is called from `parse_test_results`, which appears to be part of a test optimization pipeline. The function processes pytest stdout to extract failure information, making it performance-critical when dealing with large test suites or verbose test outputs. The 15% improvement becomes meaningful when processing hundreds of test failures in CI/CD environments or during iterative code optimization workflows.
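
A generic sketch of the micro-optimizations described above (illustrative only, not the actual patch; the function name and separator below are invented for the example):

def collect_failure_lines(lines: list[str]) -> list[str]:
    """Illustrates the fast-path check, bound append, and list reuse described above."""
    collected: list[str] = []
    current: list[str] = []
    append = current.append  # bind the method once instead of looking it up on every iteration
    for line in lines:
        # fast path: inspect the first character before the costlier startswith() call
        if line and line[0] == "_" and line.startswith("_______"):
            collected.extend(current)
            current.clear()  # reuse the same list object rather than allocating a new one
            continue
        append(line)
    collected.extend(current)
    return collected
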
@codeflash-ai
Contributor

codeflash-ai bot commented Nov 27, 2025

⚡️ Codeflash found optimizations for this PR

📄 16% (0.16x) speedup for parse_test_failures_from_stdout in codeflash/verification/parse_test_output.py

⏱️ Runtime: 2.76 milliseconds → 2.39 milliseconds (best of 250 runs)

A dependent PR with the suggested changes has been created. Please review:

If you approve, it will be merged into this PR (branch feat/feedback-loop-for-unmatched-test-results).


…25-11-27T14.49.01

⚡️ Speed up function `parse_test_failures_from_stdout` by 16% in PR #945 (`feat/feedback-loop-for-unmatched-test-results`)
@codeflash-ai
Contributor

codeflash-ai bot commented Nov 27, 2025

@codeflash-ai
Contributor

codeflash-ai bot commented Nov 27, 2025

⚡️ Codeflash found optimizations for this PR

📄 655% (6.55x) speedup for compare_test_results in codeflash/verification/equivalence.py

⏱️ Runtime: 90.0 milliseconds → 11.9 milliseconds (best of 5 runs)

A dependent PR with the suggested changes has been created. Please review:

If you approve, it will be merged into this PR (branch feat/feedback-loop-for-unmatched-test-results).


@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Codeflash Bot seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

console.rule()
return Failure("Test results did not match the test results of the original code.")

def repair_if_possible() -> None:

nitpick: not a big fan of nested functions

)

test_src_code = original_test_result.id.get_src_code(original_test_result.file_name)
test_diff = TestDiff(

we're constructing it now even when the test cases match, which would slow down the comparator; something to improve

Codeflash Bot and others added 5 commits December 10, 2025 18:50
The optimization achieves a **45% speedup** by restructuring how Pydantic model instances are created during markdown parsing. 

**Key Change**: Instead of creating an empty `CodeStringsMarkdown()` object and repeatedly appending to its `code_strings` list (which triggers Pydantic field validation on each append), the optimized version collects all code blocks into a plain Python list first, then creates the Pydantic model once with the complete list.

**Why This is Faster**: 
- **Reduced Pydantic overhead**: The original code performed O(n) Pydantic field validations as each `CodeString` was appended. The optimization reduces this to O(1) by doing a single model instantiation.
- **Fewer object mutations**: Plain list operations (`code_string_list.append()`) are significantly faster than mutating Pydantic model fields.
- **Profiler evidence**: The line creating `CodeStringsMarkdown()` dropped from 89.6% of function time (18.05ms) to 81% (8.45ms) - nearly a 2x improvement on the bottleneck line.

**Impact on Workloads**: This optimization is particularly effective for scenarios processing multiple markdown code blocks (as shown in test results where larger datasets see 46-47% improvements). Since `parse_markdown_code` is called in a tight loop within `_get_valid_candidates`, the per-call savings compound significantly when processing batches of optimization candidates.

**Test Case Performance**: The optimization shows consistent 25-47% improvements across various test scenarios, with the largest gains on tests with multiple candidates or code blocks, confirming the batching approach scales well.
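
A generic sketch of the batching pattern described above (illustrative only; the real CodeString and CodeStringsMarkdown models live in codeflash and may carry additional fields and validation):

from pydantic import BaseModel


class CodeString(BaseModel):
    code: str


class CodeStringsMarkdown(BaseModel):
    code_strings: list[CodeString] = []


def parse_blocks_slow(blocks: list[str]) -> CodeStringsMarkdown:
    # original pattern (per the explanation above): create an empty model and append to its field in a loop
    parsed = CodeStringsMarkdown()
    for block in blocks:
        parsed.code_strings.append(CodeString(code=block))
    return parsed


def parse_blocks_fast(blocks: list[str]) -> CodeStringsMarkdown:
    # optimized pattern: collect plain Python objects first, then construct the model once
    code_string_list = [CodeString(code=block) for block in blocks]
    return CodeStringsMarkdown(code_strings=code_string_list)
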
@codeflash-ai
Contributor

codeflash-ai bot commented Dec 11, 2025

⚡️ Codeflash found optimizations for this PR

📄 45% (0.45x) speedup for AiServiceClient._get_valid_candidates in codeflash/api/aiservice.py

⏱️ Runtime: 3.25 milliseconds → 2.24 milliseconds (best of 106 runs)

A dependent PR with the suggested changes has been created. Please review:

If you approve, it will be merged into this PR (branch feat/feedback-loop-for-unmatched-test-results).


…25-12-11T15.08.58

⚡️ Speed up method `AiServiceClient._get_valid_candidates` by 45% in PR #945 (`feat/feedback-loop-for-unmatched-test-results`)
@codeflash-ai
Contributor

codeflash-ai bot commented Dec 11, 2025

