Skip to content

[BUG] Remove hard coded variables and better class extraction from KB#11490

Merged
lucas-koontz merged 9 commits into
mainfrom
fix/update-rerank-config
Sep 18, 2025
Merged

[BUG] Remove hard coded variables and better class extraction from KB#11490
lucas-koontz merged 9 commits into
mainfrom
fix/update-rerank-config

Conversation

@lucas-koontz

Copy link
Copy Markdown
Contributor

Description

This PR adds configurability for reranker parameters and fixes a critical bug that occurred when using higher max_tokens values.

Key Changes:

  1. Made reranker parameters configurable: The variables n, logprobs, top_logprobs, and max_tokens can now be configured via:

    • Environment variables (MINDSDB_RERANKER_N, MINDSDB_RERANKER_LOGPROBS, etc.)
    • Configuration files (in default_reranking_model section)
    • Direct configuration through RerankerConfig class
  2. Fixed token extraction bug: Resolved an issue where increasing max_tokens beyond 4 would cause the reranker to incorrectly extract class probabilities from wrong tokens (quotes, newlines, end tokens) instead of the actual class number token.

Before: Parameters were hardcoded as n=1, logprobs=True, top_logprobs=4, max_tokens=3
After: Parameters are configurable with the same defaults, and the system correctly extracts class probabilities regardless of max_tokens value.

Type of change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ⚡ New feature (non-breaking change which adds functionality)
  • 📢 Breaking change (fix or feature that would cause existing functionality not to work as expected)
  • 📄 This change requires a documentation update

Verification Process

To ensure the changes are working as expected:

  • Test Location: MindsDB RAG reranker functionality (knowledge base queries with reranking enabled)
  • Verification Steps:
    1. Test configurability:
      export MINDSDB_RERANKER_MAX_TOKENS=100
      export MINDSDB_RERANKER_N=2
      export MINDSDB_RERANKER_TOP_LOGPROBS=8
    2. Test the bug fix: Run queries with max_tokens=100 (previously would return 0.0 relevance scores for all documents)
    3. Test backward compatibility: Ensure existing functionality works without any configuration changes
    4. Verify logging: Check that the new Reranker selected_class_token log shows the correct class number token being selected

Additional Media:

  • Debug logs were provided showing the before/after behavior:
    • max_tokens=4 (working): Correctly extracts from class number token
    • max_tokens=100 (was broken, now fixed): Previously extracted from wrong tokens, now correctly finds class number token

Checklist:

  • My code follows the style guidelines(PEP 8) of MindsDB.
  • I have appropriately commented on my code, especially in complex areas.
  • Necessary documentation updates are either made or tracked in issues.
  • Relevant unit and integration tests are updated or added.

Technical Details

Files Modified:

  • mindsdb/integrations/utilities/rag/settings.py: Added default constants and extended RerankerConfig
  • mindsdb/integrations/utilities/rag/rerankers/base_reranker.py: Added configurable parameters and fixed token extraction logic
  • mindsdb/utilities/config.py: Added environment variable support for reranker parameters

Configuration Methods:

# Environment variables
export MINDSDB_RERANKER_N=1
export MINDSDB_RERANKER_LOGPROBS=true  
export MINDSDB_RERANKER_TOP_LOGPROBS=4
export MINDSDB_RERANKER_MAX_TOKENS=3
# Direct configuration
config = RerankerConfig(
    n=2,
    logprobs=True, 
    top_logprobs=8,
    max_tokens=100
)
// Configuration file
{
  "default_reranking_model": {
    "n": 2,
    "logprobs": true,
    "top_logprobs": 8, 
    "max_tokens": 100
  }
}

Appendix: The Token Extraction Bug: Detailed Examples

Problem: Wrong Token Selection with Higher max_tokens

The issue was that the original code always took the last token to extract class probabilities, but with higher max_tokens values, the model generates additional formatting tokens after the class number.

Example 1: Working Case (max_tokens=4)

Token Sequence:

[0] token='class'  logprob=-0.64
[1] token='_'      logprob=-0.00003  
[2] token='1'      logprob=-0.078    ← CLASS NUMBER (what we want)
[3] token=''       logprob=-0.0017   ← END TOKEN (last token)

Old Logic:

final_token_logprob = token_logprobs[-1]  # Takes token[3] = '' (empty)
  • Result: Extracts from empty token, but still works because the sequence is short

Example 2: Broken Case (max_tokens=100)

Token Sequence from your logs:

[0] token='"'      logprob=-1.0
[1] token='class'  logprob=-0.016
[2] token='_'      logprob=-0.000025
[3] token='2'      logprob=-0.726    ← CLASS NUMBER (what we want!)  
[4] token='"'      logprob=-0.838    ← QUOTE
[5] token=''       logprob=-0.663    ← END TOKEN (last token)

Old Logic:

final_token_logprob = token_logprobs[-1]  # Takes token[5] = '' (empty)
top_logprobs = final_token_logprob.top_logprobs  # Wrong token!

What happened:

  • The system extracted probabilities from the empty end token instead of the class number token
  • End token's top_logprobs contained things like 'class_', 'class_ \n\n', 'class_ ('
  • These don't match the expected class patterns 'class_1', 'class_2', etc.
  • Result: All scores became 0.0 (not relevant)

Example 3: Real Log Analysis

Example:

Reranker response: '"class_2"'
Reranker token_logprobs: [..., token='"', ..., token='', ...]
Reranker class_probs: {'class_': 0.515..., 'class_ \n\n': 0.454..., 'class_ (': 0.010...}
Reranker rerank_data: 0.0  ← WRONG! Should be ~0.5 for class_2

Analysis:

  • Model correctly predicted class_2
  • But system extracted from wrong token containing 'class_', not 'class_2'
  • No match for 'class_2' in the extracted probabilities → score = 0.0

The Fix: Smart Token Selection

New Logic:

# Find the actual class number token
class_token_logprob = None
for token_logprob in token_logprobs:
    if token_logprob.token in ['1', '2', '3', '4']:  # Look for class numbers
        class_token_logprob = token_logprob
        break

Applied to Example 2:

# Searches through tokens:
# token[0] = '"' → not in ['1','2','3','4'] → skip
# token[1] = 'class' → not in ['1','2','3','4'] → skip  
# token[2] = '_' → not in ['1','2','3','4'] → skip
# token[3] = '2' → FOUND! This is our class number
class_token_logprob = token_logprobs[3]  # Correct token!

Before vs After Comparison

Scenario max_tokens=4 max_tokens=100
Before Fix ✅ Works (lucky) ❌ Broken (extracts from wrong token)
After Fix ✅ Works (finds '1','2','3','4') ✅ Works (finds '1','2','3','4')

Real Results from Your Logs

Before Fix (max_tokens=100):

Reranker class_probs: {'class_': 0.515, 'class_ \n\n': 0.454, 'class_ (': 0.010}
Reranker rerank_data: 0.0  ← All documents marked as irrelevant

After Fix (max_tokens=100):

Reranker selected_class_token: 2  ← New log showing correct token
Reranker class_probs: {'class_2': 0.484, 'class_4': 0.201, 'class_1': 0.157, 'class_3': 0.157}
Reranker rerank_data: 0.60  ← Correct relevance score!

Why This Bug Was So Sneaky

  1. Worked fine with small max_tokens because there were fewer "distractor" tokens
  2. Failed silently - no errors thrown, just wrong scores
  3. Model was actually correct - the bug was in post-processing the response
  4. Affected all queries when max_tokens was increased, making everything seem "not relevant"

The fix ensures we always find the right token regardless of how many additional formatting tokens the model generates! 🎯

This comment was marked as outdated.

@entelligence-ai-pr-reviews

Copy link
Copy Markdown
Contributor

🔒 Entelligence AI Vulnerability Scanner

No security vulnerabilities found!

Your code passed our comprehensive security analysis.

📊 Files Analyzed: 3 files


@entelligence-ai-pr-reviews

Copy link
Copy Markdown
Contributor

Review Summary

🏷️ Draft Comments (2)

Skipped posting 2 draft comments that were valid but scored below your review threshold (>=13/15). Feel free to update them here.

mindsdb/integrations/utilities/rag/rerankers/base_reranker.py (1)

338-346: class_token_logprob fallback logic can select a non-class token (e.g., a quote or empty string), causing top_logprobs to lack class tokens and resulting in all-zero class probabilities and a relevance score of 0.0.

📊 Impact Scores:

  • Production Impact: 4/5
  • Fix Specificity: 5/5
  • Urgency Impact: 3/5
  • Total Score: 12/15

🤖 AI Agent Prompt (Copy & Paste Ready):

In mindsdb/integrations/utilities/rag/rerankers/base_reranker.py, lines 338-346, the fallback logic for selecting `class_token_logprob` can select a non-class token (such as a quote or empty string), which causes `top_logprobs` to lack class tokens and results in all-zero class probabilities and a relevance score of 0.0. Update the fallback so that if no class number token ('1', '2', '3', '4') is found, do NOT use the last token as fallback. Instead, immediately return a rerank_data with a relevance_score of 0.0. This prevents incorrect scoring when the model output does not contain a valid class token.

mindsdb/integrations/utilities/rag/settings.py (1)

735-738: DEFAULT_RERANKER_N, DEFAULT_RERANKER_LOGPROBS, DEFAULT_RERANKER_TOP_LOGPROBS, and DEFAULT_RERANKER_MAX_TOKENS are defined as module-level constants but are not validated in RerankerConfig, allowing invalid types/values that could cause runtime errors in downstream reranker logic.

📊 Impact Scores:

  • Production Impact: 4/5
  • Fix Specificity: 5/5
  • Urgency Impact: 3/5
  • Total Score: 12/15

🤖 AI Agent Prompt (Copy & Paste Ready):

In mindsdb/integrations/utilities/rag/settings.py, lines 735-738, the fields `n`, `logprobs`, `top_logprobs`, and `max_tokens` in `RerankerConfig` are assigned default values but lack type/value validation. This can allow invalid values (e.g., negative or zero) that may cause runtime errors in reranker logic. Please update these fields to use `Field` with appropriate type and value constraints (e.g., `ge=1` for positive integers) and add descriptions for each field.

@entelligence-ai-pr-reviews

Copy link
Copy Markdown
Contributor

Review Summary

🏷️ Draft Comments (2)

Skipped posting 2 draft comments that were valid but scored below your review threshold (>=13/15). Feel free to update them here.

mindsdb/integrations/utilities/rag/rerankers/base_reranker.py (2)

237-313: search_relevancy_score always uses hardcoded values for n, logprobs, top_logprobs, and max_tokens, ignoring user/config/environment overrides, so reranker config is silently ignored and runtime behavior is incorrect.

📊 Impact Scores:

  • Production Impact: 2/5
  • Fix Specificity: 3/5
  • Urgency Impact: 2/5
  • Total Score: 7/15

🤖 AI Agent Prompt (Copy & Paste Ready):

In mindsdb/integrations/utilities/rag/rerankers/base_reranker.py, lines 237-313, the `search_relevancy_score` method hardcodes `n=1`, `logprobs=True`, `top_logprobs=4`, and `max_tokens=3` in the OpenAI API call, ignoring any configuration or environment overrides. Update these parameters to use the instance's config values (e.g., `self.n`, `self.logprobs`, etc.) with fallback to the appropriate defaults. Ensure the reranker respects user/config/environment settings for these parameters.

110-147: _rank processes query-document pairs in batches but always gathers all results before early stopping, causing unnecessary LLM calls and wasted compute for large datasets.

📊 Impact Scores:

  • Production Impact: 2/5
  • Fix Specificity: 3/5
  • Urgency Impact: 2/5
  • Total Score: 7/15

🤖 AI Agent Prompt (Copy & Paste Ready):

In mindsdb/integrations/utilities/rag/rerankers/base_reranker.py, lines 110-147, the `_rank` method processes query-document pairs in batches but always gathers all results for each batch before checking for early stopping. This causes unnecessary LLM calls and wasted compute for large datasets. Refactor the code so that each document in the batch is processed sequentially, checking for early stopping after each result, and stop further processing as soon as the early stopping condition is met. Preserve error handling and logging.

@entelligence-ai-pr-reviews

Copy link
Copy Markdown
Contributor

Review Summary

🏷️ Draft Comments (1)

Skipped posting 1 draft comments that were valid but scored below your review threshold (>=13/15). Feel free to update them here.

mindsdb/integrations/utilities/rag/rerankers/base_reranker.py (1)

318-324: In search_relevancy_score, the code reconstructs class probabilities by iterating over top_logprobs of only a single token, which can miss relevant class tokens if the model outputs multiple class tokens or formatting tokens, leading to incorrect or zero scores for large max_tokens values.

📊 Impact Scores:

  • Production Impact: 4/5
  • Fix Specificity: 3/5
  • Urgency Impact: 3/5
  • Total Score: 10/15

🤖 AI Agent Prompt (Copy & Paste Ready):

In mindsdb/integrations/utilities/rag/rerankers/base_reranker.py, lines 318-324, the current code only reconstructs class probabilities from the top_logprobs of the last token, which can miss relevant class tokens if the model outputs multiple class tokens or formatting tokens (especially with large max_tokens). Refactor this section to aggregate top_logprobs from all tokens that could represent a class number ("1", "2", "3", "4"), and build the class_probs dictionary accordingly. This will ensure correct scoring regardless of output length.

This comment was marked as outdated.

@entelligence-ai-pr-reviews

Copy link
Copy Markdown
Contributor

Review Summary

This comment was marked as outdated.

@entelligence-ai-pr-reviews

Copy link
Copy Markdown
Contributor

Review Summary

🏷️ Draft Comments (1)

Skipped posting 1 draft comments that were valid but scored below your review threshold (>=13/15). Feel free to update them here.

mindsdb/utilities/render/sqlalchemy_render.py (1)

517-652: prepare_select builds the entire SQLAlchemy query AST recursively for all subqueries, joins, and expressions, which can cause substantial CPU and memory usage for deeply nested or very large queries.

📊 Impact Scores:

  • Production Impact: 3/5
  • Fix Specificity: 2/5
  • Urgency Impact: 2/5
  • Total Score: 7/15

🤖 AI Agent Prompt (Copy & Paste Ready):

In mindsdb/utilities/render/sqlalchemy_render.py, lines 517-652, the `prepare_select` method recursively constructs the full SQLAlchemy AST for all subqueries, joins, and expressions. For very large or deeply nested queries, this can cause significant CPU and memory usage, potentially leading to performance bottlenecks or stack overflows. Consider introducing iterative processing for join trees, limiting recursion depth, or streaming query construction for extremely large queries to improve scalability and resource usage.

Increases the `DEFAULT_RERANKER_MAX_TOKENS` from 3 to 100.
This allows the re-ranker to process longer text snippets
for improved accuracy.
@entelligence-ai-pr-reviews

Copy link
Copy Markdown
Contributor

Review Summary

🔍 Comments beyond diff scope (1)
mindsdb/integrations/utilities/rag/settings.py (1)

0817-0817: id_key is defined as int but DEFAULT_ID_KEY is a string; this causes runtime type errors when initializing RAGPipelineModel.
Category: correctness


dusvyat
dusvyat previously approved these changes Sep 10, 2025
@entelligence-ai-pr-reviews

Copy link
Copy Markdown
Contributor

Review Summary

🏷️ Draft Comments (3)

Skipped posting 3 draft comments that were valid but scored below your review threshold (>=13/15). Feel free to update them here.

mindsdb/__main__.py (1)

157-165: clean_mindsdb_tmp_dir iterates and deletes all files/dirs in the temp dir at exit, which can cause major performance issues if the directory contains a large number of files (O(n) deletions on shutdown).

📊 Impact Scores:

  • Production Impact: 2/5
  • Fix Specificity: 5/5
  • Urgency Impact: 2/5
  • Total Score: 9/15

🤖 AI Agent Prompt (Copy & Paste Ready):

In mindsdb/__main__.py, lines 157-165, the function `clean_mindsdb_tmp_dir` iterates and deletes all files/directories in the temp dir at exit, which can cause major performance issues if the directory contains many files (O(n) deletions on shutdown). Replace the loop with a single `shutil.rmtree(temp_dir, ignore_errors=True)` followed by `temp_dir.mkdir(exist_ok=True)` to efficiently clear the directory.

mindsdb/api/http/namespaces/config.py (1)

38-41: api_configs = copy.deepcopy(config["api"]) will raise a KeyError if the 'api' key is missing from the config, causing a 500 error on GET /config.

📊 Impact Scores:

  • Production Impact: 4/5
  • Fix Specificity: 5/5
  • Urgency Impact: 3/5
  • Total Score: 12/15

🤖 AI Agent Prompt (Copy & Paste Ready):

In mindsdb/api/http/namespaces/config.py, lines 38-41, the code assumes the 'api' key exists in the config, which can cause a KeyError if it is missing. Change 'config["api"]' to 'config.get("api", {})' to prevent runtime errors when the key is absent.

mindsdb/utilities/api_status.py (1)

40-58: set_api_status does not handle concurrent writes from multiple processes, which can cause data loss or corruption if two processes write at the same time.

📊 Impact Scores:

  • Production Impact: 4/5
  • Fix Specificity: 3/5
  • Urgency Impact: 3/5
  • Total Score: 10/15

🤖 AI Agent Prompt (Copy & Paste Ready):

In mindsdb/utilities/api_status.py, lines 40-58, the current implementation of set_api_status does not prevent concurrent writes, which can cause data loss if two processes write at the same time. Update set_api_status to use file locking (e.g., fcntl.flock) to ensure only one process can write to the status file at a time. The fix should lock the file, reload the latest status, update it, and then write atomically.

🔍 Comments beyond diff scope (1)
mindsdb/api/http/namespaces/config.py (1)

126-130: file.save(file_path) in the integration upload logic allows attackers to upload files with arbitrary filenames, potentially leading to path traversal and overwriting sensitive files if the filename is crafted maliciously.
Category: security


Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@entelligence-ai-pr-reviews

Copy link
Copy Markdown
Contributor

Review Summary

🏷️ Draft Comments (1)

Skipped posting 1 draft comments that were valid but scored below your review threshold (>=13/15). Feel free to update them here.

mindsdb/integrations/libs/vectordatabase_handler.py (1)

324-329: do_upsert performs a full select of existing IDs for every upsert, which can be highly inefficient for large batches, causing O(n) database roundtrips and memory usage.

📊 Impact Scores:

  • Production Impact: 2/5
  • Fix Specificity: 2/5
  • Urgency Impact: 2/5
  • Total Score: 6/15

🤖 AI Agent Prompt (Copy & Paste Ready):

Optimize the upsert logic in mindsdb/integrations/libs/vectordatabase_handler.py, lines 324-329. The current code performs a full select of all possible IDs for every upsert, which is inefficient for large batches. Refactor to first use the new check_existing_ids method to get only the existing IDs, then only select metadata for those IDs. This reduces database load and memory usage for large upserts.

@entelligence-ai-pr-reviews

Copy link
Copy Markdown
Contributor

Review Summary

🏷️ Draft Comments (2)

Skipped posting 2 draft comments that were valid but scored below your review threshold (>=13/15). Feel free to update them here.

.github/workflows/build_deploy_prod.yml (1)

79-79: Switching from a local to a remote docker-bake action (mindsdb/github-actions/docker-bake@main) may introduce longer cold start times and network dependency, potentially slowing down builds if the remote action is unavailable or slow.

📊 Impact Scores:

  • Production Impact: 2/5
  • Fix Specificity: 2/5
  • Urgency Impact: 2/5
  • Total Score: 6/15

🤖 AI Agent Prompt (Copy & Paste Ready):

Evaluate the performance impact of switching from a local to a remote `docker-bake` action at line 79 in `.github/workflows/build_deploy_prod.yml`. If build times increase due to network dependency or remote action cold starts, consider caching or pinning a specific version to mitigate delays.

mindsdb/utilities/render/sqlalchemy_render.py (1)

172-177: prepare_select join logic uses a linear search and repeated list appends for aliases, which can cause O(n^2) performance for large/complex queries with many aliases.

📊 Impact Scores:

  • Production Impact: 3/5
  • Fix Specificity: 2/5
  • Urgency Impact: 2/5
  • Total Score: 7/15

🤖 AI Agent Prompt (Copy & Paste Ready):

In mindsdb/utilities/render/sqlalchemy_render.py, lines 172-177, the alias generation logic uses a list and linear search for uniqueness, which can cause O(n^2) performance for large queries. Refactor this to use a set for alias existence checks to ensure O(1) lookup and maintain append order.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants