Skip to content

feat: improve maintainers detection [CM-1033]#3908

Draft
mbani01 wants to merge 4 commits intomainfrom
feat/improve_maintainer_file_detection
Draft

feat: improve maintainers detection [CM-1033]#3908
mbani01 wants to merge 4 commits intomainfrom
feat/improve_maintainer_file_detection

Conversation

@mbani01
Copy link
Contributor

@mbani01 mbani01 commented Mar 10, 2026

This pull request significantly improves the maintainer file detection and analysis logic for the git integration service. The main enhancements include a more robust and flexible approach to identifying maintainer-related files using both static lists and dynamic content-based search (via ripgrep), as well as improved handling and reporting of candidate files and AI-suggested files. Additionally, the changes introduce better error handling and metadata tracking throughout the maintainer extraction process.

Maintainer file detection and analysis improvements:

  • Added dynamic maintainer file discovery using ripgrep to search for maintainer-related keywords across the repository, with fallback to os.walk if ripgrep fails. This enables more accurate and comprehensive detection of potential maintainer files.
  • Expanded and refined the static list of possible maintainer file names and valid file extensions, and introduced content validation keywords and exclusion rules to improve candidate file filtering.
  • Implemented a new extraction flow that prioritizes previously saved maintainer files, then tries static and dynamic candidates, and finally falls back to AI-based file detection as a last resort. The process now tracks and returns candidate files and AI-suggested files as metadata. [1] [2]

Metadata and reporting enhancements:

  • Updated the MaintainerResult model to include candidate_files and ai_suggested_file fields, allowing downstream consumers to see which files were considered and which file was suggested by AI.
  • Extended the maintainer processing pipeline to record and report candidate files and AI-suggested files in service execution logs and results. [1] [2] [3]

Dependency and utility updates:

  • Added the ripgrep tool to the Docker image to support fast and efficient content-based file search.
  • Updated imports and utility usage to support the new shell command execution and error handling required by the ripgrep-based search. [1] [2]

These changes collectively make maintainer detection more reliable, transparent, and extensible, while also improving observability for debugging and future enhancements.


Note

Medium Risk
Modifies the maintainer detection/analysis flow and introduces repo-wide rg searches plus new fallback paths, which could affect accuracy, performance, and execution behavior across many repos.

Overview
Improves maintainer extraction by prioritizing the previously saved maintainer file, then scanning the repo for likely governance files (expanded static filename list + ripgrep keyword search with filtering/scoring), and only falling back to AI filename selection as a last resort.

Adds observability by returning/storing candidate_files and ai_suggested_file on MaintainerResult and persisting them in ServiceExecution.metrics, and updates the git-integration Docker runner image to include ripgrep.

Written by Cursor Bugbot for commit cc717d8. This will update automatically on new commits. Configure here.

@mbani01 mbani01 self-assigned this Mar 10, 2026
Copilot AI review requested due to automatic review settings March 10, 2026 15:41
@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves maintainer file detection in the git integration service by adding a multi-step discovery and analysis flow that combines static filename matching, dynamic ripgrep-based content search, and an AI fallback, while also surfacing more metadata about what was tried.

Changes:

  • Added ripgrep-based repo scanning (rg --files and keyword search) with fallback to os.walk, plus scoring/filtering of dynamic candidates.
  • Refactored maintainer extraction to prioritize a previously saved maintainer file, then analyze top candidates, then use AI file suggestion as a last resort.
  • Extended MaintainerResult and service execution metrics to include candidate_files and ai_suggested_file; added ripgrep to the Docker image.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
services/apps/git_integration/src/crowdgit/services/maintainer/maintainer_service.py New candidate discovery + fallback extraction flow; logs and metrics now include candidate/AI-suggested file metadata.
services/apps/git_integration/src/crowdgit/models/maintainer_info.py Adds new result metadata fields (candidate_files, ai_suggested_file).
scripts/services/docker/Dockerfile.git_integration Installs ripgrep in the runner image to support dynamic search.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +37 to +38
candidate_files: list[str] = []
ai_suggested_file: str | None = None
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

candidate_files uses a mutable default ([]) on a Pydantic model, which can be shared across instances and lead to cross-request/state leakage. Use Field(default_factory=list) (and import Field) instead of a bare list default.

Copilot uses AI. Check for mistakes.
Comment on lines +423 to +425
except CommandExecutionError:
self.logger.info("Ripgrep found no files containing maintainer keywords")
return []
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

run_shell_command raises CommandExecutionError for any non-zero exit code. In _ripgrep_search, catching CommandExecutionError and assuming it means “no matches” will also silently swallow real ripgrep failures (e.g., invalid regex, permission errors), causing false “no candidate files” behavior. Consider updating run_shell_command to expose returncode, or adding an option to treat specific exit codes (rg uses 1 for no matches) as success while still surfacing actual errors.

Suggested change
except CommandExecutionError:
self.logger.info("Ripgrep found no files containing maintainer keywords")
return []
except CommandExecutionError as e:
# ripgrep uses exit code 1 to mean "no matches"; other non-zero codes indicate errors.
returncode = getattr(e, "returncode", None)
if returncode == 1:
self.logger.info("Ripgrep found no files containing maintainer keywords")
return []
self.logger.warning(
f"Ripgrep search failed with non-zero exit code {returncode}: {repr(e)}"
)
# Re-raise so that real ripgrep failures are not silently treated as 'no matches'.
raise

Copilot uses AI. Check for mistakes.
Comment on lines +559 to 606
async def extract_maintainers(
self,
repo_path: str,
saved_maintainer_file: str | None = None,
):
total_cost = 0
candidate_files: list[str] = []
ai_suggested_file: str | None = None

def _attach_metadata(result: MaintainerResult) -> MaintainerResult:
result.total_cost = total_cost
result.candidate_files = candidate_files
result.ai_suggested_file = ai_suggested_file
return result

# Step 1: Try the previously saved maintainer file
if saved_maintainer_file:
self.logger.info(f"Trying saved maintainer file: {saved_maintainer_file}")
result, cost = await self.try_saved_maintainer_file(repo_path, saved_maintainer_file)
total_cost += cost
if result:
return _attach_metadata(result)
self.logger.info("Falling back to maintainer file detection")

# Step 2: Find candidates via static list + ripgrep dynamic search
candidates = await self.find_candidate_files(repo_path)
candidate_files = [path for path, _ in candidates]

# Step 3: Try AI analysis on candidates, stop on first success
if candidates:
attempts = min(len(candidates), self.MAX_AI_ANALYSIS_ATTEMPTS)
for filename, content in candidates[:attempts]:
try:
result = await self.analyze_and_build_result(filename, content)
total_cost += result.total_cost
return _attach_metadata(result)
except MaintanerAnalysisError as e:
total_cost += e.ai_cost
self.logger.warning(f"AI analysis failed for '{filename}': {e.error_message}")
except Exception as e:
self.logger.warning(f"Unexpected error analyzing '{filename}': {repr(e)}")

self.logger.warning(
f"AI analysis failed for all {attempts} candidate(s), trying AI file detection"
)
else:
self.logger.warning("No candidate files found via search, trying AI file detection")

Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR introduces a new multi-step maintainer extraction flow (saved-file retry, static+rg candidates, bounded AI analysis attempts, AI fallback) but there are currently no tests covering MaintainerService behavior. Adding unit tests around candidate discovery/filtering and the fallback ordering would help prevent regressions (especially around rg failures and the MAX_AI_ANALYSIS_ATTEMPTS cutoff).

Copilot uses AI. Check for mistakes.
Comment on lines +474 to 497
dynamic_paths = await self._ripgrep_search(repo_path)

scored_dynamic = []
for candidate_path in dynamic_paths:
if candidate_path.lower() in static_paths_lower:
continue

file_path = os.path.join(repo_path, candidate_path)
try:
async with aiofiles.open(file_path, "r", encoding="utf-8") as f:
content = await f.read()
except Exception as e:
self.logger.warning(f"Failed to read dynamic match {candidate_path}: {repr(e)}")
continue

if file.lower() == "readme.md" and "maintainer" not in content.lower():
self.logger.info(f"Skipping {file}: no maintainer-related content found")
continue

return file, base64.b64encode(content.encode()).decode(), 0

self.logger.warning("No maintainer files found using the known file names.")
content_lower = content.lower()
# Calculate score based on keywords matched in the content
score = sum(1 for kw in self.CONTENT_VALIDATION_KEYWORDS if kw in content_lower)
if score > 0:
scored_dynamic.append((candidate_path, content, score))
self.logger.info(
f"Dynamic match validated: {candidate_path} (keyword score: {score})"
)

Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

find_candidate_files reads the full contents of every rg -l match to compute a keyword score. With broad keywords like owner, ripgrep can return a large number of files, making this O(num_matches * file_size) and potentially very slow/memory-heavy on big repos even though only the first few candidates are later analyzed. Consider capping the number of dynamic matches to score, reading only a small prefix for scoring, or using ripgrep match counts (e.g., --count-matches/--json) to avoid loading whole files.

Copilot uses AI. Check for mistakes.
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

)
except CommandExecutionError:
self.logger.info("Ripgrep found no files containing maintainer keywords")
return []
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ripgrep error indistinguishable from no-matches result

Low Severity

_ripgrep_search catches CommandExecutionError and logs "Ripgrep found no files containing maintainer keywords", but run_shell_command raises the same CommandExecutionError for all non-zero exit codes. Ripgrep uses exit code 1 for "no matches" and exit code 2 for actual errors (e.g., invalid regex, permission issues). Genuine ripgrep failures are silently treated as "no matches found" with a misleading info-level log, instead of being logged as warnings, which could hinder debugging.

Fix in Cursor Fix in Web

for filename in filenames:
full_path = os.path.join(dirpath, filename)
results.append(os.path.relpath(full_path, repo_path))
return results
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking os.walk call in async method

Low Severity

The _list_repo_files fallback uses synchronous os.walk inside an async method. For large repositories with deep directory trees, this blocks the event loop, potentially stalling all concurrent async tasks. The rest of the codebase uses aiofiles.os for async filesystem operations.

Fix in Cursor Fix in Web

@mbani01 mbani01 requested a review from joanagmaia March 10, 2026 16:42
@mbani01 mbani01 marked this pull request as draft March 10, 2026 17:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants