feat: improve maintainers detection [CM-1033]#3908
Conversation
|
|
There was a problem hiding this comment.
Pull request overview
This PR improves maintainer file detection in the git integration service by adding a multi-step discovery and analysis flow that combines static filename matching, dynamic ripgrep-based content search, and an AI fallback, while also surfacing more metadata about what was tried.
Changes:
- Added ripgrep-based repo scanning (
rg --filesand keyword search) with fallback toos.walk, plus scoring/filtering of dynamic candidates. - Refactored maintainer extraction to prioritize a previously saved maintainer file, then analyze top candidates, then use AI file suggestion as a last resort.
- Extended
MaintainerResultand service execution metrics to includecandidate_filesandai_suggested_file; addedripgrepto the Docker image.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| services/apps/git_integration/src/crowdgit/services/maintainer/maintainer_service.py | New candidate discovery + fallback extraction flow; logs and metrics now include candidate/AI-suggested file metadata. |
| services/apps/git_integration/src/crowdgit/models/maintainer_info.py | Adds new result metadata fields (candidate_files, ai_suggested_file). |
| scripts/services/docker/Dockerfile.git_integration | Installs ripgrep in the runner image to support dynamic search. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| candidate_files: list[str] = [] | ||
| ai_suggested_file: str | None = None |
There was a problem hiding this comment.
candidate_files uses a mutable default ([]) on a Pydantic model, which can be shared across instances and lead to cross-request/state leakage. Use Field(default_factory=list) (and import Field) instead of a bare list default.
| except CommandExecutionError: | ||
| self.logger.info("Ripgrep found no files containing maintainer keywords") | ||
| return [] |
There was a problem hiding this comment.
run_shell_command raises CommandExecutionError for any non-zero exit code. In _ripgrep_search, catching CommandExecutionError and assuming it means “no matches” will also silently swallow real ripgrep failures (e.g., invalid regex, permission errors), causing false “no candidate files” behavior. Consider updating run_shell_command to expose returncode, or adding an option to treat specific exit codes (rg uses 1 for no matches) as success while still surfacing actual errors.
| except CommandExecutionError: | |
| self.logger.info("Ripgrep found no files containing maintainer keywords") | |
| return [] | |
| except CommandExecutionError as e: | |
| # ripgrep uses exit code 1 to mean "no matches"; other non-zero codes indicate errors. | |
| returncode = getattr(e, "returncode", None) | |
| if returncode == 1: | |
| self.logger.info("Ripgrep found no files containing maintainer keywords") | |
| return [] | |
| self.logger.warning( | |
| f"Ripgrep search failed with non-zero exit code {returncode}: {repr(e)}" | |
| ) | |
| # Re-raise so that real ripgrep failures are not silently treated as 'no matches'. | |
| raise |
| async def extract_maintainers( | ||
| self, | ||
| repo_path: str, | ||
| saved_maintainer_file: str | None = None, | ||
| ): | ||
| total_cost = 0 | ||
| candidate_files: list[str] = [] | ||
| ai_suggested_file: str | None = None | ||
|
|
||
| def _attach_metadata(result: MaintainerResult) -> MaintainerResult: | ||
| result.total_cost = total_cost | ||
| result.candidate_files = candidate_files | ||
| result.ai_suggested_file = ai_suggested_file | ||
| return result | ||
|
|
||
| # Step 1: Try the previously saved maintainer file | ||
| if saved_maintainer_file: | ||
| self.logger.info(f"Trying saved maintainer file: {saved_maintainer_file}") | ||
| result, cost = await self.try_saved_maintainer_file(repo_path, saved_maintainer_file) | ||
| total_cost += cost | ||
| if result: | ||
| return _attach_metadata(result) | ||
| self.logger.info("Falling back to maintainer file detection") | ||
|
|
||
| # Step 2: Find candidates via static list + ripgrep dynamic search | ||
| candidates = await self.find_candidate_files(repo_path) | ||
| candidate_files = [path for path, _ in candidates] | ||
|
|
||
| # Step 3: Try AI analysis on candidates, stop on first success | ||
| if candidates: | ||
| attempts = min(len(candidates), self.MAX_AI_ANALYSIS_ATTEMPTS) | ||
| for filename, content in candidates[:attempts]: | ||
| try: | ||
| result = await self.analyze_and_build_result(filename, content) | ||
| total_cost += result.total_cost | ||
| return _attach_metadata(result) | ||
| except MaintanerAnalysisError as e: | ||
| total_cost += e.ai_cost | ||
| self.logger.warning(f"AI analysis failed for '{filename}': {e.error_message}") | ||
| except Exception as e: | ||
| self.logger.warning(f"Unexpected error analyzing '{filename}': {repr(e)}") | ||
|
|
||
| self.logger.warning( | ||
| f"AI analysis failed for all {attempts} candidate(s), trying AI file detection" | ||
| ) | ||
| else: | ||
| self.logger.warning("No candidate files found via search, trying AI file detection") | ||
|
|
There was a problem hiding this comment.
This PR introduces a new multi-step maintainer extraction flow (saved-file retry, static+rg candidates, bounded AI analysis attempts, AI fallback) but there are currently no tests covering MaintainerService behavior. Adding unit tests around candidate discovery/filtering and the fallback ordering would help prevent regressions (especially around rg failures and the MAX_AI_ANALYSIS_ATTEMPTS cutoff).
| dynamic_paths = await self._ripgrep_search(repo_path) | ||
|
|
||
| scored_dynamic = [] | ||
| for candidate_path in dynamic_paths: | ||
| if candidate_path.lower() in static_paths_lower: | ||
| continue | ||
|
|
||
| file_path = os.path.join(repo_path, candidate_path) | ||
| try: | ||
| async with aiofiles.open(file_path, "r", encoding="utf-8") as f: | ||
| content = await f.read() | ||
| except Exception as e: | ||
| self.logger.warning(f"Failed to read dynamic match {candidate_path}: {repr(e)}") | ||
| continue | ||
|
|
||
| if file.lower() == "readme.md" and "maintainer" not in content.lower(): | ||
| self.logger.info(f"Skipping {file}: no maintainer-related content found") | ||
| continue | ||
|
|
||
| return file, base64.b64encode(content.encode()).decode(), 0 | ||
|
|
||
| self.logger.warning("No maintainer files found using the known file names.") | ||
| content_lower = content.lower() | ||
| # Calculate score based on keywords matched in the content | ||
| score = sum(1 for kw in self.CONTENT_VALIDATION_KEYWORDS if kw in content_lower) | ||
| if score > 0: | ||
| scored_dynamic.append((candidate_path, content, score)) | ||
| self.logger.info( | ||
| f"Dynamic match validated: {candidate_path} (keyword score: {score})" | ||
| ) | ||
|
|
There was a problem hiding this comment.
find_candidate_files reads the full contents of every rg -l match to compute a keyword score. With broad keywords like owner, ripgrep can return a large number of files, making this O(num_matches * file_size) and potentially very slow/memory-heavy on big repos even though only the first few candidates are later analyzed. Consider capping the number of dynamic matches to score, reading only a small prefix for scoring, or using ripgrep match counts (e.g., --count-matches/--json) to avoid loading whole files.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
| ) | ||
| except CommandExecutionError: | ||
| self.logger.info("Ripgrep found no files containing maintainer keywords") | ||
| return [] |
There was a problem hiding this comment.
Ripgrep error indistinguishable from no-matches result
Low Severity
_ripgrep_search catches CommandExecutionError and logs "Ripgrep found no files containing maintainer keywords", but run_shell_command raises the same CommandExecutionError for all non-zero exit codes. Ripgrep uses exit code 1 for "no matches" and exit code 2 for actual errors (e.g., invalid regex, permission issues). Genuine ripgrep failures are silently treated as "no matches found" with a misleading info-level log, instead of being logged as warnings, which could hinder debugging.
| for filename in filenames: | ||
| full_path = os.path.join(dirpath, filename) | ||
| results.append(os.path.relpath(full_path, repo_path)) | ||
| return results |
There was a problem hiding this comment.
Blocking os.walk call in async method
Low Severity
The _list_repo_files fallback uses synchronous os.walk inside an async method. For large repositories with deep directory trees, this blocks the event loop, potentially stalling all concurrent async tasks. The rest of the codebase uses aiofiles.os for async filesystem operations.


This pull request significantly improves the maintainer file detection and analysis logic for the git integration service. The main enhancements include a more robust and flexible approach to identifying maintainer-related files using both static lists and dynamic content-based search (via ripgrep), as well as improved handling and reporting of candidate files and AI-suggested files. Additionally, the changes introduce better error handling and metadata tracking throughout the maintainer extraction process.
Maintainer file detection and analysis improvements:
ripgrepto search for maintainer-related keywords across the repository, with fallback toos.walkifripgrepfails. This enables more accurate and comprehensive detection of potential maintainer files.Metadata and reporting enhancements:
MaintainerResultmodel to includecandidate_filesandai_suggested_filefields, allowing downstream consumers to see which files were considered and which file was suggested by AI.Dependency and utility updates:
ripgreptool to the Docker image to support fast and efficient content-based file search.These changes collectively make maintainer detection more reliable, transparent, and extensible, while also improving observability for debugging and future enhancements.
Note
Medium Risk
Modifies the maintainer detection/analysis flow and introduces repo-wide
rgsearches plus new fallback paths, which could affect accuracy, performance, and execution behavior across many repos.Overview
Improves maintainer extraction by prioritizing the previously saved maintainer file, then scanning the repo for likely governance files (expanded static filename list +
ripgrepkeyword search with filtering/scoring), and only falling back to AI filename selection as a last resort.Adds observability by returning/storing
candidate_filesandai_suggested_fileonMaintainerResultand persisting them inServiceExecution.metrics, and updates the git-integration Docker runner image to includeripgrep.Written by Cursor Bugbot for commit cc717d8. This will update automatically on new commits. Configure here.