feat: improve maintainers detection [CM-1033] by mbani01 · Pull Request #3908 · linuxfoundation/crowd.dev

mbani01 · 2026-03-10T15:41:38Z

This pull request significantly improves the maintainer file detection and analysis logic for the git integration service. The main enhancements include a more robust and flexible approach to identifying maintainer-related files using both static lists and dynamic content-based search (via ripgrep), as well as improved handling and reporting of candidate files and AI-suggested files. Additionally, the changes introduce better error handling and metadata tracking throughout the maintainer extraction process.

Maintainer file detection and analysis improvements:

Added dynamic maintainer file discovery using ripgrep to search for maintainer-related keywords across the repository, with fallback to os.walk if ripgrep fails. This enables more accurate and comprehensive detection of potential maintainer files.
Expanded and refined the static list of possible maintainer file names and valid file extensions, and introduced content validation keywords and exclusion rules to improve candidate file filtering.
Implemented a new extraction flow that prioritizes previously saved maintainer files, then tries static and dynamic candidates, and finally falls back to AI-based file detection as a last resort. The process now tracks and returns candidate files and AI-suggested files as metadata.

Metadata and reporting enhancements:

Updated the MaintainerResult model to include candidate_files and ai_suggested_file fields, allowing downstream consumers to see which files were considered and which file was suggested by AI.
Extended the maintainer processing pipeline to record and report candidate files and AI-suggested files in service execution logs and results.

Dependency and utility updates:

Added the ripgrep tool to the Docker image to support fast and efficient content-based file search.
Updated imports and utility usage to support the new shell command execution and error handling required by the ripgrep-based search.

These changes collectively make maintainer detection more reliable, transparent, and extensible, while also improving observability for debugging and future enhancements.

Note

Medium Risk
Modifies the maintainer detection/analysis flow and introduces repo-wide rg searches plus new fallback paths, which could affect accuracy, performance, and execution behavior across many repos.

Overview
Improves maintainer extraction by prioritizing the previously saved maintainer file, then scanning the repo for likely governance files (expanded static filename list + ripgrep keyword search with filtering/scoring), and only falling back to AI filename selection as a last resort.

Adds observability by returning/storing candidate_files and ai_suggested_file on MaintainerResult and persisting them in ServiceExecution.metrics, and updates the git-integration Docker runner image to include ripgrep.

Written by Cursor Bugbot for commit cc717d8. This will update automatically on new commits.

CLAassistant · 2026-03-10T15:41:56Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

Copilot

Pull request overview

This PR improves maintainer file detection in the git integration service by adding a multi-step discovery and analysis flow that combines static filename matching, dynamic ripgrep-based content search, and an AI fallback, while also surfacing more metadata about what was tried.

Changes:

Added ripgrep-based repo scanning (rg --files and keyword search) with fallback to os.walk, plus scoring/filtering of dynamic candidates.
Refactored maintainer extraction to prioritize a previously saved maintainer file, then analyze top candidates, then use AI file suggestion as a last resort.
Extended MaintainerResult and service execution metrics to include candidate_files and ai_suggested_file; added ripgrep to the Docker image.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| services/apps/git_integration/src/crowdgit/services/maintainer/maintainer_service.py | New candidate discovery + fallback extraction flow; logs and metrics now include candidate/AI-suggested file metadata. |
| services/apps/git_integration/src/crowdgit/models/maintainer_info.py | Adds new result metadata fields (`candidate_files`, `ai_suggested_file`). |
| scripts/services/docker/Dockerfile.git_integration | Installs `ripgrep` in the runner image to support dynamic search. |

Copilot · 2026-03-10T15:46:56Z

services/apps/git_integration/src/crowdgit/models/maintainer_info.py

+    candidate_files: list[str] = []
+    ai_suggested_file: str | None = None


candidate_files uses a mutable default ([]) on a Pydantic model, which can be shared across instances and lead to cross-request/state leakage. Use Field(default_factory=list) (and import Field) instead of a bare list default.
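
The difference between a shared default and a per-instance default can be illustrated with a minimal stdlib sketch. It uses a plain class and a `dataclasses` stand-in rather than the PR's actual Pydantic model, and `NaiveResult` and `SafeResult` are hypothetical names. Note that Pydantic itself copies mutable field defaults for each instance, so the leak shown here applies to plain class attributes; `Field(default_factory=list)` mainly makes the per-instance intent explicit.

```python
from dataclasses import dataclass, field
from typing import Optional


class NaiveResult:
    # Class-level list: every instance that mutates it in place shares one object.
    candidate_files: list = []


@dataclass
class SafeResult:
    # default_factory builds a fresh list for each instance.
    candidate_files: list = field(default_factory=list)
    ai_suggested_file: Optional[str] = None


first, second = NaiveResult(), NaiveResult()
first.candidate_files.append("MAINTAINERS.md")
shared = second.candidate_files  # the same list object that `first` mutated

safe_first, safe_second = SafeResult(), SafeResult()
safe_first.candidate_files.append("MAINTAINERS.md")
isolated = safe_second.candidate_files  # untouched by safe_first
```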

Copilot · 2026-03-10T15:46:56Z

services/apps/git_integration/src/crowdgit/services/maintainer/maintainer_service.py

+        except CommandExecutionError:
+            self.logger.info("Ripgrep found no files containing maintainer keywords")
+            return []


run_shell_command raises CommandExecutionError for any non-zero exit code. In _ripgrep_search, catching CommandExecutionError and assuming it means “no matches” will also silently swallow real ripgrep failures (e.g., invalid regex, permission errors), causing false “no candidate files” behavior. Consider updating run_shell_command to expose returncode, or adding an option to treat specific exit codes (rg uses 1 for no matches) as success while still surfacing actual errors.

Suggested change

-        except CommandExecutionError:
-            self.logger.info("Ripgrep found no files containing maintainer keywords")
-            return []
+        except CommandExecutionError as e:
+            # ripgrep uses exit code 1 to mean "no matches"; other non-zero codes indicate errors.
+            returncode = getattr(e, "returncode", None)
+            if returncode == 1:
+                self.logger.info("Ripgrep found no files containing maintainer keywords")
+                return []
+            self.logger.warning(
+                f"Ripgrep search failed with non-zero exit code {returncode}: {repr(e)}"
+            )
+            # Re-raise so that real ripgrep failures are not silently treated as 'no matches'.
+            raise
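
The exit-code distinction raised in this comment can also be sketched outside the service. The following is an illustration only: it calls `subprocess` directly instead of the PR's `run_shell_command`, and `RipgrepError`, `interpret_rg_result`, and `search_keywords` are hypothetical names. It relies on ripgrep's documented exit codes (0 for matches, 1 for no matches, 2 for errors).

```python
import subprocess


class RipgrepError(RuntimeError):
    """Raised when rg exits with a code other than 0 (matches) or 1 (no matches)."""


def interpret_rg_result(returncode: int, stdout: str, stderr: str = "") -> list[str]:
    """Map an rg exit code and its output to a list of matching paths."""
    if returncode == 0:
        return [line for line in stdout.splitlines() if line]
    if returncode == 1:
        # The search ran successfully but matched nothing.
        return []
    # Any other code (rg uses 2) is a genuine failure and should be surfaced.
    raise RipgrepError(f"rg failed with exit code {returncode}: {stderr.strip()}")


def search_keywords(repo_path: str, pattern: str) -> list[str]:
    """Run `rg -l -i` and return matching file paths (requires rg on PATH)."""
    proc = subprocess.run(
        ["rg", "-l", "-i", "--", pattern, repo_path],
        capture_output=True,
        text=True,
        check=False,  # inspect the exit code ourselves instead of raising
    )
    return interpret_rg_result(proc.returncode, proc.stdout, proc.stderr)
```

Keeping the exit-code mapping in a pure function makes the "no matches" versus "real failure" branches testable without ripgrep installed.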

Copilot · 2026-03-10T15:46:57Z

services/apps/git_integration/src/crowdgit/services/maintainer/maintainer_service.py

+    async def extract_maintainers(
+        self,
+        repo_path: str,
+        saved_maintainer_file: str | None = None,
+    ):
+        total_cost = 0
+        candidate_files: list[str] = []
+        ai_suggested_file: str | None = None
+
+        def _attach_metadata(result: MaintainerResult) -> MaintainerResult:
+            result.total_cost = total_cost
+            result.candidate_files = candidate_files
+            result.ai_suggested_file = ai_suggested_file
+            return result
+
+        # Step 1: Try the previously saved maintainer file
+        if saved_maintainer_file:
+            self.logger.info(f"Trying saved maintainer file: {saved_maintainer_file}")
+            result, cost = await self.try_saved_maintainer_file(repo_path, saved_maintainer_file)
+            total_cost += cost
+            if result:
+                return _attach_metadata(result)
+            self.logger.info("Falling back to maintainer file detection")
+
+        # Step 2: Find candidates via static list + ripgrep dynamic search
+        candidates = await self.find_candidate_files(repo_path)
+        candidate_files = [path for path, _ in candidates]
+
+        # Step 3: Try AI analysis on candidates, stop on first success
+        if candidates:
+            attempts = min(len(candidates), self.MAX_AI_ANALYSIS_ATTEMPTS)
+            for filename, content in candidates[:attempts]:
+                try:
+                    result = await self.analyze_and_build_result(filename, content)
+                    total_cost += result.total_cost
+                    return _attach_metadata(result)
+                except MaintanerAnalysisError as e:
+                    total_cost += e.ai_cost
+                    self.logger.warning(f"AI analysis failed for '{filename}': {e.error_message}")
+                except Exception as e:
+                    self.logger.warning(f"Unexpected error analyzing '{filename}': {repr(e)}")
+
+            self.logger.warning(
+                f"AI analysis failed for all {attempts} candidate(s), trying AI file detection"
+            )
+        else:
+            self.logger.warning("No candidate files found via search, trying AI file detection")



This PR introduces a new multi-step maintainer extraction flow (saved-file retry, static+rg candidates, bounded AI analysis attempts, AI fallback) but there are currently no tests covering MaintainerService behavior. Adding unit tests around candidate discovery/filtering and the fallback ordering would help prevent regressions (especially around rg failures and the MAX_AI_ANALYSIS_ATTEMPTS cutoff).
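
A rough sketch of the kind of test this comment asks for is shown below. It does not exercise the real `MaintainerService`; `analyze_candidates`, `AnalysisError`, `run_scenario`, and the value of `MAX_AI_ANALYSIS_ATTEMPTS` are hypothetical stand-ins that mirror only the bounded, stop-on-first-success loop from the diff above.

```python
import asyncio

MAX_AI_ANALYSIS_ATTEMPTS = 2  # hypothetical value; the PR's constant is not shown


class AnalysisError(Exception):
    """Stand-in for the service's analysis error type."""


async def analyze_candidates(candidates, analyze, max_attempts=MAX_AI_ANALYSIS_ATTEMPTS):
    """Try candidates in order; return (filename, result) on first success, else None."""
    for filename, content in candidates[:max_attempts]:
        try:
            return filename, await analyze(filename, content)
        except AnalysisError:
            continue  # move on to the next candidate
    return None


def run_scenario(failing, candidates):
    """Run the loop with a fake analyzer and record which files were attempted."""
    attempted = []

    async def fake_analyze(filename, content):
        attempted.append(filename)
        if filename in failing:
            raise AnalysisError(filename)
        return {"maintainers": [content]}

    outcome = asyncio.run(analyze_candidates(candidates, fake_analyze))
    return outcome, attempted


candidates = [("OWNERS", "alice"), ("MAINTAINERS.md", "bob"), ("README.md", "carol")]

# The first candidate fails, the second succeeds, and the loop stops there.
outcome_ok, attempted_ok = run_scenario({"OWNERS"}, candidates)
# Both allowed attempts fail, so the third candidate is never tried.
outcome_none, attempted_none = run_scenario({"OWNERS", "MAINTAINERS.md"}, candidates)
```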

Copilot · 2026-03-10T15:46:57Z

services/apps/git_integration/src/crowdgit/services/maintainer/maintainer_service.py

+        dynamic_paths = await self._ripgrep_search(repo_path)
+
+        scored_dynamic = []
+        for candidate_path in dynamic_paths:
+            if candidate_path.lower() in static_paths_lower:
+                continue
+
+            file_path = os.path.join(repo_path, candidate_path)
+            try:
                async with aiofiles.open(file_path, "r", encoding="utf-8") as f:
                    content = await f.read()
+            except Exception as e:
+                self.logger.warning(f"Failed to read dynamic match {candidate_path}: {repr(e)}")
+                continue

-                if file.lower() == "readme.md" and "maintainer" not in content.lower():
-                    self.logger.info(f"Skipping {file}: no maintainer-related content found")
-                    continue
-
-                return file, base64.b64encode(content.encode()).decode(), 0
-
-        self.logger.warning("No maintainer files found using the known file names.")
+            content_lower = content.lower()
+            # Calculate score based on keywords matched in the content
+            score = sum(1 for kw in self.CONTENT_VALIDATION_KEYWORDS if kw in content_lower)
+            if score > 0:
+                scored_dynamic.append((candidate_path, content, score))
+                self.logger.info(
+                    f"Dynamic match validated: {candidate_path} (keyword score: {score})"
+                )



find_candidate_files reads the full contents of every rg -l match to compute a keyword score. With broad keywords like owner, ripgrep can return a large number of files, making this O(num_matches * file_size) and potentially very slow/memory-heavy on big repos even though only the first few candidates are later analyzed. Consider capping the number of dynamic matches to score, reading only a small prefix for scoring, or using ripgrep match counts (e.g., --count-matches/--json) to avoid loading whole files.
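
Two of the mitigations mentioned here, capping the number of matches and reading only a prefix, might look roughly like the sketch below. The keyword tuple, `MAX_DYNAMIC_MATCHES`, `PREFIX_BYTES`, and both function names are assumptions for illustration, not values from the PR, and the sketch uses synchronous file reads for brevity.

```python
# Hypothetical keyword list and limits; the PR's actual values are not shown here.
KEYWORDS = ("maintainer", "owner", "reviewer")
MAX_DYNAMIC_MATCHES = 50
PREFIX_BYTES = 8192


def score_prefix(path: str, keywords=KEYWORDS, prefix_bytes: int = PREFIX_BYTES) -> int:
    """Count how many keywords appear in the first `prefix_bytes` bytes of the file."""
    with open(path, "rb") as f:
        head = f.read(prefix_bytes).decode("utf-8", errors="ignore").lower()
    return sum(1 for kw in keywords if kw in head)


def score_candidates(paths, max_matches: int = MAX_DYNAMIC_MATCHES):
    """Score at most `max_matches` paths and return (path, score) pairs, best first."""
    scored = [(p, score_prefix(p)) for p in paths[:max_matches]]
    return sorted((item for item in scored if item[1] > 0), key=lambda item: -item[1])
```

The trade-off is that a keyword appearing only late in a large file is missed, which may be acceptable if governance files tend to mention maintainers near the top.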

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

cursor · 2026-03-10T15:53:48Z

services/apps/git_integration/src/crowdgit/services/maintainer/maintainer_service.py

+            )
+        except CommandExecutionError:
+            self.logger.info("Ripgrep found no files containing maintainer keywords")
+            return []


Ripgrep error indistinguishable from no-matches result

Low Severity

_ripgrep_search catches CommandExecutionError and logs "Ripgrep found no files containing maintainer keywords", but run_shell_command raises the same CommandExecutionError for all non-zero exit codes. Ripgrep uses exit code 1 for "no matches" and exit code 2 for actual errors (e.g., invalid regex, permission issues). Genuine ripgrep failures are silently treated as "no matches found" with a misleading info-level log, instead of being logged as warnings, which could hinder debugging.

cursor · 2026-03-10T15:53:48Z

services/apps/git_integration/src/crowdgit/services/maintainer/maintainer_service.py

+                for filename in filenames:
+                    full_path = os.path.join(dirpath, filename)
+                    results.append(os.path.relpath(full_path, repo_path))
+            return results


Blocking os.walk call in async method

Low Severity

The _list_repo_files fallback uses synchronous os.walk inside an async method. For large repositories with deep directory trees, this blocks the event loop, potentially stalling all concurrent async tasks. The rest of the codebase uses aiofiles.os for async filesystem operations.
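
One common way to avoid blocking the event loop is to move the walk into a worker thread with `asyncio.to_thread` (Python 3.9+). The sketch below is a hedged illustration, not the PR's code: `_walk_repo_files` and `list_repo_files` are hypothetical names, and skipping `.git` is an added assumption.

```python
import asyncio
import os


def _walk_repo_files(repo_path: str) -> list[str]:
    """Synchronous helper: list files relative to repo_path, skipping .git."""
    results = []
    for dirpath, dirnames, filenames in os.walk(repo_path):
        dirnames[:] = [d for d in dirnames if d != ".git"]  # prune in place
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            results.append(os.path.relpath(full_path, repo_path))
    return results


async def list_repo_files(repo_path: str) -> list[str]:
    """Run the blocking walk in a worker thread so the event loop stays responsive."""
    return await asyncio.to_thread(_walk_repo_files, repo_path)
```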

mbani01 added 4 commits March 10, 2026 11:50

chore: install ripgrep (fb0871f)
feat: leverage maintainersFile from db before falling back to regular detection (30d324d)
feat: improve maintainers detection & analysis (099d19b)
feat: track analyzed maintainers files in metrics (cc717d8)

mbani01 self-assigned this Mar 10, 2026

Copilot AI review requested due to automatic review settings March 10, 2026 15:41

Copilot started reviewing on behalf of mbani01 March 10, 2026 15:42

Copilot AI reviewed Mar 10, 2026

cursor bot reviewed Mar 10, 2026

mbani01 requested a review from joanagmaia March 10, 2026 16:42

mbani01 marked this pull request as draft March 10, 2026 17:54

mbani01 wants to merge 4 commits into main from feat/improve_maintainer_file_detection

3 participants
