Skip to content

feat: Improve isolation of agent evaluation CUJs by staging the work_dir to a new tmp directory tree#462

Open
g-lynnzee wants to merge 8 commits into
GoogleCloudPlatform:mainfrom
g-lynnzee:main
Open

feat: Improve isolation of agent evaluation CUJs by staging the work_dir to a new tmp directory tree#462
g-lynnzee wants to merge 8 commits into
GoogleCloudPlatform:mainfrom
g-lynnzee:main

Conversation

@g-lynnzee

Copy link
Copy Markdown
Contributor

I ran into this where the agent is able to access and thus will use files in the working_dirs for other CUJs.

I have a test case that relies on the agent trying and failing to find a dataset file, and acting a certain way. However, the agent sometimes will read other working_dirs and act based on datasets files there.

Usage example:

./evalbench/evalbench.py \
  --experiment_config=core-cujs/run_gemini_cli.yaml \
  --scenarios=autoctx-expert-eval-confirm
…_dir

I ran into this for GoogleCloudPlatform/db-context-enrichment#168 where the agent will read working_dirs for other CUJs.

The test case relies on the agent trying and failing to find a dataset file, and acting a certain way. However, sometimes the agent will read other working_dirs and act based on datasets files there.
@g-lynnzee g-lynnzee requested a review from IsmailMehdi as a code owner June 26, 2026 03:55
@g-lynnzee g-lynnzee changed the title feat: Improve isolation of agent evaluation CUJs by copying and using only the working_dir feat: Improve isolation of agent evaluation CUJs by staging the work_dir to a new tmp directory tree Jun 26, 2026
else:
break

self._cleanup_sandbox(resolved_work_dir, temp_sandbox_dir)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cleanup call isn't reached on exception. The turn-loop body has its own try/except for safe_generate / generate, but any exception that bypasses those (e.g. an uncaught error in extract_tools / extract_skills, simulated_user.get_next_response, KeyboardInterrupt, or a bug somewhere else in the loop body) skips this line and leaves the tempdir on disk. On a long-running eval server (or a flaky scenario that fails repeatedly), this leaks evalbench-sandbox-* directories under /tmp indefinitely.

Wrap the post-setup block in try/finally:

resolved_work_dir = scenario.get("resolved_work_dir")
execution_cwd, temp_sandbox_dir = self._setup_sandbox(resolved_work_dir)
try:
    # fake_home prep + turn loop + _finalize_scenario all live here
    ...
finally:
    self._cleanup_sandbox(resolved_work_dir, temp_sandbox_dir)

Worth a unit test alongside the change: patch self.generator.safe_generate to raise, call process_scenario, then assert no evalbench-sandbox-* directories remain under tempfile.gettempdir(). That locks the contract in.

shutil.copytree(resolved_work_dir, temp_sandbox_dir, dirs_exist_ok=True)
return temp_sandbox_dir, temp_sandbox_dir

def _register_trusted_folders(self, fake_home: str, execution_cwd: str | None, resolved_work_dir: str | None) -> None:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to only work for gemini, we should support others and restrict this one to gemini.

if not execution_cwd:
return
trusted_folders_path = os.path.join(fake_home, ".gemini", "trustedFolders.json")
os.makedirs(os.path.dirname(trusted_folders_path), exist_ok=True)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

worth looking at mktemp ?

@IsmailMehdi

Copy link
Copy Markdown
Collaborator

PTAL #461 also

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants