feat: Improve isolation of agent evaluation CUJs by staging the work_dir to a new tmp directory tree by g-lynnzee · Pull Request #462 · GoogleCloudPlatform/evalbench

g-lynnzee · 2026-06-26T03:55:30Z

I ran into a case where the agent can access, and therefore uses, files in the working_dirs of other CUJs.

I have a test case that relies on the agent trying and failing to find a dataset file, then acting in a specific way. However, the agent sometimes reads other working_dirs and acts on the dataset files it finds there.

Usage example:

    ./evalbench/evalbench.py \
        --experiment_config=core-cujs/run_gemini_cli.yaml \
        --scenarios=autoctx-expert-eval-confirm

…_dir I ran into this for GoogleCloudPlatform/db-context-enrichment#168, where the agent reads the working_dirs of other CUJs. The test case relies on the agent trying and failing to find a dataset file, then acting in a specific way. However, the agent sometimes reads other working_dirs and acts on the dataset files it finds there.

This reverts commit dc1a38d.

This reverts commit 228cdf6.

IsmailMehdi · 2026-06-26T21:55:13Z

                else:
                    break

+        self._cleanup_sandbox(resolved_work_dir, temp_sandbox_dir)


This cleanup call isn't reached on exception. The turn-loop body has its own try/except for safe_generate / generate, but any exception that bypasses those (e.g. an uncaught error in extract_tools / extract_skills, simulated_user.get_next_response, KeyboardInterrupt, or a bug somewhere else in the loop body) skips this line and leaves the tempdir on disk. On a long-running eval server (or a flaky scenario that fails repeatedly), this leaks evalbench-sandbox-* directories under /tmp indefinitely.

Wrap the post-setup block in try/finally:

    resolved_work_dir = scenario.get("resolved_work_dir")
    execution_cwd, temp_sandbox_dir = self._setup_sandbox(resolved_work_dir)
    try:
        # fake_home prep + turn loop + _finalize_scenario all live here
        ...
    finally:
        self._cleanup_sandbox(resolved_work_dir, temp_sandbox_dir)

Worth a unit test alongside the change: patch self.generator.safe_generate to raise, call process_scenario, then assert no evalbench-sandbox-* directories remain under tempfile.gettempdir(). That locks the contract in.
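To make the suggestion concrete, here is a minimal, self-contained sketch of the try/finally contract and the check the proposed test would make. `SandboxedRunner`, its `generate` callable, and `leaked_sandboxes` are hypothetical stand-ins rather than the real evalbench classes. Only the method names `_setup_sandbox` / `_cleanup_sandbox`, the `copytree` call, and the `evalbench-sandbox-` prefix come from this PR, and the use of `tempfile.mkdtemp` inside `_setup_sandbox` is an assumption.

```python
import glob
import os
import shutil
import tempfile

SANDBOX_PREFIX = "evalbench-sandbox-"  # prefix named in the review comment


class SandboxedRunner:
    """Hypothetical stand-in for the evaluator, not the real evalbench class."""

    def __init__(self, generate):
        # `generate` is any callable that may raise, playing the role of
        # safe_generate / the turn loop.
        self.generate = generate

    def _setup_sandbox(self, resolved_work_dir):
        if not resolved_work_dir:
            return None, None
        # Assumption: the sandbox is created with mkdtemp and then populated.
        temp_sandbox_dir = tempfile.mkdtemp(prefix=SANDBOX_PREFIX)
        shutil.copytree(resolved_work_dir, temp_sandbox_dir, dirs_exist_ok=True)
        return temp_sandbox_dir, temp_sandbox_dir

    def _cleanup_sandbox(self, resolved_work_dir, temp_sandbox_dir):
        if temp_sandbox_dir:
            shutil.rmtree(temp_sandbox_dir, ignore_errors=True)

    def process_scenario(self, scenario):
        resolved_work_dir = scenario.get("resolved_work_dir")
        execution_cwd, temp_sandbox_dir = self._setup_sandbox(resolved_work_dir)
        try:
            # Everything after setup lives inside the try block.
            return self.generate(execution_cwd)
        finally:
            # Runs on success, on ordinary exceptions, and on KeyboardInterrupt.
            self._cleanup_sandbox(resolved_work_dir, temp_sandbox_dir)


def leaked_sandboxes():
    """List sandbox directories currently present under the system temp dir."""
    pattern = os.path.join(tempfile.gettempdir(), SANDBOX_PREFIX + "*")
    return glob.glob(pattern)
```

A test along these lines would construct the runner with a `generate` that raises, call `process_scenario`, and assert that `leaked_sandboxes()` returns the same set as before the call.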

IsmailMehdi · 2026-06-26T21:57:56Z

+        shutil.copytree(resolved_work_dir, temp_sandbox_dir, dirs_exist_ok=True)
+        return temp_sandbox_dir, temp_sandbox_dir
+
+    def _register_trusted_folders(self, fake_home: str, execution_cwd: str | None, resolved_work_dir: str | None) -> None:


This seems to work only for Gemini. We should support other agents and restrict this code path to Gemini.
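One possible shape for this, sketched under assumptions: dispatch on an agent name so the `.gemini/trustedFolders.json` write happens only for Gemini and other agents become an explicit no-op until they get their own handler. The `agent_name` parameter, the `TRUST_REGISTRARS` table, and the JSON payload (a folder-to-`"TRUST_FOLDER"` mapping) are illustrative assumptions, not taken from the PR.

```python
import json
import os
import tempfile


def _register_gemini_trusted_folders(fake_home, execution_cwd):
    """Write the Gemini trusted-folders file under the fake home directory."""
    path = os.path.join(fake_home, ".gemini", "trustedFolders.json")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    # ASSUMPTION: the file maps a folder path to a trust-level string.
    with open(path, "w", encoding="utf-8") as f:
        json.dump({execution_cwd: "TRUST_FOLDER"}, f)
    return path


# Hypothetical registry: agents without an entry are skipped.
TRUST_REGISTRARS = {"gemini": _register_gemini_trusted_folders}


def register_trusted_folders(agent_name, fake_home, execution_cwd):
    """Run the agent-specific registrar, if any; return the written path or None."""
    if not execution_cwd:
        return None
    registrar = TRUST_REGISTRARS.get(agent_name)
    if registrar is None:
        return None
    return registrar(fake_home, execution_cwd)
```

Keeping the Gemini-specific path in one function means that supporting another agent is a matter of adding one entry to the table rather than branching inside the evaluator.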

IsmailMehdi · 2026-06-26T22:01:14Z

+        if not execution_cwd:
+            return
+        trusted_folders_path = os.path.join(fake_home, ".gemini", "trustedFolders.json")
+        os.makedirs(os.path.dirname(trusted_folders_path), exist_ok=True)


Worth looking at mktemp?
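If the suggestion refers to Python's `tempfile` module, the relevant helpers are `tempfile.mkdtemp` and `tempfile.TemporaryDirectory`. `tempfile.mktemp` itself is deprecated because it only returns a name and leaves a race before creation. A minimal sketch follows; the function names and the `evalbench-home-` prefix are hypothetical.

```python
import os
import tempfile


def create_sandbox_dir(prefix="evalbench-sandbox-"):
    """Create a uniquely named directory, accessible only by the creating user.

    The caller is responsible for removing it.
    """
    return tempfile.mkdtemp(prefix=prefix)


def run_in_temporary_home(action):
    """Call `action(fake_home)` with a directory that is removed afterwards.

    TemporaryDirectory deletes the tree when the with-block exits, including
    when `action` raises.
    """
    with tempfile.TemporaryDirectory(prefix="evalbench-home-") as fake_home:
        return action(fake_home)
```

The context-manager form removes the need for a separate cleanup call, at the cost of tying the directory's lifetime to one block.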

IsmailMehdi · 2026-06-26T22:02:39Z

PTAL #461 also

g-lynnzee added 5 commits June 21, 2026 21:14

Feat: Add ability to filter to specific json scenario/examples

228cdf6

Usage example:

    ./evalbench/evalbench.py \
        --experiment_config=core-cujs/run_gemini_cli.yaml \
        --scenarios=autoctx-expert-eval-confirm

Update filtering logig

dc1a38d

Revert "Update filtering logig"

d585315

This reverts commit dc1a38d.

Revert "Feat: Add ability to filter to specific json scenario/examples"

5aaabff

This reverts commit 228cdf6.

g-lynnzee requested a review from IsmailMehdi as a code owner June 26, 2026 03:55

Merge branch 'main' into main

bef1a67

g-lynnzee mentioned this pull request Jun 26, 2026

feat: Enable sheperding of users through context engineering workflow GoogleCloudPlatform/db-context-enrichment#168

Merged

g-lynnzee added 2 commits June 25, 2026 21:03

Remove newlines

5484575

style guide fixes

57edee5

g-lynnzee changed the title ~~feat: Improve isolation of agent evaluation CUJs by copying and using only the working_dir~~ feat: Improve isolation of agent evaluation CUJs by staging the work_dir to a new tmp directory tree Jun 26, 2026

IsmailMehdi reviewed Jun 26, 2026

feat: Improve isolation of agent evaluation CUJs by staging the work_dir to a new tmp directory tree #462
g-lynnzee wants to merge 8 commits into GoogleCloudPlatform:main from g-lynnzee:main
