feat: Improve isolation of agent evaluation CUJs by staging the work_dir to a new tmp directory tree#462
feat: Improve isolation of agent evaluation CUJs by staging the work_dir to a new tmp directory tree#462g-lynnzee wants to merge 8 commits into
Conversation
Usage example: ./evalbench/evalbench.py \ --experiment_config=core-cujs/run_gemini_cli.yaml \ --scenarios=autoctx-expert-eval-confirm
…_dir I ran into this for GoogleCloudPlatform/db-context-enrichment#168 where the agent will read working_dirs for other CUJs. The test case relies on the agent trying and failing to find a dataset file, and acting a certain way. However, sometimes the agent will read other working_dirs and act based on datasets files there.
This reverts commit dc1a38d.
This reverts commit 228cdf6.
| else: | ||
| break | ||
|
|
||
| self._cleanup_sandbox(resolved_work_dir, temp_sandbox_dir) |
There was a problem hiding this comment.
This cleanup call isn't reached on exception. The turn-loop body has its own try/except for safe_generate / generate, but any exception that bypasses those (e.g. an uncaught error in extract_tools / extract_skills, simulated_user.get_next_response, KeyboardInterrupt, or a bug somewhere else in the loop body) skips this line and leaves the tempdir on disk. On a long-running eval server (or a flaky scenario that fails repeatedly), this leaks evalbench-sandbox-* directories under /tmp indefinitely.
Wrap the post-setup block in try/finally:
resolved_work_dir = scenario.get("resolved_work_dir")
execution_cwd, temp_sandbox_dir = self._setup_sandbox(resolved_work_dir)
try:
# fake_home prep + turn loop + _finalize_scenario all live here
...
finally:
self._cleanup_sandbox(resolved_work_dir, temp_sandbox_dir)Worth a unit test alongside the change: patch self.generator.safe_generate to raise, call process_scenario, then assert no evalbench-sandbox-* directories remain under tempfile.gettempdir(). That locks the contract in.
| shutil.copytree(resolved_work_dir, temp_sandbox_dir, dirs_exist_ok=True) | ||
| return temp_sandbox_dir, temp_sandbox_dir | ||
|
|
||
| def _register_trusted_folders(self, fake_home: str, execution_cwd: str | None, resolved_work_dir: str | None) -> None: |
There was a problem hiding this comment.
This seems to only work for gemini, we should support others and restrict this one to gemini.
| if not execution_cwd: | ||
| return | ||
| trusted_folders_path = os.path.join(fake_home, ".gemini", "trustedFolders.json") | ||
| os.makedirs(os.path.dirname(trusted_folders_path), exist_ok=True) |
There was a problem hiding this comment.
worth looking at mktemp ?
|
PTAL #461 also |
I ran into this where the agent is able to access and thus will use files in the working_dirs for other CUJs.
I have a test case that relies on the agent trying and failing to find a dataset file, and acting a certain way. However, the agent sometimes will read other working_dirs and act based on datasets files there.