fix(pyats): refactor pyats config file creation (incl disable EnvironmentDebugPlugin)#697
fix(pyats): refactor pyats config file creation (incl disable EnvironmentDebugPlugin)#697
Conversation
6a826c2 to
92ca3a8
Compare
…andling (#689, #570) Security fix: PyATS EnvironmentDebugPlugin was writing environment variables (including passwords) to env.txt files within archive artifacts. This change disables the plugin and refactors config file management for cleaner architecture. **Security (#689)** - Add `EnvironmentDebugPlugin: enabled: False` to plugin configuration - Add E2E regression test scanning ALL output artifacts for credential sentinels - Test covers both zip contents and all non-zip files **Architecture Refactoring (#570)** - Move config file creation from `execute_job()` into `SubprocessRunner.__init__()` - Add `cleanup()` method to SubprocessRunner for config file removal - Orchestrator calls `cleanup()` conditionally based on keep_artifacts policy - Config file attributes only set after successful file creation (safer error handling) - Extract config filenames as constants in `constants.py` - Write config files to output_dir with well-known names instead of temp files **Dead Code Removal** - Remove unused `_build_reporter_config()` and `_generate_plugin_config()` from orchestrator - Remove unnecessary `tempfile.TemporaryDirectory` wrapper in `_run_tests_async` - Remove unused `plugin_config_path` parameter Closes #689, #570 Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
92ca3a8 to
9214b78
Compare
… credential check - Remove 3 moot unit tests that checked config strings (fragile, not behavioral) - Remove duplicate integration test for config creation failure - Move credential sentinel test imports to module level - Run credential check for all scenarios, not just PyATS - Improve error message for PyATS config creation failure
…chestrator Verify that when SubprocessRunner fails to create config files (e.g., disk full), PyATSOrchestrator returns PyATSResults with from_error() for discovered test types.
…ix cleanup leak - Delete ConfigFileCreationError; _create_config_files now raises RuntimeError, consistent with the pyats-not-found failure in __init__ - Catch RuntimeError in orchestrator so both init failures populate api/d2d results correctly instead of falling through to the broad except-Exception handler - Call subprocess_runner.cleanup() in the report-generation failure branch so config files are removed regardless of outcome - Assert EnvironmentDebugPlugin disabled in plugin config integration test using regex; convert all config-content checks to re.search - Remove leftover try/finally unlink() blocks in TestConfigFileContent (files live in tmp_path, cleaned up by pytest)
…asserts Review feedback from PR #697: - Fix bug: if first config file writes successfully but second fails, clean up the first file to avoid orphaned files on disk - Replace assert statements with RuntimeError in _build_command() since asserts are stripped in optimized mode (-O) - Add clarifying comments explaining why cleanup isn't needed on SubprocessRunner init failure (attributes only set after success) - Add docstring note about planned nac-test-wide cleanup mechanism - Add test for partial write failure cleanup behavior
- Add SubprocessRunner.__del__ for opportunistic cleanup on unexpected exits (best-effort: CPython GC only, not guaranteed on SIGKILL; ref #677) - Mark unreachable None-guard with # pragma: no cover and explanatory comment - Update cleanup() docstring to reflect dual call sites (__del__ + orchestrator) - Extract _cleanup_subprocess_runner() helper in orchestrator to DRY up two identical call sites; documents that __del__ covers unexpected exit paths - Cast keep_artifacts to bool() to avoid bool | str | None union type - Remove stale comment in except RuntimeError block (made obsolete by __del__) - Strengthen test_cleanup_is_idempotent with file-existence assertions - Add test_del_calls_cleanup to verify __del__ removes config files
CleanupManager.register() gains a skip_if_debug parameter. Files registered with skip_if_debug=True are deleted on normal exit but retained when NAC_TEST_DEBUG is set, letting developers inspect intermediate artifacts without any extra flags or code changes. The merged data model YAML is the first consumer: it now registers with skip_if_debug=True so it survives for inspection in debug runs. The same mechanism is available for job scripts, testbed YAMLs, and any other artifacts tracked in PR #697. Internally, _files changes from set[Path] to dict[Path, bool], keeping the flag co-located with the path and avoiding two collections to sync. Remove the exit_with_cleanup() closure from main() — typer.Exit is caught by Click which calls sys.exit(), reliably triggering atexit handlers on every exit path. The closure and all cleanup_now() pre-exit calls were redundant with the already-registered atexit handler.
aitestino
left a comment
There was a problem hiding this comment.
Hey @oboehmer, thank you for the PR -- this is a well-motivated change that addresses a real security gap (EnvironmentDebugPlugin leaking credentials into archive artifacts) and takes the opportunity to clean up genuine architectural debt along the way. Moving config file ownership into SubprocessRunner so it owns both creation and cleanup is the right SRP boundary, the DRY win from eliminating those two copy-pasted config blocks in execute_job and execute_job_with_testbed is substantial (especially since they had already diverged -- one was missing the security fix), and the dead code removal of _build_reporter_config and _generate_plugin_config is overdue -- those methods were not only unused but broken (the dump_to_stream import references a module that doesn't exist). The E2E credential sentinel test is a strong regression guard, and the partial-write cleanup logic in _create_config_files is thoughtfully implemented.
I ran this through a pretty thorough review, and I think we need some adjustments before merge. The core design is sound and the security fix is correct (I traced through the PyATS source to confirm recursive_update properly merges the enabled: False override) -- these are all refinement items.
Things that need adjustment:
-
Cleanup gap on "no archives found" early return -- In
_generate_html_reports_async, the early return whenarchive_pathsis empty skips_cleanup_subprocess_runner()entirely. Thekeep_artifactsvariable is computed just above that block, so the fix is one line:self._cleanup_subprocess_runner(keep_artifacts)before the return. Alternatively, the two existing_cleanup_subprocess_runnercall sites on the success and failure branches could be collapsed into a singletry/finally-- that would also prevent this class of bug if a third return path is ever added. -
PLUGIN_CONFIGandPYATS_CONFIGshould live inconstants.py-- We try to centralize constants inconstants.py-- the filename constants (PYATS_PLUGIN_CONFIG_FILENAME,PYATS_CONFIG_FILENAME) are correctly placed there, but the content constants ended up insubprocess_runner.py. Moving them toconstants.pyasPYATS_PLUGIN_CONFIG_CONTENT: strandPYATS_CONFIG_CONTENT: str(with type annotations, matching the other constants in that file) would keep everything in one place. As a bonus, anyone auditing the plugin configuration finds filenames and their contents together. -
New constants missing from
__all__inconstants.py--PYATS_PLUGIN_CONFIG_FILENAMEandPYATS_CONFIG_FILENAMEaren't added to the__all__list that starts around line 113. Quick one-liner fix. -
Path.write_textpatches are too broad in both test files --patch.object(Path, "write_text", ...)intercepts everyPath.write_textcall in the process, not just the one in_create_config_files. Fortest_orchestrator_config_error.py, patching atSubprocessRunner._create_config_fileswithside_effect=RuntimeError(...)is more precise and decouples from the OSError-to-RuntimeError translation already covered in unit tests. For the unit tests, scoping to"nac_test.pyats_core.execution.subprocess_runner.Path.write_text"would be the right target. -
test_del_calls_cleanuptests the wrong contract -- It callsrunner.__del__()and asserts files are removed, buttest_cleanup_removes_config_filesalready covers that exact behavior. The actual__del__contract -- the thing that distinguishes it from callingcleanup()directly -- is that it swallows exceptions silently. Replacing with a test that patchescleanup()to raise and verifies__del__()does not propagate would test the real behavior. -
NAC_TEST_PYATS_KEEP_REPORT_DATAenv var string is scattered across 3 source files --orchestrator.py,multi_archive_generator.py, andgenerator.pyall evaluateos.environ.get("NAC_TEST_PYATS_KEEP_REPORT_DATA")independently. The existingDEBUG_MODE = get_bool_env("NAC_TEST_DEBUG")pattern shows exactly how to consolidate this:KEEP_REPORT_DATA: bool = get_bool_env("NAC_TEST_PYATS_KEEP_REPORT_DATA")inconstants.py. The PR'skeep_artifactslocal variable is a step in the right direction -- this would complete it. Pre-existing debt, but the PR adds a new evaluation path so now is a natural moment. -
object.__new__(SubprocessRunner)in two unit tests creates fragile coupling --test_write_failure_leaves_attributes_none_and_cleanup_is_safeandtest_partial_write_failure_cleans_up_first_filebypass__init__to test_create_config_filesin isolation. This couples to internal attribute names and creates partially-initialized objects that can't exist in production. For the partial-write test, the value is anchored in the filesystem assertion so it's tolerable. For the attribute-state test, consider either mocking the pyats executable check or folding the assertion into the existingtest_init_raises_runtime_error_on_write_failure. -
Missing
_cleanup_subprocess_runnerpolicy tests -- The helper has three behaviors worth covering: (a) no-op whensubprocess_runneris None, (b) skips cleanup whenkeep_artifacts=True, (c) delegates tocleanup()when both conditions are met. Three quick tests, each about 5 lines. -
Test file placement deepens Issue #541 debt --
test_orchestrator_config_error.pyis placed intests/pyats_core/, but that directory's ownconftest.pydocuments the intent to mergetests/pyats_core/intotests/unit/. Placing new files there moves in the wrong direction. Worth consideringtests/unit/pyats_core/instead, or at minimum acknowledging this as #541 scope. -
Security advisory: consider disabling
ArchiveConfigPluginin a follow-up -- TheArchiveConfigPlugin(also in PyATS'sdebug.py) remains enabled and writespyats.configuration.yamlandeasypy.configuration.yamlto archives. For nac-test's current usage the risk is limited since controller credentials flow via environment variables, but for D2D tests with testbed files, device credentials could potentially be reflected. AddingArchiveConfigPlugin: enabled: Falseto the plugin config would provide defense-in-depth. Not blocking -- the E2E sentinel test would catch it if credentials leak through this path.
I'd consider items 1-5 as things to address before merge -- they're all quick fixes (the cleanup gap is the most important one). Items 6-10 are lower priority and could be follow-ups if you'd rather not hold this up further.
What do you think?
P.S. — This comment was drafted using voice-to-text via Claude Code. If the tone comes across as overly direct or terse, please know that's just how it tends to phrase things. No offense or criticism is intended — this is purely an objective technical review of the PR. Thanks for understanding! 🙂
… and eliminate redundant re-read (#677 #704) (#727) * fix(pyats): prevent credential exposure in PyATS archives (#689) Disable EnvironmentDebugPlugin which was writing environment variables (including passwords) to env.txt files in PyATS result archives. Add E2E regression test that scans all output artifacts (including ZIP contents) for a sentinel password value to catch future credential leaks. * fix(security): auto-cleanup merged_data_model_test_variables.yaml (#677) Automatically delete merged_data_model_test_variables.yaml after test execution to prevent potential credential exposure in shared environments. Changes: - Add CleanupManager singleton for registering files to delete on exit - Register merged data file for cleanup immediately after creation - Remove duplicate data merging in RobotWriter (CLI already merges) - Simplify RobotWriter to accept merged_data_path instead of data_paths - Handle cleanup on normal exit (atexit) and SIGTERM signals The merged data file contains sensitive test variables that should not persist after test execution completes. Closes #677 * refactor: pass merged_data dict through Robot path, simplify DataMerger Eliminates the redundant YAML re-read in RobotWriter (fixes #704). Previously, main.py merged and wrote the data model to disk, then RobotWriter re-read and re-parsed it — costing ~51 s and ~400 MB peak RAM on large data models. The already-loaded dict is now passed through CombinedOrchestrator → RobotOrchestrator → RobotWriter directly. DataMerger is simplified in line with the removal of --merged-data-filename in #677: the filename parameter is removed from write_merged_data_model() (hardcoded to DEFAULT_MERGED_DATA_FILENAME), and load_yaml_file() is removed entirely as it had no remaining callers. RobotWriter._convert_data_model_for_templates() is deleted: the JSON round-trip was intended to convert CommentedMap to plain dicts for Jinja2, but CommentedMap is a dict subclass and is fully compatible natively. The redundant self.template_data alias is collapsed into self.data. Tests: - Add tests/unit/test_data_merger.py covering merge content correctness and write/roundtrip behaviour, replacing coverage lost when the commented-out integration tests were cleaned up - Add tests/unit/robot/test_robot_writer.py covering constructor semantics and render_template context selection - Move data_merge fixtures from tests/integration/fixtures/ to tests/unit/fixtures/ where they belong * fix: correct signal handler restoration for SIG_DFL (falsy value 0) signal.SIG_DFL has the integer value 0, which is falsy. The previous `original if original else signal.SIG_DFL` expression would silently treat a legitimate SIG_DFL original disposition as "not set" and still return SIG_DFL — technically a no-op here, but masking the real issue: any None (unrecorded handler) and SIG_DFL were indistinguishable. Switch to `original if original is not None else signal.SIG_DFL` so the intent is explicit and the two cases are correctly distinguished. Add parametrized tests covering SIG_DFL (falsy), SIG_IGN (truthy), and None (fallback) to pin the correct behaviour. * refactor: centralise merged data file path via DataMerger Consolidate all knowledge of where the merged data model file lives into DataMerger, eliminating independent path reconstruction across callers. - Rename DEFAULT_MERGED_DATA_FILENAME → MERGED_DATA_FILENAME; the DEFAULT_ prefix implied variability that no longer exists since the --merged-data-filename CLI option was removed - Add MERGED_DATA_FILE_MODE = 0o600 constant (owner read/write only) - Add DataMerger.merged_data_path() as the single query point for the file location; callers no longer construct output_dir / MERGED_DATA_FILENAME - write_merged_data_model() now returns the Path it wrote, owns the chmod call, and takes no filename argument — "write this sensitive file securely" is a single responsibility - Move merged_data_path injection into DeviceExecutor constructor so the orchestrator (which already holds the path) passes it down rather than DeviceExecutor re-deriving it from base_output_dir; removes the DataMerger import from device_executor.py entirely * refactor: remove exit_with_cleanup closure from main() typer.Exit is caught by Click internally, which calls sys.exit() — this triggers atexit handlers reliably on every exit path. The exit_with_cleanup closure (which called cleanup_now() before raising typer.Exit) was therefore redundant with the atexit handler already registered by CleanupManager. Replace all exit_with_cleanup() call sites with plain raise typer.Exit(), remove the closure, and drop the now-unused NoReturn import. * feat: preserve debug-relevant files when NAC_TEST_DEBUG is set CleanupManager.register() gains a skip_if_debug parameter. Files registered with skip_if_debug=True are deleted on normal exit but retained when NAC_TEST_DEBUG is set, letting developers inspect intermediate artifacts without any extra flags or code changes. The merged data model YAML is the first consumer: it now registers with skip_if_debug=True so it survives for inspection in debug runs. The same mechanism is available for job scripts, testbed YAMLs, and any other artifacts tracked in PR #697. Internally, _files changes from set[Path] to dict[Path, bool], keeping the flag co-located with the path and avoiding two collections to sync. Remove the exit_with_cleanup() closure from main() — typer.Exit is caught by Click which calls sys.exit(), reliably triggering atexit handlers on every exit path. The closure and all cleanup_now() pre-exit calls were redundant with the already-registered atexit handler. * test: assert merged data YAML is absent after a successful run * restore comments in main * refactor: place merged_data before optional params in orchestrator signatures * fix: log full exception traceback to file while keeping terminal output clean (#573) Add logger.exception() before the typer.echo() in the top-level exception handler so the full stack trace (including chained causes) is captured in the log file. Terminal output remains a single clean line for end users. * refactor: resolve merged_data_path to absolute once at construction Call .absolute() when assigning self.merged_data_path in PyATSOrchestrator so all use sites get a consistent absolute path without redundant calls. Move the explanatory comment to the assignment and rephrase it to not imply a cwd change is always in effect. * register DeviceExecutor temp files with CleanupManager The per-device job (.py) and testbed (.yaml) files created with NamedTemporaryFile(delete=False) were never deleted, accumulating in the OS temp directory across runs. Register them with CleanupManager so they are removed on normal exit, SIGTERM, and SIGINT. skip_if_debug=True is used consistently with merged_data_model_test_variables.yaml so that NAC_TEST_DEBUG retains all intermediate artifacts for debugging. Also removes the longstanding "UNCOMMENT ME" commented-out deletion block that this replaces. * extract _setup_run_tests_mocks helper in test_orchestrator The mock-wiring and artifact-creation block (pabot return value, RobotReportGenerator stub, robot_results/ stub files) was copy-pasted verbatim across test_verbose_flag_passed_to_pabot and test_include_exclude_tags_passed_to_pabot. Extract it into a _setup_run_tests_mocks static helper so future tests can reuse it without duplicating the setup. * tests: remove duplicate test_merged_data_path_matches_written_file test_returns_path_to_written_file already asserts the same equality (with operands swapped) and additionally checks that the file exists, making the duplicate redundant. * tests: add integration test for CleanupManager atexit via typer.Exit(0) Parametrized subprocess test covering the production exit path with skip_if_debug in both debug and non-debug modes — the atexit handler can't be exercised from in-process unit tests. * refactor: rename skip_if_debug to keep_if_debug in CleanupManager The old name required parsing "skip [deletion] if debug" to understand the intent. keep_if_debug=True reads directly as what it does. * tests: add subprocess-level signal coverage for CleanupManager Fill the contract gap where unit tests call _signal_handler() directly, bypassing the _install_signal_handlers() registration path. - Add test_cleanup_on_signal parametrized over SIGTERM and SIGINT: spawns a subprocess, waits for a "ready" handshake, delivers the signal, and asserts the sentinel file was deleted - Add timeout=30 to the existing subprocess.run() call - Rename _SCRIPT → _ATEXIT_SCRIPT, add _SIGNAL_SCRIPT at module level - Move @pytest.mark.windows to the atexit test only; signal test runs Linux-only by CI convention * remote trivial changes from new test_robot_writer.py * fix: guard bad directive syntax, fix stale docstring, use LOG_HTML constant - robot_writer.py: raise ValueError with file/directive context when iterate_list_chunked chunk_size cannot be parsed — propagates through the existing except Exception handler in combined_orchestrator, aborting execution cleanly instead of silently continuing with a bad directive - robot/orchestrator.py: remove stale step "3. Creates merged data model file" from run_tests() docstring; renumber remaining steps 4-6 → 3-5 - combined_orchestrator.py: replace bare "log.html" with LOG_HTML constant * add PRD, adjust comment * minor adjustments in cleanup.py * tests: restore integration rendering tests and fix fixture placement * refactor: replace _cleanup/cleanup_now with single public run_cleanup() Remove the private _cleanup() method and the cleanup_now() wrapper — the split served no purpose since both had identical behaviour and there was no reason to bypass the public API in atexit.register() or _signal_handler(). Inline the implementation directly into run_cleanup(), which is now the single method registered with atexit, called from the signal handler, and available for manual invocation. * reference MERGED_DATA_FILENAME in tests * add comment on fork safety * tests: improve CleanupManager unit test quality - Parametrize test_unregister_removes_path over keep_if_debug values - Remove test_register_multiple_files (only tested dict insertion count) - Strengthen test_cleanup_is_idempotent — assert file deleted on first call - Replace test_concurrent_registration with test_concurrent_register_and_cleanup: races register() against run_cleanup() to verify the lock prevents silent data loss, rather than just testing concurrent dict insertions * update PRD * fix: use RLock to prevent signal-handler deadlock in CleanupManager Python signal handlers run in the main thread between bytecodes. If SIGTERM/SIGINT arrives while the main thread holds the Lock inside register() or unregister(), the signal handler's run_cleanup() call attempts to acquire the same non-reentrant Lock, causing a deadlock. Replace threading.Lock with threading.RLock (reentrant) so the same thread can re-acquire safely from within the signal handler. Addresses PR review comment #1. * fix: always clean up merged data file regardless of debug mode The merged data YAML may contain credentials resolved from !env references. Registering it with keep_if_debug=True meant it would persist when NAC_TEST_DEBUG is set, which CI/CD pipelines might enable inadvertently — leaving credentials in build artifacts. Remove the keep_if_debug flag so the sensitive file is always deleted on exit. Debug users can still inspect it during execution; it only needs to survive until cleanup runs. Addresses PR review comment #2. * fix: delete files immediately when registered after cleanup has run Once run_cleanup() completes it sets _cleanup_done=True, making subsequent run_cleanup() calls no-ops. Files registered after that point were silently added to _files but never deleted — a bug since the docstring says run_cleanup() is safe to call multiple times. Fix: register() now checks _cleanup_done and, if True, deletes the file immediately via the new _delete_file() helper (extracted from run_cleanup() to share the deletion + keep_if_debug logic). Adds test_register_after_cleanup_deletes_immediately to verify the new behavior. Addresses PR review comment #4. * refactor: use _setup_run_tests_mocks helper in full execution test test_run_tests_full_execution inlined the same mock setup that the _setup_run_tests_mocks helper was introduced to eliminate. Delegate to the helper and override only the specific TestResults return value. Addresses PR review comment #5. * docs: add nac-yaml 2.0 dependency comment in RobotWriter Note why _convert_data_model_for_templates() was safe to remove: nac-yaml 2.0+ guarantees plain dict/list output, so Jinja2 receives ordinary Python types directly without a JSON round-trip conversion. Addresses PR review comment #6. * Revert "fix: delete files immediately when registered after cleanup has run" This reverts commit ff191a9. * fix(cleanup): remove redundant _cleanup_done flag _files.clear() already makes run_cleanup() naturally idempotent. Removing the flag ensures files registered between cleanup invocations (e.g. signal handler → atexit) are not silently lost. * refactor: adapt robot_writer shim to new merged_data constructor API The canonical RobotWriter constructor changed from data_paths: list[Path] to merged_data: dict[str, Any]. The shim now acts as an adapter: accepts the old data_paths API, calls DataMerger.merge_data_files() internally, and passes the merged dict to the canonical constructor. Also removes the signature reminder tests (follow-up to PR #765 where they should have been dropped — the integration tests now provide strictly better coverage of canonical API changes).
Replace manual cleanup()/\__del__() with centralized CleanupManager registration (keep_if_debug=True). This removes SubprocessRunner.cleanup(), SubprocessRunner.__del__(), and PyATSOrchestrator._cleanup_subprocess_runner() along with both call sites in _generate_html_reports_async(). Addresses PR #697 review comments #1 (cleanup gap on early return), #5 (test_del_calls_cleanup wrong contract), and #8 (missing policy tests for _cleanup_subprocess_runner — method now deleted).
- Add missing __all__ entries in constants.py (review #3) - Narrow Path.write_text patches to module scope in test_subprocess_runner instead of global patch.object (review #4) - Rewrite test_partial_write_failure and test_orchestrator_config_error to remove fragile object.__new__ coupling and update to current API (review #7)
|
Thanks for the thorough review @aitestino! Here's how we addressed each item:
The key architectural change: by adopting |
Description
Security fix: PyATS EnvironmentDebugPlugin was writing environment variables (including passwords) to
env.txtfiles within archive artifacts. This change disables the plugin and refactors config file management for cleaner architecture.Also refactor pyats config file creation and remove dead code
Closes
Related Issue(s)
__del__in this PR is a stopgap)Type of Change
Test Framework Affected
Network as Code (NaC) Architecture Affected
Platform Tested
Key Changes
Security (#689)
EnvironmentDebugPlugin: enabled: Falseto plugin configurationArchitecture Refactoring (#570)
execute_job()intoSubprocessRunner.__init__()cleanup()method to SubprocessRunner for config file removal__del__for opportunistic cleanup on unexpected exits (best-effort; see Remove merged data model artifact after test execution to prevent potential credential exposure #677)_cleanup_subprocess_runner()helper in orchestrator to DRY up two identical call sitescleanup()conditionally based on keep_artifacts policyconstants.pyDead Code Removal
_build_reporter_config()and_generate_plugin_config()from orchestratortempfile.TemporaryDirectorywrapper in_run_tests_asyncplugin_config_pathparameterTesting Done
pytest/pre-commit run -a)Test Commands Used
Checklist
pre-commit run -apasses)Screenshots (if applicable)
N/A
Additional Notes
Design decisions:
SubprocessRunnernow owns both creation AND cleanup of its config files (cleaner SRP)__del__cleanup is a safe no-op if files are already gone or were never created