
docs: Add FAQ section for common questions #18

Open
meichuanyi wants to merge 318 commits into 0xSteph:main from meichuanyi:docs-add-faq


@meichuanyi


Adds an FAQ section covering features, paths, MCP tools, benchmarks, tiers, and where to get help.

0xSteph added 30 commits April 29, 2026 18:40
…tests

See CHANGELOG for the full list of fixes shipped in this release.
The big ones:

- Templates packaging fix means HTML and PDF reports finally work
  for everyone who installed via pip
- LLM gate replaced with first-run setup prompt so customers using
  Claude Code MCP no longer need to set a fake API key
- Agents fall back to deterministic mode when LLM is unreachable
  instead of returning silent zero-finding results
- 4 previously-fictional agents (api_security, credential_tester,
  vuln_scanner, privesc) now exist as real BaseAgent classes with
  both LLM-driven and deterministic implementations
- 8 specialist agents wired up as MCP tools: total 33 -> 41
- Recon results capped per phase to prevent 27,680-finding explosions
- Stale 'running' engagements get reconciled to 'interrupted' on
  the next ptai start

Also adds 4 integration tests for the reconciler that hit a real
SQLite DB. The original reconciler shipped with started_at instead
of created_at in the WHERE clause; mocked CLI tests didn't catch it
because they don't talk to actual SQL. Fixed in 7fcb503; the new
test_reconcile_uses_real_columns guards against that class of
schema drift in CI.
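
A minimal sketch of why a test against a real SQLite database catches this class of bug. The `engagements` table layout, the `reconcile_stale` function, and the `make_db` helper are hypothetical stand-ins, not the actual ptai schema or reconciler:

```python
import sqlite3


def reconcile_stale(conn: sqlite3.Connection, cutoff_iso: str) -> int:
    """Mark 'running' engagements created before the cutoff as 'interrupted'.

    Hypothetical stand-in for the real reconciler. If the WHERE clause
    named a column that does not exist (for example started_at), SQLite
    would raise sqlite3.OperationalError here; a mocked connection would not.
    """
    cur = conn.execute(
        "UPDATE engagements SET status = 'interrupted' "
        "WHERE status = 'running' AND created_at < ?",
        (cutoff_iso,),
    )
    conn.commit()
    return cur.rowcount


def make_db() -> sqlite3.Connection:
    # Assumed schema, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE engagements (id TEXT PRIMARY KEY, status TEXT, created_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO engagements VALUES (?, ?, ?)",
        [
            ("old", "running", "2026-01-01T00:00:00"),
            ("new", "running", "2026-06-01T00:00:00"),
            ("done", "completed", "2026-01-01T00:00:00"),
        ],
    )
    return conn
```

Because the statement is executed by SQLite itself, a column-name mismatch fails the test immediately instead of passing silently.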
Code (9 fixes):
- Item 7: subdomain cap configurable via PENTEST_AI_MAX_FINDINGS_<PHASE> env vars
- Item 9: orchestrator catches asyncio.CancelledError, marks engagement as
  'interrupted' on SIGINT/SIGTERM instead of leaving 'running' forever
- Item 11: ptai resume on already-completed engagement now says
  "already completed; nothing to resume" with a hint to use ptai retest,
  instead of the misleading "resumed and completed"

CI / packaging (2 new):
- Item 10: .github/workflows/docker.yml builds + pushes ghcr.io image on
  every v*.*.* tag. Smoke-tests the published image before declaring success.
- Item 1: docs/release-pypi.md walks through the one-time PyPI Trusted
  Publishing setup that the v0.10.2 + v0.10.3 release runs were missing.

Marketing site:
- Item 14: pricing/trust-bar counts updated 191/12/33 -> 194/17/41 to match
  the actual product (committed in pentest-ai-preview-v4 separately).

Docs (4 drafts ready for legal/ops review):
- Item 3: docs/legal/PRIVACY.md and docs/legal/TERMS.md
- Item 5: docs/status-page-runbook.md
- Item 17: docs/launch-playbook.md (T-14 through T+7 plan)

655 tests pass.
WebAgent's deterministic _run_tool_phase ran tools sequentially. nuclei +
nikto + skipfish in series could push a single phase past 5 minutes; the
end-to-end test in 0.10.3 hit a 20s timeout on test_web_app for that reason.

Fix: each phase now schedules its installed tools with asyncio.gather and
caps each tool at 30s wall-clock. ReconAgent, ADAgent, APISecurityAgent,
and WirelessAgent had the same pattern and got the same fix. ReconAgent
preserves its phase finding-cap; cap is applied after collection rather
than mid-stream so we can run tools concurrently.

Verified by tests/test_agents_parallel.py: 5 mock tools that each sleep
0.5s complete in ~0.5s wall-clock instead of 2.5s.

658 tests pass.
Pre-launch security sweep with bandit found 8 HIGH severity issues. Triage:

REAL FIXES:
- engine/tool_installer.py: install_tool and install_tier had subprocess.run
  with shell=True and an f-string that interpolated sudo_password. CWE-78
  command-injection if a password ever contained shell metachars. Switched
  to argv form with the password piped via stdin.
- cli/menu.py: os.system(candidate + " --quiet") replaced with
  subprocess.run([candidate, "--quiet"], check=False). The path is currently
  a fixed location but defense-in-depth keeps a future user-driven path safe.
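
A hedged sketch of the argv-plus-stdin pattern. The helper names are hypothetical and the real install_tool code may be structured differently; `sudo -S` (read password from stdin) and `-p ''` (empty prompt) are standard sudo options:

```python
import subprocess
import sys


def run_with_secret_on_stdin(argv, secret):
    """Run argv without a shell, passing the secret only via stdin.

    Because no shell parses the command line, metacharacters in the
    secret are never interpreted.
    """
    return subprocess.run(
        argv,
        input=secret + "\n",
        capture_output=True,
        text=True,
        check=False,
    )


def sudo_install_argv(package):
    # Illustrative argv for an apt-based install; the package name is a
    # separate list element, so it is never word-split or expanded.
    return ["sudo", "-S", "-p", "", "apt-get", "install", "-y", package]
```

A call such as `run_with_secret_on_stdin(sudo_install_argv("nmap"), password)` keeps the password out of the argument list and out of any shell string.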

ANNOTATED AS INTENTIONAL:
- engine/scanners.py x5: httpx.AsyncClient(verify=False, ...) marked with
  # nosec B501 + a comment. Built-in scanners deliberately scan targets
  with potentially-broken SSL; cert validity is part of what we report on,
  not an abort condition.

ALSO:
- engine/sarif.py: tool.driver.version was hardcoded "0.8.0" (lying to
  GitHub Code Scanning about which ptai produced the SARIF). Now resolves
  from importlib.metadata at export time.
- .gitleaks.toml: allowlist tests/ and engine/scanners.py so the test
  fixture JWT and the secret-pattern regexes don't fail a launch-blocking
  scan. gitleaks runs clean against full git history (94 commits) with
  this config.
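
For the SARIF version fix, resolving the version at export time might look like the sketch below. The distribution name `pentest-ai`, the fallback string, and the helper names are assumptions made for illustration:

```python
from importlib import metadata


def resolve_tool_version(dist_name: str = "pentest-ai",
                         fallback: str = "0.0.0+unknown") -> str:
    """Return the installed distribution's version, or a fallback string.

    The distribution name is an assumption; use whatever name the
    package is actually published under.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return fallback


def sarif_driver(dist_name: str = "pentest-ai") -> dict:
    # Minimal tool.driver fragment with the version looked up at call time.
    return {"name": "ptai", "version": resolve_tool_version(dist_name)}
```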

Verified:
- bandit -r ... --severity-level high: 0 issues (was 8)
- pip-audit on project venv: 0 known CVEs
- gitleaks detect: 0 leaks
- 658 tests pass.
See CHANGELOG for details. Three real fixes (the SARIF one is not a bandit finding):
- tool_installer: shell=True + f-string sudo password (CWE-78)
- cli/menu: os.system replaced with subprocess.run
- SARIF tool version was hardcoded 0.8.0; now dynamic via importlib.metadata

bandit HIGH: 8 -> 0. pip-audit: clean. gitleaks: clean. 658 tests pass.
Adds a new security workflow that runs on every push/PR with:
- bandit at -lll -ii (HIGH severity, medium+ confidence) matching project baseline
- pip-audit --strict against editable install
- gitleaks with .gitleaks.toml config

Also fixes em dash in release.yml comment.
Adds 61 tests across 41 MCP tools, hitting all error paths, profile
resolution flows, agent delegation, browser action dispatch, evidence
filesystem walks, campaign creation, and intensity changes.

Full suite: 719 passed, 19 skipped (no regression).
Adds 25 tests for detect_os, has_go/has_pip, audit_tools tier filter,
print_audit table rendering, install_tool across apt/go/pip/snap/manual
methods (success and failure), and install_tier with sudo, apt-update,
and skip-tools paths. All subprocess calls mocked.

Full suite: 744 passed, 19 skipped.
Adds 44 tests for SecurityTool._build_command argument-filter branches
(blocked keys, allowed_args, regex, bool, shell injection), execute()
with cache hit, cache miss + write, parse_output dispatch,
FileNotFoundError, generic Exception, and configure_cache state plus
_persist_tool_result paths.

Full suite: 814 passed, 19 skipped.
Adds 40 tests for BaseAgent setters, _check_scope branches, the
_execute_tool_call dispatcher (analyze_findings, store_finding success
+ dedup + invalid severity, builtin/security routing, unknown), the
_run_builtin_scanner success/timeout/exception/scope-violation paths,
the _run_security_tool registry-miss/not-installed/scope/timeout/retry
paths and auth-arg injection, run_tool_loop deterministic fallback and
LLM-driven termination, think() first-call vs mid-loop failure, and
_truncate edge cases.

Specialist agents inherit; this lifts the coverage floor for the whole
agent family.

Full suite: 854 passed, 19 skipped.
Two tests gated on ANTHROPIC_API_KEY + PTAI_E2E_LIVE=1:
- direct provider completion (proves API reachable + auth works)
- BaseAgent.think round-trip (proves the LLM tool-loop wiring is alive)

Skipped in default test runs and in CI. Run before recording the demo
video and before each release: this is the only proof we have that the
flagship LLM-driven path actually works against a live API.

Documented in docs/launch-playbook.md T-14 checklist.
Splits test and lint jobs:
- test runs across 6 cells (ubuntu/macos/windows × py3.10/py3.12)
  with fail-fast disabled so a single OS regression does not mask others
- lint + mypy run once on ubuntu/py3.12 (no value in repeating across OSes)

Switched from uv venv + activate (bash-only) to plain pip + actions/setup-python
caching, which works identically on all three OSes.

Local: 854 passed, 21 skipped.
Adds a global cap that fans out across every PENTEST_AI_MAX_FINDINGS_*
env var the recon agent honors. Default 0 keeps the per-phase defaults
(200 for subdomain enum, 500 for port scan, etc.). Explicit per-variable
settings still win because the global value is applied with setdefault.
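
A rough sketch of the fan-out, assuming the environment is passed in as a dict. The phase suffixes in `PHASES` and the function name are hypothetical; only the `PENTEST_AI_MAX_FINDINGS_<PHASE>` naming comes from the commit text:

```python
PHASES = ("SUBDOMAIN", "PORTSCAN")  # hypothetical phase suffixes


def apply_global_cap(env, global_cap):
    """Fan a global cap out to per-phase variables.

    setdefault only writes a key that is absent, so any per-phase
    value the user set explicitly is left untouched.
    """
    if global_cap <= 0:
        return env  # 0 keeps the built-in per-phase defaults
    for phase in PHASES:
        env.setdefault(f"PENTEST_AI_MAX_FINDINGS_{phase}", str(global_cap))
    return env
```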

Closes Phase 13.3 from the public-launch plan.

Full suite: 856 passed, 21 skipped.
Adds 17 tests for APISecurityAgent, CredentialTesterAgent, VulnScannerAgent,
PrivescAdvisorAgent, SocialEngineerAgent. Each has the same shape: LLM
unavailability falls back to a deterministic tool-driven path. Tests
exercise both branches plus tool-execute exceptions.

Coverage: api_security 27→95%, credential_tester 31→100%,
vuln_scanner 31→100%, privesc 29→100%, social_engineer 50→100%.
Adds 31 tests for _same_host, _normalize_url, _FormLinkParser link/form
extraction, _extract_endpoints param parsing + destructive-param skips,
_finding truncation, _send_probe GET/POST/exception, _probe_sqli
error-marker + no-marker + probe-failure paths, _probe_xss reflection
detection, _probe_cmdi id-marker detection, run_authenticated_scan
session vs authenticator dispatch, and _crawl non-html / skip-substring
/ http-error handling.

All httpx calls mocked, no live network.
- engine/auth_handler.py: AuthCredentials.from_dict / from_cli_args
  branches and build_auth_args across nuclei, sqlmap, ffuf flag mapping
- engine/llm/providers/anthropic.py: complete() with simple message,
  with system+tools, and round-trip translating tool_calls and tool_results
- agents/report/renderer.py: risk_level CRITICAL/HIGH/MEDIUM/LOW branches
  plus render_pdf weasyprint dispatch

Global coverage: 77%.
cli/auth.py: 33 tests for api_base override, store/load_api_key with
env-priority and OSError handling, key_source resolution, validate_key_remote
success/invalid/HTTP-error/non-200 paths, ingest_engagement no-key/success/
402 quota/unparseable-body/error/network-error paths, and mask_key.

cli/mcp_setup.py: platform detection, command/entry shape, generate_config_snippet,
detect_installed_clients no-config edge case, inject_config dry-run/write/missing.
- Ollama: complete with simple/tools+JSON-arg variants, base_url normalization
- OpenAI: complete simple, tool_calls JSON parse + bad-JSON fallback,
  _format_message with tool_calls and tool_call_id
- PoCAgent: validate_finding not-found, static-poc injection/unknown,
  validate_all severity filter
- AD/Wireless/Cloud: deterministic and LLM-unavailable fallback paths

Global coverage: 78%.
Tests SIGINT handler install/uninstall idempotency and graceful
non-main-thread degradation, all REPL commands (resume, step, abort,
skip, inject, help, blank, unknown, EOF, KeyboardInterrupt), and
_inspect findings/chain/summary branches with truncation.

Global coverage: 79%.
- cli/credential_resolvers/aws_sm: empty ref, plain string secret,
  JSON-field extraction, missing field, binary secret, empty secret,
  GetSecretValue failure, missing-boto3 paths
- agents/recon._env_int: default, valid override, invalid value, zero/negative
- agents/report/renderer.write_report: html-only, html+pdf with RuntimeError
  fallback, html+pdf success

Global coverage: 80% (target hit). Test count: 1016 passed, 21 skipped.
The pre-push hook runs against a venv without ptai[tracing] extras, so
the OTLP exporter and console-exporter tests (which import unconditionally
inside Tracer._init_if_needed) need importorskip guards.

Result: 26 pass with [tracing] installed, 20 pass + 6 skip without.
Pre-push secret scanner flagged AKIA[...]EXAMPLE in test fixtures.
The string is AWS's documented example (not a real credential), but
the regex match is reasonable signal. Replace with a clearly-fake
placeholder; the parsers match on the surrounding 'secret:' / 'rule:'
phrases, not the key format itself.
security.yml: gitleaks-action SHA cb71... no longer resolves on GitHub.
Replace with the latest pinned tag v2.3.9 (ff98106e).

Windows CI matrix surfaced pre-existing platform issues (the Linux-only
CI never ran them). Three fixes:

1. cli/auth_profiles._check_perms now skips the 0o600 enforcement on
   Windows. Windows uses ACLs, not Unix mode bits, so st_mode reads
   back 0o666 regardless. Hardening on Windows requires DACLs out of
   band; the code now documents that.

2. Tests that assert specific Unix mode bits (test_save_creates_0600_file,
   test_load_refuses_world_readable_file, test_store_and_load_api_key)
   are skipped on Windows.

3. test_expanduser now sets USERPROFILE alongside HOME so '~' resolves
   correctly on Windows.
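
The platform guard in the first fix could look roughly like this. `check_perms` is a simplified, hypothetical version of `_check_perms`, not the shipped code:

```python
import os
import stat
import sys


def check_perms(path: str) -> None:
    """Refuse files accessible to group or others; skip the check on Windows."""
    if sys.platform == "win32":
        # Windows uses ACLs, so st_mode does not reflect real access
        # control. Hardening there has to happen out of band via DACLs.
        return
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:
        raise PermissionError(f"{path} has mode {oct(mode)}; expected 0o600")
```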

Local suite: 1016 passed, 21 skipped.
- test_load_refuses_group_readable_file is the second perm-mode test
  that needs the Windows skip mark (paired with the world-readable one)
- pip-audit failed on CVE-2026-3219 in the runner's pre-installed pip;
  upgrade pip in the venv before audit so the project's own deps are
  what gets audited
mcp_server/server.py:758 has called agent.get_cookies(url) since the
browser_inspect tool was wired in 0.10.x, but the method was never
implemented. The 'cookies' action would have raised AttributeError if
anyone selected it. mypy --ignore-missing-imports caught this once the
new CI matrix included a type-check step on Python 3.12.

Implementation mirrors extract_forms: open a page, navigate, read
cookies via the Playwright BrowserContext API.
engine/tracing.py:163-178 wrapped both span-start and the user-yield
in a single try/except. When user code raised inside the span:
  1. Inner except marked status=error and re-raised
  2. Outer except caught the same exception
  3. Outer except yielded a NoopSpan as a 'fallback'
  4. The contextmanager generator had already yielded once, so the
     second yield raised 'generator didn't stop after throw()'

Refactor splits the two concerns:
  - Outer try wraps only start_as_current_span() init (real failure
    case where NoopSpan fallback is correct)
  - Once we yield the wrapper, user exceptions propagate cleanly to
    the caller through the with-statement exit machinery

Adds test_span_user_exception_propagates_cleanly to lock this in.
Removes the apologetic NOTE that documented the original bug.
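
The shape of the refactor, reduced to a standalone sketch. `NoopSpan`, `RecordingSpan`, and the `start_span` callable are simplified stand-ins; the real code wraps start_as_current_span() and differs in detail:

```python
from contextlib import contextmanager


class NoopSpan:
    def set_status(self, status):
        pass


class RecordingSpan:
    def __init__(self):
        self.status = "ok"

    def set_status(self, status):
        self.status = status


@contextmanager
def span(start_span):
    # Guard only span creation; a failure here falls back to NoopSpan.
    try:
        active = start_span()
    except Exception:
        active = NoopSpan()
    # Exactly one yield. An exception raised in the with-body is
    # recorded and re-raised, and the generator never yields again.
    try:
        yield active
    except Exception:
        active.set_status("error")
        raise
```

The key property is that there is a single `yield`, so an exception thrown into the generator can never reach a second one.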

Full suite: 1017 passed, 21 skipped.
- README: add Community section linking to GitHub Discussions
- docs/launch/launch-checklist.md: go/no-go gate document
- docs/launch/soc2-kickoff.md: vendor matrix + SOC2 Type I kickoff plan
- docs/launch/community-channel.md: GitHub Discussions decision record
- docs/launch/demo-script.md: 90s demo video beat sheet + toolchain
- docs/launch/testimonial-outreach.md: outreach templates + tracker
- docs/launch/install-matrix.md: cross-platform install verification matrix
0xSteph and others added 28 commits May 25, 2026 14:46
Phase 4. Two more probes gain OOB code paths gated on engagement_id
in session extras:

  engine/probes/web/xxe_upload.py — after the existing three-step
  in-band probe (baseline, file-disclosure, billion-laughs DoS), an
  external-DTD + SVG-DTD payload fires per candidate path under both
  shapes (raw application/xml + multipart/form-data). pending_oob row
  carries the critical CWE-611 finding template.

  engine/probes/web/stored_xss.py — after the existing POST-then-GET
  echo confirmation, three curated stored-XSS OAST payloads (img
  onerror fetch, sendBeacon, attribute-break) fire as the first
  COMMENT_FIELDS field value under both JSON + form-urlencoded shapes.
  pending_oob row carries the high CWE-79 finding template. Confirms
  when a victim's browser later renders the stored comment and the
  payload calls back to the collaborator.

Blind-RCE wiring deliberately deferred — ptai has no general
command-injection probe today (the closest, deserialization +
nextjs_rsc_rce, are CVE-specific). The Phase-4 payload library
already ships rce_oob_payloads() so a future command-injection probe
gets OOB for free.

86 / 86 in the cross-probe + OOB sweep (SSRF + SQLi + XXE + stored-XSS
+ OOB client/registry/payloads + poll_oob).
Phase 4. CLI surface for the OOB collaborator. Three flags set the
env vars the OOB registry reads:

  --oast-server URL    -> PTAI_OAST_SERVER (default: https://oast.fun)
  --oast-token TOKEN   -> PTAI_OAST_TOKEN (for self-hosted Interactsh)
  --no-oast            -> PTAI_NO_OAST=1 (disable OAST entirely)

Pentesters running on programs that forbid third-party collaborator
infra (the PortSwigger Burp Collaborator policy is canonical) can now
point ptai at a self-hosted Interactsh server in one command:

  ptai start http://target --oast-server https://oast.example.com --oast-token <T>

Or turn OAST off where outbound DNS/HTTP to a collaborator isn't
permitted at all:

  ptai start http://target --no-oast

Flags carry through to MCP run_probe / poll_oob via the env-var seam;
CLI agent-mode probes pick them up the same way once task 0xSteph#11 wires
that path. 67 / 67 CLI tests still green.
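
A sketch of the flag-to-env-var seam using argparse as a stand-in; the actual CLI framework and the precedence of --no-oast over the other two flags are assumptions. The function name is hypothetical:

```python
import argparse


def apply_oast_flags(argv, env):
    """Translate OAST flags into the env vars the OOB registry reads."""
    parser = argparse.ArgumentParser(prog="ptai-sketch")
    parser.add_argument("--oast-server")
    parser.add_argument("--oast-token")
    parser.add_argument("--no-oast", action="store_true")
    # parse_known_args ignores the subcommand and target arguments.
    args, _rest = parser.parse_known_args(argv)
    if args.no_oast:
        env["PTAI_NO_OAST"] = "1"  # assumed to override the other flags
        return env
    if args.oast_server:
        env["PTAI_OAST_SERVER"] = args.oast_server
    if args.oast_token:
        env["PTAI_OAST_TOKEN"] = args.oast_token
    return env
```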
Phase 4. Documents the dual-mode collaborator story for blind-vuln
detection: encrypted-at-rest payloads with server-side metadata
visibility, the PortSwigger Collaborator policy as the canonical
"self-host on paid engagements" rule, and the --oast-server /
--no-oast escape hatches.

Section sits under Responsible Use because OAST has the same kind
of operational consequence as scope enforcement: the user owns the
decision about where callback data lands.
Closes Phase 4. Walks the full OOB loop in one test against a mock
Interactsh server stood up on a loopback port:

  register_oob_probe -> mock-interactsh /register (RSA pubkey stored)
                     -> pending_oob row persisted
  mock.queue_interaction(...) — simulates the target firing the payload
  MCP poll_oob       -> mock-interactsh /poll (returns real wire-format
                        AES-CTR ciphertext + RSA-OAEP-wrapped AES key)
                     -> decrypt -> find_pending_oob_by_full_id
                     -> add_finding + mark_pending_oob_matched
                     -> on-disk oob_interaction evidence artifact

Exercises exactly the encryption code path the real oast.fun server
uses (re-uses tests/test_oob_client.py::_fake_poll_response). Confirms
the artifact file content carries the queued interaction's source IP.

Phase 4 deliverable scoreboard:
  audit deal-breaker 0xSteph#2 (blind vulns undetectable) -> CLOSED
  346 / 346 tests in the cross-phase sweep

CHANGELOG entry summarizes the 10 commits comprising Phase 4 (#21-#30
plus the carved-out RCE-wiring follow-up).
Phase 5 (Caido / Burp / ZAP plugin) prereqs. Both surfaced as BLOCKING
by the cross-stream gap audit:

get_findings now accepts:
  - url=<substring>  case-insensitive LIKE match on the target column.
                     Lets a proxy plugin scope the Findings tab to the
                     URL the user is currently inspecting.
  - since=<iso-ts>   created_at >= comparison. Lets a polling Findings
                     tab fetch only what's new since the last refresh
                     instead of re-downloading the full list each tick.

No schema migration needed — the existing target and created_at columns
carry the data. Both filters default to None and combine cleanly with
the existing severity / status filters; backward-compatible.
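
How the two filters might compose with the existing ones, as a sketch against an assumed `findings` table (`id`, `engagement_id`, `target`, `severity`, `created_at`). The real FindingsDB query is not shown in the commit, so treat the SQL as illustrative:

```python
import sqlite3


def get_findings(conn, engagement_id, severity=None, url=None, since=None):
    """Build the query incrementally; every filter defaults to None (no-op)."""
    sql = ("SELECT id, target, severity, created_at FROM findings "
           "WHERE engagement_id = ?")
    params = [engagement_id]
    if severity is not None:
        sql += " AND severity = ?"
        params.append(severity)
    if url is not None:
        # Case-insensitive substring match. LIKE wildcards in the input
        # (% and _) are not escaped in this sketch.
        sql += " AND LOWER(target) LIKE ?"
        params.append(f"%{url.lower()}%")
    if since is not None:
        # ISO-8601 timestamps in one consistent format compare correctly as text.
        sql += " AND created_at >= ?"
        params.append(since)
    sql += " ORDER BY created_at"
    return conn.execute(sql, params).fetchall()
```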

health() MCP tool:
  Liveness probe for plugin status indicators. Returns
  {status, version, timestamp, uptime_seconds, active_engagements}
  with zero side effects. Never raises — degrades to active=0 when the
  DB is unreachable so the status indicator can show a clear "MCP up,
  DB degraded" state instead of just "MCP unreachable."

Server start time anchored at module import via time.monotonic() so
uptime_seconds is meaningful even after the DB reconnects. Active
engagement count is a single COUNT(*) WHERE status='running' query;
sub-millisecond on any sane row count.
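
A minimal sketch of a never-raising liveness probe along these lines. The `engagements` table, the `"degraded"` status value, and the `version` parameter are assumptions; the returned keys follow the commit text:

```python
import sqlite3
import time
from datetime import datetime, timezone

# Anchored at import; monotonic time is unaffected by wall-clock changes.
_STARTED = time.monotonic()


def health(conn, version="0.0.0"):
    """Return a status dict; degrade to active_engagements=0 on any DB error."""
    try:
        row = conn.execute(
            "SELECT COUNT(*) FROM engagements WHERE status = 'running'"
        ).fetchone()
        active, status = row[0], "ok"
    except Exception:
        active, status = 0, "degraded"  # assumed label for the DB-down state
    return {
        "status": status,
        "version": version,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "uptime_seconds": round(time.monotonic() - _STARTED, 3),
        "active_engagements": active,
    }
```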

9 new tests (4 url/since combos, 4 health behaviors, 1 backward-compat).
314 / 314 across the MCP + findings_db + OOB + evidence + CLI sweep.
Phase 5 bridge. ptai's MCP server normally speaks JSON-RPC over SSE,
which is the right shape for Claude Code but finicky from a JVM/JS
HTTP client. The new mcp_server/rest.py module mounts four Starlette
routes via FastMCP's @mcp.custom_route() decorator that delegate
straight to the existing tool functions:

  GET  /v1/health           -> health()
  GET  /v1/findings         -> get_findings(engagement_id=, severity=,
                                            status=, url=, since=)
  POST /v1/http_request     -> http_request({engagement_id, method, url,
                                             headers?, body?, json_body?,
                                             auth_profile?,
                                             allow_destructive?})
  GET  /v1/evidence         -> get_evidence(engagement_id=, finding_id=,
                                            include_content=, as_curl=)

No duplicated logic — each route is just unpack-args, call-tool,
JSONResponse-wrap. LocalAuthMiddleware enforces Bearer + Host: header
identically on /v1/* as it does on /sse, so REST consumers (the
Caido / Burp / ZAP plugins shipping out of pentest-ai-extensions) get
the same DNS-rebinding-defense + token-auth posture as MCP consumers.

The plugin client at pentest-ai-extensions/caido/packages/frontend/
src/ptai/client.ts already targets these exact paths — v0.0.1 of the
plugin starts functioning end-to-end the moment this ships.

10 dedicated tests covering auth, host allowlist, all four routes,
JSON body parsing, error shapes. 324/324 across the full MCP +
findings_db + evidence + OOB + CLI regression sweep.
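
From the consumer side, a request to the findings route might be built as below. This is a client-side sketch using only the standard library; the base URL and port in the usage note are assumptions, and the helper name is hypothetical:

```python
from urllib.parse import urlencode
from urllib.request import Request


def findings_request(base_url, token, engagement_id, **filters):
    """Build (but do not send) a GET /v1/findings request with Bearer auth."""
    query = {"engagement_id": engagement_id}
    # Drop filters left as None so they are omitted from the query string.
    query.update({k: v for k, v in filters.items() if v is not None})
    url = f"{base_url.rstrip('/')}/v1/findings?{urlencode(query)}"
    return Request(url, headers={"Authorization": f"Bearer {token}"}, method="GET")
```

For example, `findings_request("http://127.0.0.1:8000", token, "e1", severity="high")` yields a request that `urllib.request.urlopen` could send, with the Host header derived from the URL.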
Task 0xSteph#11 — closes the carry-forward from Phase 1. ptai's standalone
"ptai start" agent-mode path (cli/main.py:start, agent_mode=True)
drives probes via engine.agents.handlers.registry_bridge, which builds
its own aiohttp.ClientSession deep in the call chain. That session
never carried the _ptai_extras shape the HTTP primitives chokepoint
looks for, so every Phase 1+4 capability was MCP-path-only:

  - no evidence_artifacts on findings from the CLI path
  - --upstream-proxy / PTAI_UPSTREAM_PROXY had no effect on CLI runs
  - OOB-aware probes silently skipped their OAST code path

Now the bridge populates session._ptai_extras with:

  - engagement_id (from WorkingMemory) — unlocks the OOB probes'
    register_oob_probe() call path
  - evidence_collector (rooted at PENTEST_EVIDENCE_DIR) — every
    HTTP call through the primitives chokepoint persists request +
    response bytes
  - proxy (from PTAI_UPSTREAM_PROXY if set) — Phase-2 stealth proxy
    passthrough now works for ptai start --upstream-proxy <url>

After the probe returns, _attach_pending_evidence_to_findings(session,
findings) drains the pending-evidence buffer and attaches the artifact
summaries to every finding the probe emitted — same orchestrator-side
auto-attach the MCP run_probe path does.

Intensity-derived stealth knobs (ua_rotation, jitter_ms) deferred —
WorkingMemory doesn't carry intensity today; small refactor follows
when needed. The big-ticket items (evidence + OOB + proxy) all work
on the CLI path now.

Legacy AgentOrchestrator (--legacy-pipeline) path still TODO; that
goes through agents/web/web_agent.py + agent-specific session
creation. Not on the critical path since agent_mode is the default
(--agent-mode/--legacy-pipeline). Tracked as follow-up.

One new test in tests/test_registry_bridge.py asserts the extras
attach + buffer-drain wiring works end-to-end. 58/58 in the
registry_bridge + agent_loop + working_memory + handler sweep.
Full-suite run revealed two flakes in test_mcp_rest_adapter.py that
passed in isolation but failed when another async-test file ran first
in the session. Cause: the tests used the deprecated
asyncio.get_event_loop().run_until_complete() pattern to seed the
FindingsDB before driving TestClient. Under Python 3.13 + pytest-asyncio
mode=AUTO, get_event_loop() raises 'no current event loop in thread'
when a prior test closed the loop.

Fix: convert both tests to @pytest.mark.asyncio and seed the DB with a
plain `await`, then drive TestClient inside the same async test.
TestClient spins up its own loop internally, and its calls are
synchronous from Python's perspective, so nothing inside the running
coroutine awaits TestClient's response stream and there is no conflict
with the outer pytest-asyncio loop.

Pure test-hygiene change, no production code touched. 21/21 across
test_mcp_rest_adapter + test_findings_db_get_findings_filters +
test_oob_end_to_end + test_mcp_health + test_evidence_integration_e2e
(the cross-file async-DB load that previously triggered the flake).
Headline release closing 3 of 4 deal-breakers from the pentester audit:

  Phase 1 — Evidence bundle. Every finding carries verbatim request
            + response bytes, SHA-256 integrity hash, and a copy-pasteable
            curl one-liner. SARIF gains DAST webRequest/webResponse so
            GitHub Code Scanning renders the exchange inline.

  Phase 2 — Real intensity=stealth. UA rotation across 7 curated modern
            browser UAs, per-request jitter, upstream proxy passthrough
            via --upstream-proxy / PTAI_UPSTREAM_PROXY.

  Phase 4 — OOB collaborator. Interactsh client (RSA-OAEP + AES-CTR
            wire format verified against upstream), curated per-DBMS /
            per-engine payload library (SSRF, blind SQLi, XXE, RCE,
            stored XSS, SSTI, Log4Shell), poll_oob MCP tool, OAST
            payloads wired into ssrf_cloud_metadata + sqli_fuzz +
            xxe_upload + stored_xss.

Plus the plugin-client foundation:

  - MCP auth: per-install token file (~/.pentest-ai/mcp-token, 0600)
    + Host: allowlist for DNS-rebinding defense
  - REST adapter at /v1/health, /v1/findings, /v1/http_request,
    /v1/evidence — lets JVM/JS proxy plugins consume ptai without
    SSE+JSON-RPC
  - get_findings(url=, since=) + new health() MCP tool
  - CLI agent-mode parity: ptai start probes now carry evidence the
    same way MCP-driven probes do

39 commits since 0.15.3. Smoke verified against TaskFlow honeypot:
8/8 findings carry evidence_artifacts, 406/406 on-disk artifacts emit
valid curl reproducers, SARIF webRequest+webResponse populated, REST
/v1/* returns 200, CLI parity confirmed.
…ovider (0xSteph#12)

Issue 0xSteph#12 reporters were hitting "agent_mode: NNN action handlers
registered" then silent exit. The Ollama OLLAMA_HOST fix in 73fef36
addressed a real bug (factory.py read OLLAMA_BASE_URL instead of the
canonical OLLAMA_HOST env var), but it didn't help these users — the
CLI agent-mode driver (engine.agents.anthropic_agent.AnthropicAgent)
hardcodes the Anthropic Messages API surface (client.messages.create)
and is constructed unconditionally via `client = AsyncAnthropic()` at
cli/main.py:729.

So a user setting PENTEST_AI_LLM_PROVIDER=ollama and OLLAMA_HOST was:

  1. Past _llm_key_present() (which honors ollama)
  2. Hitting AsyncAnthropic() — no ANTHROPIC_API_KEY in env
  3. First LLM call failing inside AnthropicAgent.decide_next_action,
     which catches the exception and returns Action(name="finish"),
     terminating the loop cleanly — under the spinner — with no
     surfaced error.

Fix: validate before the Progress spinner starts. When agent_mode is
active and PENTEST_AI_LLM_PROVIDER is set to something other than
"anthropic", exit 4 with a clear message pointing the user at:

  - Running ptai over MCP (Path 1/2) — every provider works there
  - --no-llm for the deterministic wrapped-tools path
  - PENTEST_AI_LLM_PROVIDER=anthropic + ANTHROPIC_API_KEY

Same gate also fires when PENTEST_AI_LLM_PROVIDER=anthropic is explicit
but ANTHROPIC_API_KEY is missing (the existing _llm_key_present gate
only catches the "no provider configured at all" case; this catches the
"provider chosen, key forgotten" case).

Two dedicated regression tests in tests/test_cli.py. 69/69 in the CLI
sweep. Native multi-provider CLI agent-mode (so ollama/openai/litellm
actually work on Path 3) tracked as a separate follow-up.
Supersedes the loud-failure guards from 6fc6d11. The previous fix told
non-Anthropic users their setup was unsupported; this fix actually
makes their setup work.

Three changes wire ptai's CLI agent-mode through the existing
provider-agnostic LLMClient factory so OpenAI / Ollama / LiteLLM users
hit a functioning agent loop instead of a silent hang:

  cli/main.py
    Replace the hard-coded `AsyncAnthropic()` construction with
    `create_llm_client()`. The factory honours PENTEST_AI_LLM_PROVIDER,
    auto-detects the provider from whichever API key is set, and
    returns a unified LLMClient. Removed the now-redundant loud-failure
    guards added in 6fc6d11.

  engine/agents/anthropic_agent.py
    Duck-type the client in decide_next_action. If the client exposes
    .complete() it's the unified LLMClient — call it with LLMMessage
    objects and read resp.content directly. Otherwise fall back to
    the legacy client.messages.create() path so the 8 tests that pass
    MagicMock with messages.create stubbed keep working. The fallback
    is the only thing keeping the class name accurate for now; the
    rename to MultiProviderAgent (or similar) is a later cleanup.

  engine/llm/factory.py
    Auto-detect provider from available keys when PENTEST_AI_LLM_PROVIDER
    isn't set: ANTHROPIC_API_KEY -> anthropic, OPENAI_API_KEY -> openai,
    neither -> openai (fallback to current default). Closes the
    "I set OPENAI_API_KEY and got nothing" foot-gun poeylizn hit on
    the original issue thread.
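
The auto-detection rule can be sketched as a pure function over an env mapping. `detect_provider` is a hypothetical name, and checking ANTHROPIC_API_KEY first when both keys are set is an assumption based on the order listed above:

```python
def detect_provider(env):
    """Pick a provider name from an environment-like mapping."""
    explicit = env.get("PENTEST_AI_LLM_PROVIDER", "").strip().lower()
    if explicit:
        return explicit  # an explicit setting always wins
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    if env.get("OPENAI_API_KEY"):
        return "openai"
    return "openai"  # neither key set: keep the existing default
```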

Verified end-to-end against a real local Ollama instance running
qwen2.5-coder:7b:
  - Factory routes to OllamaProvider (correct base_url + model)
  - create_llm_client() wraps with cost-tracking; .complete() exposed
  - LLMClient.complete(...) round-trips through Ollama; got real reply
  - AnthropicAgent.decide_next_action uses the new branch, returns a
    real Action(name='probe.test', ...) — not finish-due-to-failure

Tests updated: two replaced (test_start_agent_mode_uses_factory_for_*)
to assert the factory routing for OpenAI and Ollama users. 104/104
across tests/test_cli + test_anthropic_agent + test_llm + test_agent_loop
in scope.

Native Anthropic SDK fallback path stays so existing tests + users
with bare AsyncAnthropic clients keep working without changes.
Patch release with the real fix for issue 0xSteph#12 (silent exit in CLI
agent-mode when PENTEST_AI_LLM_PROVIDER is non-Anthropic). 0.16.0
shipped an adjacent OLLAMA_HOST factory fix but missed the actual
root cause; 0.16.1 wires CLI agent-mode through the unified LLM
factory so OPENAI_API_KEY / OLLAMA_HOST / PENTEST_AI_LLM_PROVIDER=ollama
users actually work on Path 3 instead of being silently ignored.

Verified end-to-end against a live local Ollama before tagging.
…docs

Issue 0xSteph#12 follow-up. poeylizn was on 0.16.1 pointing ptai at DeepSeek
in the cloud and got a 404 because the openai-path factory still
asked for gpt-4o regardless of what their endpoint served. The
PENTEST_AI_MODEL env var existed but only the LiteLLM path honored
it; the openai / anthropic / ollama paths used their own hardcoded
defaults.

Now all four provider paths honor PENTEST_AI_MODEL. Same recipe works
for DeepSeek, Groq, Together AI, local llama.cpp / vLLM / LM Studio,
and any other OpenAI-compatible endpoint.

Also adds docs/llm-providers.md with concrete env-var recipes per
provider (Anthropic, OpenAI + compatible, Ollama, LiteLLM with
Azure / Bedrock / Vertex / OpenRouter examples), troubleshooting,
and the --no-llm escape hatch. Linked from README Path 3.

5 new tests in tests/test_llm.py covering PENTEST_AI_MODEL across
all four providers + explicit-arg-beats-env precedence. 108/108
across the touched paths.
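
The precedence the tests cover (explicit argument, then PENTEST_AI_MODEL, then the provider default) might be expressed like this. `resolve_model` and `DEFAULT_MODELS` are illustrative names; only the gpt-4o default for the openai path is taken from the commit text:

```python
# Only the openai default is known from the commit; others are omitted.
DEFAULT_MODELS = {"openai": "gpt-4o"}


def resolve_model(provider, explicit=None, env=None):
    """Explicit argument beats PENTEST_AI_MODEL, which beats the provider default."""
    env = env or {}
    if explicit:
        return explicit
    from_env = env.get("PENTEST_AI_MODEL")
    if from_env:
        return from_env
    return DEFAULT_MODELS.get(provider)
```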

Verified live against running Ollama: PENTEST_AI_MODEL=qwen2.5-coder:7b
routes correctly through the factory + the unified LLMClient + a real
completion call.
…bodies

Issue 0xSteph#12 follow-up. poeylizn reported a 400 from DeepSeek-cloud that
read "Client error '400 Bad Request' for url '...'" with no upstream
reason. Turned out his actual problem was the model-name gap that
0.16.2 fixed (he resolved it by upgrading), but the unreadable error
exposed a real engineering miss: all four providers were calling
httpx response.raise_for_status() which throws away the response body.

Custom-base-URL users (DeepSeek cloud, Groq, Together AI, vLLM, etc.)
got no diagnostic when their endpoint rejected a request, even when
the upstream's JSON error body would have told them exactly what to
fix.

0.16.3 changes:

  engine/llm/providers/openai.py — new LLMHTTPError class. When
  response.status_code >= 400, raise LLMHTTPError with endpoint URL,
  model name, status code, and the upstream response body (truncated
  to 2 KB).

  engine/llm/providers/anthropic.py — same fix, reuses LLMHTTPError.
  engine/llm/providers/ollama.py    — same.

  docs/llm-providers.md — troubleshooting section updated to walk
  through the new error format, common DeepSeek 400 causes (model
  rename, missing /v1 suffix, key whitespace), and a no-network
  factory-config preflight (python -c "from engine.llm.factory ...").

End-to-end verified live: PENTEST_AI_LLM_PROVIDER=openai pointed at
Ollama's /v1/ with a nonexistent model now raises

  LLMHTTPError: OpenAI-compatible endpoint at http://localhost:11434/v1
  returned HTTP 404 for model='this-model-does-not-exist':
  {"error":{"message":"model 'this-model-does-not-exist' not found",
  "type":"not_found_error",...}}

instead of the old opaque "Client error '404' for url '...'". Same
endpoint with a valid model name completes normally — the openai
provider's custom-base-URL path was never broken, just unhelpful when
the upstream said no.
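
A rough sketch of the error shape described above, assuming a plain status code and body string rather than a real httpx response. The class name and message format follow the commit message; `check_response` and `MAX_BODY_BYTES` are hypothetical names, and the real provider code may differ.

```python
MAX_BODY_BYTES = 2048  # the 2 KB cap mentioned in the commit message


class LLMHTTPError(Exception):
    """Carries endpoint, model, status code and the (truncated) upstream body."""

    def __init__(self, endpoint, model, status_code, body):
        self.endpoint = endpoint
        self.model = model
        self.status_code = status_code
        self.body = body
        super().__init__(
            f"OpenAI-compatible endpoint at {endpoint} returned "
            f"HTTP {status_code} for model={model!r}: {body}"
        )


def check_response(endpoint, model, status_code, body_text):
    """Hypothetical helper: raise with the upstream body instead of discarding it."""
    if status_code >= 400:
        raw = body_text.encode("utf-8")[:MAX_BODY_BYTES]
        # errors="ignore" drops a multi-byte character cut in half by the cap.
        raise LLMHTTPError(endpoint, model, status_code, raw.decode("utf-8", errors="ignore"))
```

The point of the design is that the upstream's own JSON error travels with the exception, so the caller sees why the endpoint refused the request.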
Two related cleanups in one commit. Both have been red on every main
push for the past day or two; the per-commit pentest-ai workflow's
lint job is catching real quality drift.

1. Ruff (34 errors -> 0):
   - 26 auto-fixed by `ruff check --fix`: unused imports (F401),
     unsorted import blocks (I001), quoted annotations (UP037).
   - 5 manual SIM105 suppressible-exception sites in probe-edge
     code (sqli_fuzz, stored_xss, xxe_upload) intentionally use
     try/except/pass to keep probe edges resilient against arbitrary
     network failures; tagged with `# noqa: SIM105`.
   - 2 auth_local.py OSError suppressors converted to
     contextlib.suppress (clean fit, narrow exception).
   - 1 F811 shadowing in mcp_server/server.py (local `rest` var
     collided with `from mcp_server import rest`); renamed local
     to `body`.
   - 2 F841 unused-variable cases in tests (`e2`, `old`); removed
     the bindings.

2. AI-typography cleanup in user-facing docs:
   - Replaced em-dashes (74 total) with hyphens across CHANGELOG.md,
     README.md, docs/llm-providers.md.
   - Same pass dropped Unicode arrows (->), ellipses (...), and
     en-dashes from this session's additions. Older README emoji,
     badges, and intentional legacy math symbols (>=, x) preserved.

Mypy was already clean. 100 tests across the touched paths still
green after the auto-fixes.
The v0.16.3 release.yml test job died on a TypeError: three existing
Anthropic provider tests mocked `response` with bare MagicMocks that
didn't set status_code, so the new `response.status_code >= 400`
check raised against the mock instead of returning False. Result:
the 0.16.3 tag exists in git, but the 0.16.3 wheel never reached PyPI.

This release re-cuts the same LLM-provider-error-body fix on top of
the fixed test fixtures + the ruff cleanup + the em-dash sweep that
landed in between. Six tests now set fake_response.status_code = 200
explicitly; 40 / 40 in the touched files green.
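
The failure mode is easy to reproduce in isolation. The sketch below uses a hypothetical `is_error` helper standing in for the provider's status check: a bare MagicMock auto-creates `status_code` as another MagicMock, whose ordering comparisons return NotImplemented, so the `>=` check raises TypeError.

```python
from unittest.mock import MagicMock


def is_error(response):
    """Stand-in for the provider check that replaced raise_for_status()."""
    return response.status_code >= 400


# Bare mock: status_code is itself a MagicMock, so the comparison fails.
bare = MagicMock()
try:
    is_error(bare)
    bare_raised = False
except TypeError:
    bare_raised = True

# Fixed fixture: set the attribute explicitly, as the six updated tests do.
fixed = MagicMock()
fixed.status_code = 200
fixed_result = is_error(fixed)
```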

User-visible behavior is identical to what 0.16.3 was supposed to
ship: LLMHTTPError raised on >=400 with endpoint, model, status code,
and the upstream's response body (truncated to 2 KB) so DeepSeek / Groq
/ vLLM / Together-AI users can actually diagnose 4xx instead of
seeing httpx's opaque one-liner.
… loudly

v0.17.0 Change 1. Closes the silent-zero-findings failure mode from issue
0xSteph#12 where DeepSeek emitted free-form text or made-up handler names and the
loop accepted the parser's fallback finish without warning.

- ResponseQuality enum (VALID / UNPARSEABLE / UNKNOWN_HANDLER) classified
  by AnthropicAgent._parse_action; agent stashes self.last_quality so the
  LLMAgent Protocol signature stays unchanged.
- WorkingMemory.consecutive_bad_responses counts non-VALID responses in a
  row; resets on a clean response.
- LoopConfig.bad_response_threshold (default 3, env-overridable via
  PTAI_AGENT_FALLBACK_THRESHOLD) trips the deterministic fallback with
  exit_reason="llm_non_cooperative".
- Early-finish detection: finish at iter < min_iterations_before_finish
  with 0 findings now runs the deterministic fallback (exit_reason
  "llm_finished_too_early"). Previously the loop silently continued,
  which is exactly how poeylizn ended with 0 findings.
- min_iterations_before_finish now env-overridable via
  PTAI_MIN_ITERATIONS_BEFORE_FINISH.
- CLI post-engagement summary prints a NOTE warning when exit_reason is
  one of the non-cooperative or fallback outcomes, so users know the
  LLM was the bottleneck (or the source) of any findings.

Tests: ResponseQuality classification for valid/unparseable/unknown
handler/swallowed-exception paths; threshold trip + counter reset;
early-finish-with-zero vs early-finish-with-findings divergence; env
override behaviour.
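
The counter-and-threshold behaviour can be sketched as below. Names follow the commit message where it gives them (ResponseQuality, consecutive_bad_responses, PTAI_AGENT_FALLBACK_THRESHOLD); the enum values, `bad_response_threshold`, and `record_response` are assumptions made for illustration, and the fallback on an unparseable env value is a guess.

```python
import os
from dataclasses import dataclass
from enum import Enum


class ResponseQuality(Enum):
    VALID = "valid"                      # string values are illustrative
    UNPARSEABLE = "unparseable"
    UNKNOWN_HANDLER = "unknown_handler"


@dataclass
class WorkingMemory:
    consecutive_bad_responses: int = 0


def bad_response_threshold(env=None):
    """Default 3, overridable via PTAI_AGENT_FALLBACK_THRESHOLD."""
    env = os.environ if env is None else env
    try:
        return int(env.get("PTAI_AGENT_FALLBACK_THRESHOLD", "3"))
    except ValueError:
        return 3  # assumed behaviour for a malformed override


def record_response(memory, quality, threshold):
    """Update the streak; return True when the deterministic fallback should trip."""
    if quality is ResponseQuality.VALID:
        memory.consecutive_bad_responses = 0
        return False
    memory.consecutive_bad_responses += 1
    return memory.consecutive_bad_responses >= threshold
```

A single clean response resets the streak, so only a run of consecutive bad responses trips the fallback.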
…nge 2)

Closes the second failure mode behind issue 0xSteph#12: factory returns a client
that LATER fails mid-loop with no diagnostic. Now bad configs fail loud at
startup with a concrete next-step block, and 'ptai doctor' lets users
sanity-check their config without running a scan.

Provider.validate() (LLMClient Protocol addition):
- Anthropic: GET /v1/models with key
- OpenAI / OpenAI-compat: GET {base_url}/models; on 404 fall back to a
  1-token complete() probe (covers vLLM, llama.cpp, private deployments
  that don't ship /models)
- Ollama: GET /api/tags + check the configured model is in the tag list,
  accepts both 'llama3.1' and 'llama3.1:latest' style names
- LiteLLM: 1-token complete() (only universal preflight for 300+ backends)
- CostTrackingLLMClient passes validate() through
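
For the Ollama check, accepting both 'llama3.1' and 'llama3.1:latest' amounts to normalising the implicit tag before comparing. A minimal sketch, with `model_is_pulled` as a hypothetical helper name and the tag names assumed to be already extracted from the /api/tags response:

```python
def model_is_pulled(configured, tag_names):
    """True if the configured model matches a pulled tag, treating a
    missing tag as ':latest'."""

    def normalise(name):
        return name if ":" in name else f"{name}:latest"

    wanted = normalise(configured)
    return any(normalise(tag) == wanted for tag in tag_names)
```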

Factory:
- Auto-detect chain now probes OLLAMA_HOST (or default localhost:11434)
  for /api/tags via a 500ms sync httpx call when no cloud API keys are
  set. Closes as8ASd3's report where OLLAMA_HOST was set but
  PENTEST_AI_LLM_PROVIDER wasn't, so the factory fell through to
  Anthropic with no key.
- New async validate_client(client) helper with hard timeout from
  PTAI_FACTORY_VALIDATE_TIMEOUT_MS (default 2000). On timeout: warn and
  continue (best-effort preflight). On LLMUnavailableError: propagate
  (auth failures always fail loud). Skippable via PTAI_FACTORY_VALIDATE=0
  for one release.
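
The timeout-warns, auth-propagates split can be sketched with asyncio.wait_for. This is an illustration under stated assumptions: `LLMUnavailableError` here is a local stand-in for the project's exception, the return strings are invented so the outcome is observable, and the logger name is arbitrary.

```python
import asyncio
import logging
import os

log = logging.getLogger("ptai.factory")  # logger name is an assumption


class LLMUnavailableError(Exception):
    """Local stand-in for the project's error type."""


async def validate_client(client, env=None):
    """Best-effort preflight: a timeout warns and continues, while
    LLMUnavailableError (for example an auth failure) propagates."""
    env = os.environ if env is None else env
    if env.get("PTAI_FACTORY_VALIDATE") == "0":
        return "skipped"
    timeout_s = int(env.get("PTAI_FACTORY_VALIDATE_TIMEOUT_MS", "2000")) / 1000
    try:
        await asyncio.wait_for(client.validate(), timeout=timeout_s)
    except asyncio.TimeoutError:
        log.warning("LLM preflight timed out after %.1fs; continuing", timeout_s)
        return "timeout"
    return "ok"
```

Only the timeout is caught, so any LLMUnavailableError raised by `client.validate()` reaches the caller unchanged and fails loud.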

CLI:
- 'ptai start --agent-mode' now awaits validate_client() after factory
  construction; exits 1 with the next-step block on auth failure.
- New 'ptai doctor' command prints resolved provider/model/endpoint,
  env-var surface (keys masked), storage paths, and a live preflight
  result. Exits 0 on success, 1 on validate failure. Distinct from
  cli/menu.py:_doctor (install-audit shell wrapper).

Tests:
- tests/test_llm_factory_validate.py: per-provider validate paths
  (200 / 401 / 404 fallback / connection-refused / model-missing);
  validate_client timeout-warns-and-continues + auth-propagates +
  env-skip behavior.
- tests/test_llm_factory_autodetect.py: OLLAMA_HOST probe drives the
  auto-detect path; explicit provider wins; cloud keys still win first.
- tests/test_cli_doctor.py: doctor exit codes, section presence, key
  masking, Ollama endpoint surfacing.
- tests/test_agent_mode_cli.py: now sets PTAI_FACTORY_VALIDATE=0 so the
  fake API key in the test doesn't 401 against real Anthropic.
…ge 4a)

Adds scripts/diag_post_engagement_hang.py: a SIGUSR1-driven faulthandler
wrapper that lets us inspect which thread is keeping the interpreter
alive after "Engagement complete". No production code changes.

The script can either exec a ptai command directly (and forward the
PID + signal recipe to stderr) or print the recipe for an already-
running PID. When kill -USR1 hits the registered handler, every
thread's stack dumps to stderr; the named thread IS the bug.

Use:
  python scripts/diag_post_engagement_hang.py --exec \
      "ptai start http://localhost:3000 --no-llm --no-oast"
  # in another terminal once the spinner stops moving:
  kill -USR1 <pid>
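
The core of such a harness is the standard library's faulthandler.register. The sketch below is not the script itself; `install_stack_dump_handler` is a hypothetical name, and it assumes a Unix platform where SIGUSR1 exists.

```python
import faulthandler
import signal
import sys


def install_stack_dump_handler(stream=sys.stderr):
    """Register a SIGUSR1 handler that dumps every thread's stack.

    Returns False on platforms without SIGUSR1 or faulthandler.register
    (for example Windows), True once the handler is installed.
    """
    if not hasattr(signal, "SIGUSR1") or not hasattr(faulthandler, "register"):
        return False
    # all_threads=True is what makes the lingering non-main thread visible.
    faulthandler.register(signal.SIGUSR1, file=stream, all_threads=True)
    return True
```

The stream must be a real file with a file descriptor and must stay open for as long as the handler is registered.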

Full investigation notes (which threads, ranked candidates, the
empirical isolated httpx reproducer, recommended fix shape for 4b)
live in the sibling notes repo at pentest-ai-notes/ per the
.gitignore rule excluding notes/ from this public tree.

Change 4b will add the targeted close() call(s) the harness pointed
at + a watchdog backstop, gated by PTAI_FORCE_EXIT_DISABLE /
PTAI_FORCE_EXIT_SECS.
…ng (v0.17.0 Change 4b)

Closes the post-engagement hang reported by poeylizn on issue 0xSteph#12.

Root cause (per Change 4a investigation):
The LLM client (an httpx.AsyncClient inside each provider) is created in
cli/main.py:_run_engagement() but never aclose()'d. Once any request
has been made through it, the connection pool keeps the interpreter
blocked on threading._shutdown waiting for non-daemon transport
machinery to wind down. That's the lock.acquire() traceback poeylizn
posted.

Targeted fix:
- Hoist llm_client out to function scope so the finally block can see it.
- Add `await llm_client.close()` to the existing finally chain at
  cli/main.py:828, right next to the established `await db.close()` and
  cache.close() pattern. No new shutdown framework.

Watchdog backstop (env-gated):
- New _arm_force_exit_timer() helper: daemon thread that sleeps for
  PTAI_FORCE_EXIT_SECS (default 5) then calls os._exit(0) after logging
  any still-alive non-daemon threads. Catches future regressions where
  some new code path holds a resource we haven't seen yet.
- Armed at the end of `start` in both code paths (the --ci early-return
  and the panel-then-sync path).
- Skippable via PTAI_FORCE_EXIT_DISABLE=1 so we can verify the root-cause
  fix is what's doing the work (not the watchdog masking a regression).
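
A sketch of the watchdog shape described above. The env var names and the default of 5 seconds come from the commit message; the function signature, the injectable `exit_fn` (added here so the behaviour can be exercised without killing the process), the thread name, and the fallback to 5 on an invalid value are assumptions.

```python
import os
import sys
import threading
import time


def arm_force_exit_timer(env=None, exit_fn=os._exit):
    """Start a daemon thread that force-exits after PTAI_FORCE_EXIT_SECS.

    Returns the thread, or None when disabled or when the delay is <= 0.
    """
    env = os.environ if env is None else env
    if env.get("PTAI_FORCE_EXIT_DISABLE") == "1":
        return None
    try:
        delay = float(env.get("PTAI_FORCE_EXIT_SECS", "5"))
    except ValueError:
        delay = 5.0  # assumed: an invalid value falls back to the default
    if delay <= 0:
        return None

    def _watchdog():
        time.sleep(delay)
        stuck = [
            t.name
            for t in threading.enumerate()
            if not t.daemon and t is not threading.main_thread()
        ]
        if stuck:
            print(f"force-exit: non-daemon threads still alive: {stuck}", file=sys.stderr)
        exit_fn(0)

    thread = threading.Thread(target=_watchdog, name="ptai-force-exit", daemon=True)
    thread.start()
    return thread
```

Because the thread is a daemon, it never keeps the interpreter alive on its own; it only fires when something else is still blocking shutdown after the delay.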

Tests (tests/test_cli_start_clean_exit.py):
- _arm_force_exit_timer: spawns daemon thread; honours DISABLE=1;
  invalid PTAI_FORCE_EXIT_SECS doesn't crash; SECS=0 short-circuits;
  end-to-end subprocess test confirms os._exit(0) fires after the
  configured delay.
- Static-contract guards: the llm_client init + close() and the
  _arm_force_exit_timer call sites must stay in cli.main.

Subprocess-against-honeypot integration test is deferred to the
Change 3 CI matrix (next commit) where it runs against a real Juice
Shop sidecar with --no-llm. That cell will assert exit_reason +
exit-within-N-seconds end-to-end.
Closes the loop on future issue-12-shaped reports where the reporter
was N versions behind a release that already fixed their bug. Every
'ptai start' and 'ptai doctor' run now nags if a newer stable is on
PyPI, modelled on Claude Code's update-available banner.

cli/_version_check.py (new):
- maybe_nag(current, *, console, deadline_ms=200) - synchronous entry,
  never raises, never blocks more than the deadline.
- Daemon-thread worker hits the PyPI JSON endpoint with a 1s hard
  timeout. Foreground polls a queue with deadline_ms. If the worker
  doesn't beat the deadline, the nag is skipped THIS run; the worker
  still writes the cache so the NEXT run benefits. That's how we get
  "never blocks startup" while still making progress.
- Cache: ~/.pentest-ai/version-check.json with 24h TTL. Same dir as
  findings.db and the evidence dir (no new top-level state location).
- Skips on: PTAI_SKIP_VERSION_CHECK=1, cache fresh, PyPI unreachable,
  PyPI returns non-200, the current local version reads as 'unknown'
  (dev install without metadata).
- PTAI_VERSION_OVERRIDE - test-only knob in the same env-gate pattern
  as the rest of v0.17.0. Lets users manually trigger the nag for
  verification without rebuilding.
- Pre-releases (a/b/rc/dev/alpha/beta) and yanked releases excluded
  when picking 'latest stable'.
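
The "never blocks more than the deadline" behaviour comes from pairing a daemon worker with a bounded queue read. A minimal sketch, where `fetch_with_deadline` and the `fetch` callable are hypothetical; the cache write the real worker performs is left out, and would live inside `fetch` or the worker body.

```python
import queue
import threading


def fetch_with_deadline(fetch, deadline_ms=200):
    """Run fetch() on a daemon thread; return its result, or None if it
    raises or does not finish within deadline_ms. Never raises."""
    results = queue.Queue(maxsize=1)

    def _worker():
        try:
            results.put(fetch())
        except Exception:
            results.put(None)  # swallow network errors: the nag is optional

    threading.Thread(target=_worker, daemon=True).start()
    try:
        return results.get(timeout=deadline_ms / 1000)
    except queue.Empty:
        # The worker keeps running in the background and may still finish.
        return None
```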

cli/main.py wires it into exactly two call sites:
- 'ptai start' before the AUP gate so the user sees Update-available
  ahead of the engagement banner.
- 'ptai doctor' right after the version header.

Tests (tests/test_version_check.py, 11 cases):
- happy path: newer stable -> nag with the pipx upgrade line
- same version -> no nag
- PyPI 500 / OSError -> no nag, no exception
- PTAI_SKIP_VERSION_CHECK=1 -> zero HTTP calls
- PTAI_VERSION_OVERRIDE wins over the importlib.metadata value
- fresh cache (<24h) -> zero HTTP calls
- stale cache + 5s-slow PyPI mock -> returns under 500ms, no nag,
  worker still completes in background
- pre-releases not picked as 'latest'
- yanked releases not picked as 'latest'
- unknown local version -> silent skip
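
The pre-release and yanked exclusions exercised by the tests above could look roughly like this. The sketch assumes the PyPI JSON `releases` mapping (version string to a list of file entries, each with a `yanked` flag) and uses a plain digits-and-dots regex in place of a full version parser; `latest_stable` is a hypothetical name, and the regex also excludes post-releases, which the real code may treat differently.

```python
import re

# Only plain numeric versions such as "0.17.0" count as stable here.
_STABLE = re.compile(r"^\d+(?:\.\d+)*$")


def latest_stable(releases):
    """Return the highest non-yanked, non-pre-release version, or None."""
    candidates = []
    for version, files in releases.items():
        if not _STABLE.match(version):
            continue  # a/b/rc/dev/alpha/beta suffixes fail the regex
        if not files or all(f.get("yanked", False) for f in files):
            continue  # nothing installable under this version
        candidates.append(tuple(int(part) for part in version.split(".")))
    if not candidates:
        return None
    return ".".join(str(part) for part in max(candidates))
```

Comparing integer tuples rather than strings matters: "0.9.0" sorts above "0.16.4" as text but below it numerically.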

All tests use mocks; the module never makes a real network call during
pytest. The cache file is redirected into tmp_path per test so the
user's real ~/.pentest-ai/version-check.json stays untouched.
…nge 3)

Closes the CI coverage hole that let issue 0xSteph#12 ship: the release gate
was only testing Claude-driven Juice Shop, the easiest possible diagonal.
The build can now only go green if both:

  - ollama cell: the agent loop drives Juice Shop via a local
    qwen2.5-coder:7b sidecar and emits at least 1 finding. Catches the
    silent-zero-findings outcome directly.
  - deterministic cell: PTAI_NO_LLM=1 path runs cleanly. Confirms the
    fallback Change 1 triggers actually produces a valid engagement.

Both cells share the same Juice Shop service container, same lifecycle
test file, same exit-reason allowlist. fail-fast=false so a single-cell
flake doesn't mask the other.

The plan-prescribed 4x2 provider x target matrix is intentionally NOT
in scope. One non-Anthropic cell catches the regression class we have
evidence of; expanding is a follow-up once we have baseline data.

Anthropic is not added to the matrix - it would require a paid CI
secret and the existing path is already exercised by maintainer test
runs.

Ollama model layer cache (per Gap 4):
- actions/cache step caches ~/.ollama/models keyed
  ollama-models-qwen2.5-coder-7b-v1. -v1 is a manual cache-bust knob.
- Cache restore turns the 4.7 GB pull instant on subsequent runs;
  cold-cache runs still work, just slower.
- timeout-minutes up from 30 to 40 to absorb the first cold pull.

CLI:
- _ci_print's engagement_complete event now includes exit_reason from
  WorkingMemory so the matrix assertion can gate on it.

Tests:
- tests/test_engagement_lifecycle_e2e.py:test_matrix_cell_exits_cleanly_against_juiceshop
  is a matrix-only test (skips unless PTAI_E2E_MATRIX_CELL is set).
  Spawns ptai start --ci as a subprocess against Juice Shop, streams
  stdout, parses the engagement_complete JSON line, asserts:
    1. process exits within 30s of the JSON banner (Change 4b guarantee)
    2. exit_reason in {finished, coverage, deterministic,
       llm_non_cooperative, llm_finished_too_early}
    3. Ollama cell only: total_findings >= 1
  Deterministic cell is allowed 0 findings because some probes
  legitimately don't trigger against Juice Shop.

Local-skip semantics: the test self-skips when PTAI_E2E_MATRIX_CELL is
unset, so unit-test pytest runs aren't affected. Set the env locally to
run it against an Ollama instance + a Juice Shop container.
…rage gap)

Documents all five v0.17.0 Changes ahead of the version bump:

  - Change 1: garbage + give-up LLM detection -> deterministic fallback
  - Change 2: ptai doctor + factory-time validate() + Ollama auto-detect
  - Change 3: 2-cell release-e2e matrix (ollama + deterministic) with
    model layer cache
  - Change 4a: SIGUSR1 hang-investigation harness
  - Change 4b: LLM client close in engagement finally + watchdog backstop
  - Change 5: PyPI version-check startup nag

Steve cuts the actual version bump + tag.
…ama3.1 default

Live test surfaced this on the v0.17.0 branch: with OLLAMA_HOST set,
no cloud key, no PENTEST_AI_MODEL set, the factory was picking ollama
+ llama3.1 (the static default) regardless of what the user actually
had pulled. Validate then failed loud with 'ollama pull llama3.1',
which is the wrong remediation for a user who has qwen2.5-coder:7b
(or anything else) already pulled.

The v0.17.0 plan called for 'first model in the tag list' in the
auto-detect branch; the original implementation skipped that step.
This commit adds _ollama_first_model() and uses it when both 'model'
and 'env_model' are empty. Falls back to 'llama3.1' only if the probe
fails. Explicit PENTEST_AI_MODEL still wins.
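
The resolution order can be sketched as below, working on an already-parsed /api/tags payload so no network call is involved. `ollama_first_model` and `resolve_ollama_model` are illustrative names, not the factory's functions; a payload of None stands for a failed probe.

```python
def ollama_first_model(tags_payload):
    """Return the first model name from an /api/tags payload, or None."""
    models = (tags_payload or {}).get("models") or []
    for entry in models:
        name = entry.get("name")
        if name:
            return name
    return None


def resolve_ollama_model(explicit, env_model, tags_payload):
    """explicit arg > PENTEST_AI_MODEL > first pulled model > 'llama3.1'."""
    return explicit or env_model or ollama_first_model(tags_payload) or "llama3.1"
```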

Verified live: doctor against an Ollama with only qwen2.5-coder:7b
pulled now reports 'Resolved provider: ollama, Model: qwen2.5-coder:7b'
and validates OK.
…strator escalation (v0.17.0 Change 1c+1d)

Closes the buyer-blocker that local Ollama tests against the honeypot
surfaced: qwen2.5-coder:7b emits perfectly valid JSON, picks real
registered handlers like meta.set_auth, but its action choices are
unhelpful and the engagement ends with zero findings against a target
that has 63 findings via the orchestrator path.

Two layers stacked:

Change 1c - agent_loop safety net:
  When run_agent_loop ends on max_iterations or coverage with zero
  findings AND fallback_to_deterministic=True, run the existing
  _run_deterministic_fallback (probe.* with {} args) and switch
  exit_reason to the new "llm_unproductive". Catches per-loop
  outcomes that the per-response quality check doesn't catch.

Change 1d - cli/main.py orchestrator escalation (the real fix):
  After run_agent_loop returns, check db.get_findings for the
  engagement. If still zero, escalate to AgentOrchestrator on the
  SAME engagement_id. The orchestrator runs the proper phase
  pipeline (recon -> discovery -> probes with real args -> chain
  -> validate), which is what produces real findings against the
  honeypot. agent_loop's internal fallback alone isn't enough -
  it calls probe.* with {} args which produces nothing without
  recon-discovered endpoints.

  exit_reason is tagged "<original>+orchestrator_escalation" so the
  user-facing summary tells them which paths ran. The CLI summary
  surfaces a clear NOTE explaining the LLM was unproductive and the
  orchestrator took over.
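
The escalation decision and the exit_reason tagging reduce to a small rule. A sketch under assumptions: `final_exit_reason` is a hypothetical helper, and the findings count is taken as given rather than read from the database.

```python
ESCALATION_SUFFIX = "+orchestrator_escalation"


def final_exit_reason(loop_exit_reason, findings_count):
    """Return (exit_reason, should_escalate) after the agent loop ends.

    With zero findings the orchestrator takes over and the original
    reason is kept as a prefix so the summary shows both paths.
    """
    if findings_count == 0:
        return f"{loop_exit_reason}{ESCALATION_SUFFIX}", True
    return loop_exit_reason, False
```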

Live verification (qwen2.5-coder:7b against tests/honeypot/, 4 agent
iterations then escalation):
  - exit=0 (clean), elapsed 193s
  - 63 findings: 13 critical, 6 high, 2 medium, 4 low, 38 info
  - exit_reason="llm_unproductive+orchestrator_escalation"
  - escalation warning logged with the original exit_reason

Tests:
  - tests/test_agent_loop.py: three new tests for Change 1c
    (safety net fires; counter-case with pre-existing findings;
    disabled when fallback_to_deterministic=False).
  - tests/test_agent_mode_cli.py: db.get_findings now returns a
    non-empty list so the escalation path doesn't fire in this
    test (which is verifying agent-mode dispatch, not escalation).
  - tests/test_engagement_lifecycle_e2e.py: allowed-exit-reason set
    expanded with "llm_unproductive" (the matrix CI cell now
    accepts this outcome).
  - tests/test_cli_start_clean_exit.py: thread-name asserts switched
    to count-based to fix order-sensitivity.
Closes issue 0xSteph#12 (silent-exit + post-engagement hang + zero-findings
against vulnerable targets). Five plan-Changes + the live-test
additions (cooperative-but-unproductive safety net, orchestrator
escalation, auto-detect-first-pulled-model fix).

Live-verified end-to-end with Ollama qwen2.5-coder:7b against the
local honeypot: 63 findings via the escalation path, clean exit
in 193s.

See CHANGELOG [0.17.0] for the full surface.
v0.17.0 was tagged but blocked from publishing to PyPI by the new matrix CI test
itself, not by any production code regression. Two test bugs:

  1. Deterministic cell uses --no-llm (legacy AgentOrchestrator path)
     which doesn't set exit_reason; the assertion rejected None.
  2. Ollama cell hit the 900s per-test timeout because qwen2.5-coder:7b
     on the GitHub Actions runner (no GPU) is slow + the orchestrator
     escalation adds another ~5min on top of the agent loop.

Fixes (test-only, no production code touched):
  - Allow exit_reason=None for the deterministic cell.
  - Strip '+orchestrator_escalation' suffix from exit_reason before
    checking the allowlist (Change 1d adds the suffix when escalation
    fires).
  - Cap --agent-max-iter at 3 for the ollama cell - the escalation
    produces findings anyway, so more iterations just burn time.
  - Per-test timeout bumped to 1800s on the matrix test specifically;
    the 900s file-wide default stays for the original lifecycle tests.
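
The relaxed assertion can be sketched as a small predicate. The allowlist entries and the suffix come from the commit messages; `exit_reason_is_allowed` and the cell name strings are assumptions about how the test might be organised.

```python
ALLOWED_EXIT_REASONS = {
    "finished",
    "coverage",
    "deterministic",
    "llm_non_cooperative",
    "llm_finished_too_early",
    "llm_unproductive",
}
ESCALATION_SUFFIX = "+orchestrator_escalation"


def exit_reason_is_allowed(exit_reason, cell):
    """Check an engagement's exit_reason against the matrix allowlist.

    None is accepted only for the deterministic cell, whose legacy
    orchestrator path does not set exit_reason at all.
    """
    if exit_reason is None:
        return cell == "deterministic"
    base = exit_reason.removesuffix(ESCALATION_SUFFIX)  # Python 3.9+
    return base in ALLOWED_EXIT_REASONS
```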

The v0.17.0 production fixes (silent-exit, hang, orchestrator
escalation, doctor, version nag) are unchanged. This release ships
the same code with a CI test that actually passes.