diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index e788b3f..539d2ef 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -102,6 +102,7 @@ jobs: timeout-minutes: 10 env: PMPROXY_URL: http://localhost:44322 + GRAFANA_URL: http://localhost:3000 steps: - uses: actions/checkout@v4 @@ -125,6 +126,14 @@ jobs: sleep 2 done + - name: Wait for Grafana + run: | + echo "Waiting for Grafana..." + for i in $(seq 1 30); do + curl -sf http://localhost:3000/api/health && break + sleep 2 + done + - name: Test (E2E) run: uv run pytest -m e2e --junitxml=results-e2e.xml diff --git a/CLAUDE.md b/CLAUDE.md index c67b13a..fadf434 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -96,6 +96,39 @@ podman compose down - PCP image **requires `privileged: true`** — it uses systemd as PID 1; without it the container exits immediately (code 255) - Redis host env var is **`KEY_SERVERS: redis-stack:6379`** (NOT `PCP_REDIS_HOST`) — that's what the container entrypoint reads; wrong value causes pmproxy to hang on all series/search calls +- **Podman splits `CMD` array args on semicolons** — Python one-liners with `;` get mangled. Always use `CMD-SHELL` for healthchecks containing semicolons: `["CMD-SHELL", "python -c 'import foo; foo.bar()'"]` + +## Grafana Compose Gotchas + +- PCP plugin is **unsigned** — must use `GF_INSTALL_PLUGINS` with the GitHub release ZIP URL, not the Grafana catalog shorthand +- All PCP sub-plugins must be listed in `GF_PLUGINS_ALLOW_LOADING_UNSIGNED_PLUGINS` (app, valkey-datasource, vector-datasource, bpftrace-datasource, flamegraph-panel, breadcrumbs-panel, troubleshooting-panel) +- mcp-grafana **requires authentication** — it doesn't support anonymous access. Basic auth (`admin/admin`) is simplest for the dev stack +- Grafana healthcheck uses `curl -sf http://localhost:3000/api/health` — the container must have curl installed (official image does) +- Datasources are auto-provisioned from `grafana/provisioning/datasources/pcp.yaml` — mounted read-only into the container + +## Grafana Dashboard Conventions (Investigation Output) + +When creating dashboards as part of an investigation: + +| Convention | Value | +|-----------|-------| +| Folder | `pmmcp-triage` (configurable via `PMMCP_GRAFANA_FOLDER`) | +| Naming | `YYYY-MM-DD ` (e.g., `2026-03-10 memory cascade saas-prod-01`) | +| Tagging | Always include `pmmcp-generated` | +| Deeplink | After creation, call `generate_deeplink` and return URL to user | +| Auto-trigger | Offer visualisation when findings span 3+ metrics or 2+ subsystems | + +## Investigation Prompt Hierarchy + +The investigation prompt hierarchy is: + +``` +session_init → coordinate_investigation → specialist_investigate (×6) +``` + +- **ALWAYS** start broad investigations with `coordinate_investigation` +- **DO NOT** call raw tools (`pcp_fetch_timeseries`, `pcp_detect_anomalies`) directly for open-ended investigations +- Specialist prompts are dispatched by the coordinator — don't call them directly unless targeting a specific subsystem ## CI / Local E2E Parity — CRITICAL @@ -211,16 +244,27 @@ Rules: **Mandatory before any `git push`** (required by Constitution v1.2.0, Principle II). -Run either: +The full check runs: lint → format → unit+integration tests (≥80% coverage) → E2E tests (compose stack + container healthchecks). + +**If you are Claude running in a VM** (no podman/docker available): +- Run `just ci` as a minimum — this covers lint, format, and unit+integration tests +- Do **not** attempt `pre-push-sanity.sh`, `just e2e`, or any `podman compose` commands — they will fail without a container runtime +- Prompt the user to run the full suite on their host before pushing: + ``` + ⚠️ I've run `just ci` (lint + tests) — all green. + Please run `./pre-commit.sh` or `just e2e` on your host to complete E2E validation before pushing. + ``` + +**If you have container access** (or the user is running directly): ```bash -scripts/pre-push-sanity.sh +./pre-commit.sh ``` or invoke the Claude skill: ``` /pre-push-sanity ``` -The check runs in order: lint → format → unit+integration tests (≥80% coverage) → E2E tests (starts compose stack automatically via `just e2e`). E2E is **never skipped** — the compose stack must be buildable and all containers must pass healthchecks before tests run. +E2E is **never skipped** by humans — the compose stack must be buildable and all containers must pass healthchecks before tests run. ## Active Technologies @@ -232,6 +276,8 @@ The check runs in order: lint → format → unit+integration tests (≥80% cove - Python 3.11+ + `mcp[cli]` ≥1.2.0 (FastMCP), `pydantic` v2.x — no new dependencies (010-specialist-agents) - N/A — prompts are stateless text generators (010-specialist-agents) - Python 3.11+ + `mcp[cli]` ≥1.2.0 (FastMCP), `pydantic` v2.x — no new dependencies (011-specialist-baselining) +- N/A (infrastructure-only; compose YAML, Grafana provisioning YAML) + `grafana/grafana:latest`, `mcp/grafana` (Docker Hub), `performancecopilot-pcp-app` plugin v5.3.0 (012-grafana-compose) +- Ephemeral — no persistent volumes for Grafana (012-grafana-compose) ## Recent Changes - 002-add-integration-e2e-tests: Added Python 3.11+ + `mcp[cli]` ≥1.26.0 (FastMCP + ClientSession), `anyio` (memory streams), `respx` (already present — mocks httpx for integration tier), `pytest-asyncio` (already present) diff --git a/Justfile b/Justfile index 31cca18..0ed2c05 100644 --- a/Justfile +++ b/Justfile @@ -35,5 +35,13 @@ ci: check test # Uses --wait to match CI behaviour — all containers must be healthy before tests run e2e: PROFILES_DIR=./profiles/e2e podman compose up -d --wait --wait-timeout 120 - PMPROXY_URL=http://localhost:44322 uv run python -m pytest -m e2e -q + PMPROXY_URL=http://localhost:44322 GRAFANA_URL=http://localhost:3000 MCP_GRAFANA_URL=http://localhost:8000 uv run python -m pytest -m e2e -q @echo "Stack still running — run 'podman compose down --volumes' to purge seeded data before next run" + +# Brings up the full stack, seeded (not e2e) +doit: + podman compose up -d --wait --wait-timeout 120 + +# Removes all containers and their volumes for a clean state +teardown: + podman compose down --volumes \ No newline at end of file diff --git a/README.md b/README.md index da86098..4aedca6 100644 --- a/README.md +++ b/README.md @@ -19,12 +19,14 @@ pmproxy's time-series backend, and has everything ready for Claude to analyse. podman compose up -d ``` -This runs four services in order: +This runs six services in order: 1. **pmlogsynth-generator** — generates PCP archives from `profiles/scenarios/saas-diurnal-week.yml` 2. **redis-stack** — time-series backend (Valkey/Redis, port 6379) 3. **pmlogsynth-seeder** — loads the archives into the time-series store 4. **pcp** — pmcd + pmproxy, ready to serve queries (port 44322) +5. **grafana** — Grafana with PCP plugin and auto-provisioned datasources (port 3000) +6. **mcp-grafana** — MCP server for Grafana, SSE transport (port 8000) The generator and seeder are one-shot jobs; allow ~30–60 seconds for them to complete. Check progress with: @@ -39,7 +41,13 @@ Once seeded, verify data is queryable: curl -s "http://localhost:44322/series/query?expr=kernel.all.cpu.user" | head -c 200 ``` -### 2. Connect pmmcp to Claude Code +### 2. Browse Grafana dashboards + +Open http://localhost:3000 — no login required (anonymous admin is enabled). Navigate to **Connections → Data sources** to see the auto-provisioned PCP Valkey (historical) and PCP Vector (live) datasources. + +The `mcp-grafana` service exposes a Grafana MCP server at http://localhost:8000/sse for AI agents that need to create dashboards or query Grafana programmatically. + +### 3. Connect pmmcp to Claude Code ```bash git clone @@ -62,7 +70,7 @@ Add to `.mcp.json` in your project root (or `~/.claude/mcp.json` for global conf Restart Claude Code (or `/mcp` to reload) and confirm **pmmcp** appears in the connected servers list. -### 3. Ask Claude to investigate +### 4. Ask Claude to investigate The seeded dataset is `saas-prod-01` — a simulated production host with a week of realistic diurnal traffic. Try these to get a feel for what pmmcp can do: @@ -93,7 +101,7 @@ Compare the morning peak to the overnight baseline on saas-prod-01 across CPU, m /investigate_subsystem subsystem=cpu host=saas-prod-01 ``` -### 4. Tear down when done +### 5. Tear down when done ```bash podman compose down --volumes @@ -189,6 +197,8 @@ See **Running pmmcp** below for all CLI flags and environment variables. | `PMMCP_TRANSPORT` | `stdio` | MCP transport mode | | `PMMCP_HOST` | `127.0.0.1` | Bind host for HTTP transport | | `PMMCP_PORT` | `8080` | Bind port for HTTP transport | +| `PMMCP_GRAFANA_FOLDER` | `pmmcp-triage` | Grafana folder for investigation dashboards | +| `PMMCP_REPORT_DIR` | `~/.pmmcp/reports` | Output directory for HTML fallback reports | **Precedence:** CLI flag > environment variable > default. diff --git a/docker-compose.yml b/docker-compose.yml index c038537..5593a66 100644 --- a/docker-compose.yml +++ b/docker-compose.yml @@ -78,7 +78,47 @@ services: pcp: condition: service_started healthcheck: - test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8080/healthcheck')"] + # Liveness only — confirms HTTP server accepts connections (ignores pmproxy status) + test: ["CMD-SHELL", "python -c 'import socket; socket.create_connection((\"localhost\",8080),2).close()'"] interval: 10s timeout: 5s - retries: 3 + retries: 6 + start_period: 10s + + # Grafana with PCP plugin — browse http://localhost:3000 (anonymous admin, no login) + grafana: + image: grafana/grafana:latest + ports: + - "3000:3000" + environment: + # PCP plugin is unsigned — install from GitHub release ZIP and allow all its sub-plugins + GF_INSTALL_PLUGINS: "https://github.com/performancecopilot/grafana-pcp/releases/download/v5.3.0/performancecopilot-pcp-app-5.3.0.zip;performancecopilot-pcp-app" + GF_PLUGINS_ALLOW_LOADING_UNSIGNED_PLUGINS: "performancecopilot-pcp-app,performancecopilot-valkey-datasource,performancecopilot-vector-datasource,performancecopilot-bpftrace-datasource,performancecopilot-flamegraph-panel,performancecopilot-breadcrumbs-panel,performancecopilot-troubleshooting-panel" + # Anonymous admin for browser access; basic auth creds for mcp-grafana API access + GF_AUTH_ANONYMOUS_ENABLED: "true" + GF_AUTH_ANONYMOUS_ORG_ROLE: Admin + GF_SECURITY_ADMIN_USER: admin + GF_SECURITY_ADMIN_PASSWORD: admin + volumes: + - ./grafana/provisioning:/etc/grafana/provisioning:ro + depends_on: + pcp: + condition: service_started + healthcheck: + test: ["CMD", "curl", "-sf", "http://localhost:3000/api/health"] + interval: 10s + timeout: 5s + retries: 12 + + # mcp-grafana — MCP server for Grafana, SSE transport on http://localhost:8000/sse + mcp-grafana: + image: mcp/grafana + ports: + - "8000:8000" + environment: + GRAFANA_URL: http://grafana:3000 + GRAFANA_USERNAME: admin + GRAFANA_PASSWORD: admin + depends_on: + grafana: + condition: service_healthy diff --git a/docs/superpowers/plans/2026-03-10-investigation-hierarchy-guardrails.md b/docs/superpowers/plans/2026-03-10-investigation-hierarchy-guardrails.md new file mode 100644 index 0000000..2368bd5 --- /dev/null +++ b/docs/superpowers/plans/2026-03-10-investigation-hierarchy-guardrails.md @@ -0,0 +1,592 @@ +# Investigation Hierarchy Guardrails — Implementation Plan + +> **For agentic workers:** REQUIRED: Use superpowers:subagent-driven-development (if subagents available) or superpowers:executing-plans to implement this plan. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Guide Claude toward the coordinator prompt instead of bypassing the investigation hierarchy, and add Grafana visualisation as Phase 3 of the coordinator workflow. + +**Architecture:** Docstring additions to 4 tool functions, prompt content changes to 3 prompt modules, and 2 new config fields. No new tools, no new files beyond tests. All changes are text/config — no behavioural changes to existing tool logic. + +**Tech Stack:** Python 3.11+, pydantic-settings, pytest + +--- + +## Chunk 1: Config & Tool Docstrings + +### Task 1: Add `grafana_folder` and `report_dir` config fields + +**Note:** The design spec says `PmproxyConfig` but these are server-level settings (Grafana +folder, report output dir), not pmproxy connection settings. `ServerConfig` (env prefix +`PMMCP_`) is the correct home — `PmproxyConfig` (env prefix `PMPROXY_`) is for pmproxy +connection parameters only. + +**Files:** +- Modify: `src/pmmcp/config.py` +- Create: `tests/unit/test_config.py` + +- [ ] **Step 1: Write failing tests for new config fields** + +```python +# tests/unit/test_config.py — create this file + +def test_server_config_grafana_folder_default(): + """PMMCP_GRAFANA_FOLDER defaults to 'pmmcp-triage'.""" + from pmmcp.config import ServerConfig + cfg = ServerConfig() + assert cfg.grafana_folder == "pmmcp-triage" + + +def test_server_config_grafana_folder_env_override(monkeypatch): + """PMMCP_GRAFANA_FOLDER can be overridden via env var.""" + from pmmcp.config import ServerConfig + monkeypatch.setenv("PMMCP_GRAFANA_FOLDER", "my-triage") + cfg = ServerConfig() + assert cfg.grafana_folder == "my-triage" + + +def test_server_config_report_dir_default(): + """PMMCP_REPORT_DIR defaults to ~/.pmmcp/reports/.""" + from pmmcp.config import ServerConfig + from pathlib import Path + cfg = ServerConfig() + assert cfg.report_dir == Path("~/.pmmcp/reports") + + +def test_server_config_report_dir_env_override(monkeypatch): + """PMMCP_REPORT_DIR can be overridden via env var.""" + from pmmcp.config import ServerConfig + from pathlib import Path + monkeypatch.setenv("PMMCP_REPORT_DIR", "/tmp/reports") + cfg = ServerConfig() + assert cfg.report_dir == Path("/tmp/reports") +``` + +- [ ] **Step 2: Run tests to verify they fail** + +Run: `uv run pytest tests/unit/test_config.py -v -k "grafana_folder or report_dir"` +Expected: FAIL — `ServerConfig` has no `grafana_folder` or `report_dir` attributes + +- [ ] **Step 3: Add fields to ServerConfig** + +In `src/pmmcp/config.py`, add two fields to `ServerConfig`: + +```python +class ServerConfig(BaseSettings): + transport: Literal["stdio", "streamable-http"] = "stdio" + host: str = "127.0.0.1" + port: int = 8080 + grafana_folder: str = "pmmcp-triage" + report_dir: Path = Path("~/.pmmcp/reports") + + model_config = {"env_prefix": "PMMCP_"} +``` + +- [ ] **Step 4: Run tests to verify they pass** + +Run: `uv run pytest tests/unit/test_config.py -v -k "grafana_folder or report_dir"` +Expected: PASS (4 tests) + +- [ ] **Step 5: Commit** + +```bash +git add src/pmmcp/config.py tests/unit/test_config.py +git commit -m "feat: add grafana_folder and report_dir config fields + +Per issue #10 — configurable Grafana folder (default pmmcp-triage) +and HTML fallback report directory (default ~/.pmmcp/reports)." +``` + +--- + +### Task 2: Add coordinator breadcrumbs to tool docstrings + +**Files:** +- Modify: `src/pmmcp/tools/investigate.py:190-207` (pcp_quick_investigate docstring) +- Modify: `src/pmmcp/tools/timeseries.py:133-155` (pcp_fetch_timeseries docstring) +- Modify: `src/pmmcp/tools/anomaly.py:127-142` (pcp_detect_anomalies docstring) +- Modify: `src/pmmcp/tools/scanning.py:130-147` (pcp_scan_changes docstring) + +No test file — docstring changes are not testable via unit tests (the MCP tool description is extracted from the docstring at registration time; testing it would require spinning up the full MCP server). The existing contract tests in `tests/contract/test_prompts.py` cover prompt registration; tool docstrings are verified by manual inspection. + +- [ ] **Step 1: Add breadcrumb to `pcp_quick_investigate` docstring** + +In `src/pmmcp/tools/investigate.py`, append after the `host` arg line (before the closing `"""`): + +```python + """Start here for open-ended investigation. Only requires a time of interest. + + ...existing docstring... + + Note: For broad 'something is wrong' investigations spanning multiple subsystems, + prefer the ``coordinate_investigation`` prompt — it dispatches 6 specialist + sub-agents in parallel for comprehensive coverage. + """ +``` + +- [ ] **Step 2: Add breadcrumb to `pcp_fetch_timeseries` docstring** + +In `src/pmmcp/tools/timeseries.py`, append before the closing `"""`: + +```python + Note: For broad investigations, start with the ``coordinate_investigation`` prompt + rather than fetching metrics directly — it orchestrates a full multi-subsystem sweep. +``` + +- [ ] **Step 3: Add breadcrumb to `pcp_detect_anomalies` docstring** + +In `src/pmmcp/tools/anomaly.py`, append before the closing `"""`: + +```python + Note: For broad investigations, start with the ``coordinate_investigation`` prompt + rather than running anomaly detection directly — it orchestrates a full multi-subsystem sweep. +``` + +- [ ] **Step 4: Add breadcrumb to `pcp_scan_changes` docstring** + +In `src/pmmcp/tools/scanning.py`, append before the closing `"""`: + +```python + Note: For broad investigations, start with the ``coordinate_investigation`` prompt + rather than scanning changes directly — it orchestrates a full multi-subsystem sweep. +``` + +- [ ] **Step 5: Run lint to verify no formatting issues** + +Run: `uv run ruff check src/pmmcp/tools/investigate.py src/pmmcp/tools/timeseries.py src/pmmcp/tools/anomaly.py src/pmmcp/tools/scanning.py` +Expected: No errors + +- [ ] **Step 6: Commit** + +```bash +git add src/pmmcp/tools/investigate.py src/pmmcp/tools/timeseries.py src/pmmcp/tools/anomaly.py src/pmmcp/tools/scanning.py +git commit -m "feat: add coordinator breadcrumbs to tool docstrings + +Tool descriptions are the last thing Claude reads before deciding what +to call. These one-liners nudge toward coordinate_investigation for +broad investigations instead of bypassing the prompt hierarchy." +``` + +--- + +## Chunk 2: Prompt Changes + +### Task 3: Make `session_init` assertive about coordinator + add Grafana preflight + +**Files:** +- Modify: `src/pmmcp/prompts/session_init.py` +- Modify: `tests/unit/test_prompts_session_init.py` + +- [ ] **Step 1: Write failing tests for assertive language and Grafana preflight** + +Append to `tests/unit/test_prompts_session_init.py`: + +```python +def test_session_init_assertive_coordinator_language(): + """session_init uses assertive language (ALWAYS/DO NOT) for coordinator guidance.""" + result = _session_init_impl() + full_text = " ".join(msg["content"] for msg in result) + upper_text = full_text.upper() + assert "ALWAYS" in upper_text or "DO NOT" in upper_text or "MUST" in upper_text, ( + "session_init must use assertive language directing to coordinate_investigation" + ) + + +def test_session_init_coordinator_before_derived_metrics(): + """Coordinator guidance appears before the derived metrics registration steps.""" + result = _session_init_impl() + full_text = " ".join(msg["content"] for msg in result) + coord_pos = full_text.find("coordinate_investigation") + derive_pos = full_text.find("Step 1") + assert coord_pos < derive_pos, ( + "Coordinator guidance must appear before Step 1 (derived metrics)" + ) + + +def test_session_init_grafana_preflight_references(): + """session_init includes Grafana preflight discovery workflow.""" + result = _session_init_impl() + full_text = " ".join(msg["content"] for msg in result) + assert "list_datasources" in full_text, ( + "session_init must reference list_datasources for Grafana preflight" + ) + assert "performancecopilot" in full_text.lower(), ( + "session_init must reference PCP datasource type for validation" + ) + + +def test_session_init_grafana_fallback_cascade(): + """session_init includes fallback cascade when Grafana is unavailable.""" + result = _session_init_impl() + full_text = " ".join(msg["content"] for msg in result).lower() + assert "fallback" in full_text or "unavailable" in full_text, ( + "session_init must describe fallback when Grafana is unavailable" + ) +``` + +- [ ] **Step 2: Run tests to verify they fail** + +Run: `uv run pytest tests/unit/test_prompts_session_init.py -v -k "assertive or before_derived or preflight or fallback_cascade"` +Expected: FAIL — current session_init lacks assertive language, preflight, and fallback + +- [ ] **Step 3: Rewrite `_session_init_impl` content** + +In `src/pmmcp/prompts/session_init.py`, restructure the `content` f-string in `_session_init_impl`. The new structure is: + +1. **IMPORTANT block** (top) — assertive coordinator guidance +2. **Grafana Preflight** — datasource discovery + validation workflow +3. **Step 1** — Register Derived Metrics (existing, unchanged) +4. **Step 2** — Verify Availability (existing, unchanged) +5. **Step 3** — Report Results (existing, unchanged) + +Replace the `content = f"""..."""` block with: + +```python + content = f"""\ +You are initialising a PCP monitoring session{host_clause}{timerange_clause}. + +## IMPORTANT — Investigation Entry Point + +**ALWAYS** use the `coordinate_investigation` prompt for broad performance investigations. \ +It dispatches 6 specialist sub-agents (cpu, memory, disk, network, process, crosscutting) \ +in parallel, then synthesises findings into a unified root-cause narrative with \ +cross-subsystem correlation. **Do NOT** call individual tools (`pcp_fetch_timeseries`, \ +`pcp_detect_anomalies`, etc.) or specialist prompts directly unless you have a specific, \ +targeted question about a single known metric. + +## Grafana Preflight — Datasource Discovery + +Before any investigation, check whether Grafana visualisation is available: + +1. Call `mcp-grafana.list_datasources` to enumerate configured datasources. +2. Look for a datasource of type `performancecopilot-valkey-datasource` or \ +`performancecopilot-vector-datasource`. +3. If found, note its **UID** and **URL** for later dashboard creation. +4. **Match**: Grafana features enabled — investigation prompts will create dashboards \ +in the `pmmcp-triage` folder, named `YYYY-MM-DD `, tagged `pmmcp-generated`. +5. **No match / no mcp-grafana**: Grafana unavailable — fall back to text output or \ +offer an HTML report. This is not an error; pmmcp works fully standalone. + +## Step 1 — Register Derived Metrics + +Call `pcp_derive_metric` for each metric below: + +{metric_lines} + +Example call for the first metric: +``` +pcp_derive_metric( + name="derived.cpu.utilisation", + expr="100 - rate(kernel.all.cpu.idle) / hinv.ncpu / 10" +) +``` + +Registration is idempotent — re-registering an existing name overwrites silently. + +## Step 2 — Verify Availability + +After registering, verify each derived metric is resolvable by calling `pcp_fetch_live`: + +{verify_lines} + +## Step 3 — Report Results + +For each metric, report whether registration and verification succeeded or failed: + +- **Success**: `derived.cpu.utilisation` registered and verified ✓ +- **Failure**: `derived.disk.utilisation` failed verification — \ +`disk.all.avactive` may not be available on this host. Note and continue. + +**Do not abort if one or more verifications fail.** Report which metrics are available \ +and which are not, so downstream investigations know which derived metrics can be used. +""" +``` + +- [ ] **Step 4: Run tests to verify they pass** + +Run: `uv run pytest tests/unit/test_prompts_session_init.py -v` +Expected: ALL PASS (existing + new tests) + +- [ ] **Step 5: Commit** + +```bash +git add src/pmmcp/prompts/session_init.py tests/unit/test_prompts_session_init.py +git commit -m "feat: assertive coordinator guidance + Grafana preflight in session_init + +Moves coordinator reference from afterthought to IMPORTANT block at top. +Adds Grafana datasource discovery workflow with fallback cascade. +Per issue #10." +``` + +--- + +### Task 4: Add hierarchy context to specialist prompt docstring + +**Files:** +- Modify: `src/pmmcp/prompts/specialist.py:260-281` (docstring of `specialist_investigate`) +- Test: `tests/unit/test_prompts_specialist.py` + +- [ ] **Step 1: Write failing test for hierarchy context** + +Append to `tests/unit/test_prompts_specialist.py`: + +```python +def test_specialist_docstring_references_coordinator(): + """specialist_investigate docstring references coordinate_investigation as parent.""" + from pmmcp.prompts.specialist import specialist_investigate + + docstring = specialist_investigate.__doc__ + assert "coordinate_investigation" in docstring, ( + "specialist_investigate docstring must reference coordinate_investigation" + ) +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `uv run pytest tests/unit/test_prompts_specialist.py::test_specialist_docstring_references_coordinator -v` +Expected: FAIL — current docstring doesn't mention coordinate_investigation + +- [ ] **Step 3: Update specialist_investigate docstring** + +In `src/pmmcp/prompts/specialist.py`, update the `specialist_investigate` docstring: + +```python +@mcp.prompt() +def specialist_investigate( + subsystem: str, + request: str | None = None, + host: str | None = None, + time_of_interest: str | None = None, + lookback: str | None = None, +) -> list[dict]: + """Deep domain-expert investigation for a specific subsystem. + + Typically dispatched by ``coordinate_investigation`` as part of a parallel + 6-specialist sweep. For broad 'something is wrong' investigations, start + with the coordinator prompt instead of calling this directly. + + Each subsystem (cpu, memory, disk, network, process, crosscutting) + carries concrete investigation heuristics, metric relationships, and + interpretation guidance from an experienced performance engineer. + + Args: + subsystem: One of: cpu, memory, disk, network, process, crosscutting + request: What to investigate (e.g., "high latency") — optional + host: Target host (all hosts if omitted) — optional + time_of_interest: Centre of investigation window (default: now) — optional + lookback: Window size around time_of_interest (default: 2hours) — optional + """ + return _specialist_investigate_impl(subsystem, request, host, time_of_interest, lookback) +``` + +- [ ] **Step 4: Run test to verify it passes** + +Run: `uv run pytest tests/unit/test_prompts_specialist.py -v` +Expected: ALL PASS + +- [ ] **Step 5: Commit** + +```bash +git add src/pmmcp/prompts/specialist.py tests/unit/test_prompts_specialist.py +git commit -m "feat: add hierarchy context to specialist_investigate docstring + +Tells Claude this prompt is typically dispatched by the coordinator, +nudging toward coordinate_investigation for broad investigations." +``` + +--- + +### Task 5: Add Phase 3 visualisation to coordinator prompt + +**Files:** +- Modify: `src/pmmcp/prompts/coordinator.py` +- Modify: `tests/unit/test_prompts_coordinator.py` + +- [ ] **Step 1: Write failing tests for Phase 3 visualisation** + +Append to `tests/unit/test_prompts_coordinator.py`: + +```python +def test_coordinator_phase3_visualisation(): + """Coordinator includes Phase 3 visualisation guidance.""" + from pmmcp.prompts.coordinator import _coordinate_investigation_impl + + text = _coordinate_investigation_impl(request="app is slow")[0]["content"] + assert "Phase 3" in text or "phase 3" in text.lower(), ( + "Coordinator missing Phase 3" + ) + assert "dashboard" in text.lower(), ( + "Coordinator Phase 3 must reference dashboard creation" + ) + + +def test_coordinator_grafana_conventions(): + """Coordinator includes Grafana dashboard conventions from issue #10.""" + from pmmcp.prompts.coordinator import _coordinate_investigation_impl + + text = _coordinate_investigation_impl(request="app is slow")[0]["content"] + assert "pmmcp-triage" in text, "Missing folder convention" + assert "pmmcp-generated" in text, "Missing tag convention" + assert "YYYY-MM-DD" in text, "Missing naming convention" + + +def test_coordinator_visualisation_fallback_cascade(): + """Coordinator includes fallback cascade (Grafana -> HTML -> text).""" + from pmmcp.prompts.coordinator import _coordinate_investigation_impl + + text = _coordinate_investigation_impl(request="app is slow")[0]["content"].lower() + assert "fallback" in text or "unavailable" in text, ( + "Coordinator missing visualisation fallback cascade" + ) + + +def test_coordinator_deeplink_guidance(): + """Coordinator instructs returning a deeplink after dashboard creation.""" + from pmmcp.prompts.coordinator import _coordinate_investigation_impl + + text = _coordinate_investigation_impl(request="app is slow")[0]["content"].lower() + assert "deeplink" in text or "deep link" in text or "url" in text, ( + "Coordinator must instruct returning dashboard URL/deeplink" + ) +``` + +- [ ] **Step 2: Run tests to verify they fail** + +Run: `uv run pytest tests/unit/test_prompts_coordinator.py -v -k "phase3 or grafana_conventions or fallback_cascade or deeplink"` +Expected: FAIL — current coordinator has no Phase 3 + +- [ ] **Step 3: Add Phase 3 to coordinator prompt content** + +In `src/pmmcp/prompts/coordinator.py`, append Phase 3 after the `## Output Structure` section in the `content` f-string (before the closing `"""`): + +```python +## Phase 3 — Visualisation + +After synthesis, create a visual record of the investigation: + +### Grafana Dashboard (preferred) + +If `mcp-grafana` tools are available in this session: + +1. Create (or find) a folder named **`pmmcp-triage`** using `mcp-grafana.search_folders` / \ +`mcp-grafana.create_folder`. +2. Create a dashboard using `mcp-grafana.update_dashboard`: + - **Title**: `YYYY-MM-DD ` (e.g., `2026-03-10 memory cascade saas-prod-01`) + - **Tags**: always include `pmmcp-generated` + - **Folder**: `pmmcp-triage` + - **Panels**: one panel per key finding — memory, swap, CPU, disk, network as relevant. \ +Use the PCP datasource UID discovered during session preflight. +3. Call `mcp-grafana.generate_deeplink` and return the dashboard URL to the user. + +### HTML Fallback + +If mcp-grafana is unavailable, offer to generate a self-contained HTML report: +- Save to the configured report directory (default `~/.pmmcp/reports/`) +- Name: `YYYY-MM-DD-.html` +- Include investigation summary, data tables, and narrative + +### Text Fallback + +If the user declines both, the synthesised text output from Phase 2 stands on its own. + +### Auto-Trigger Heuristic + +If your investigation has surfaced findings across 3+ metrics or 2+ subsystems, and you \ +have not already created a visualisation, proactively offer to create a dashboard — don't \ +wait to be asked. +``` + +- [ ] **Step 4: Run tests to verify they pass** + +Run: `uv run pytest tests/unit/test_prompts_coordinator.py -v` +Expected: ALL PASS (existing + new tests) + +- [ ] **Step 5: Commit** + +```bash +git add src/pmmcp/prompts/coordinator.py tests/unit/test_prompts_coordinator.py +git commit -m "feat: add Phase 3 visualisation to coordinator prompt + +After synthesis, coordinator now instructs Claude to create a Grafana +dashboard (pmmcp-triage folder, YYYY-MM-DD naming, pmmcp-generated tag) +with fallback cascade to HTML or text. Per issue #10." +``` + +--- + +### Task 6: Update CLAUDE.md and README.md with new conventions + +**Files:** +- Modify: `CLAUDE.md` +- Modify: `README.md` (new env vars: `PMMCP_GRAFANA_FOLDER`, `PMMCP_REPORT_DIR`) + +- [ ] **Step 1: Add Grafana dashboard conventions section** + +In `CLAUDE.md`, add after the "Grafana Compose Gotchas" section: + +```markdown +## Grafana Dashboard Conventions (Investigation Output) + +When creating dashboards as part of an investigation: + +| Convention | Value | +|-----------|-------| +| Folder | `pmmcp-triage` (configurable via `PMMCP_GRAFANA_FOLDER`) | +| Naming | `YYYY-MM-DD ` (e.g., `2026-03-10 memory cascade saas-prod-01`) | +| Tagging | Always include `pmmcp-generated` | +| Deeplink | After creation, call `generate_deeplink` and return URL to user | +| Auto-trigger | Offer visualisation when findings span 3+ metrics or 2+ subsystems | + +## Investigation Prompt Hierarchy + +The investigation prompt hierarchy is: + +``` +session_init → coordinate_investigation → specialist_investigate (×6) +``` + +- **ALWAYS** start broad investigations with `coordinate_investigation` +- **DO NOT** call raw tools (`pcp_fetch_timeseries`, `pcp_detect_anomalies`) directly for open-ended investigations +- Specialist prompts are dispatched by the coordinator — don't call them directly unless targeting a specific subsystem +``` + +- [ ] **Step 2: Add new env vars to README.md configuration table** + +In `README.md`, find the configuration/environment variables section and add: + +| Variable | Default | Description | +|----------|---------|-------------| +| `PMMCP_GRAFANA_FOLDER` | `pmmcp-triage` | Grafana folder for investigation dashboards | +| `PMMCP_REPORT_DIR` | `~/.pmmcp/reports` | Output directory for HTML fallback reports | + +- [ ] **Step 3: Commit** + +```bash +git add CLAUDE.md README.md +git commit -m "docs: add investigation hierarchy, Grafana conventions, and new env vars" +``` + +--- + +### Task 7: Pre-push sanity check + +- [ ] **Step 1: Sync dev environment** + +Run: `uv sync --extra dev` + +- [ ] **Step 2: Run full test suite with coverage** + +Run: `uv run pytest --cov=pmmcp --cov-report=term-missing` +Expected: ALL PASS, coverage ≥80% + +- [ ] **Step 3: Run lint and format** + +Run: `uv run ruff check src/ tests/ && uv run ruff format --check src/ tests/` +Expected: No errors + +- [ ] **Step 4: Run pre-push sanity (or `just ci` in VM)** + +Run: `just ci` (VM) or `./pre-commit.sh` (host with podman) +Expected: All green + +- [ ] **Step 5: Push** + +Run: `git push` diff --git a/docs/superpowers/specs/2026-03-10-investigation-hierarchy-guardrails-design.md b/docs/superpowers/specs/2026-03-10-investigation-hierarchy-guardrails-design.md new file mode 100644 index 0000000..5a01874 --- /dev/null +++ b/docs/superpowers/specs/2026-03-10-investigation-hierarchy-guardrails-design.md @@ -0,0 +1,103 @@ +# Investigation Hierarchy Guardrails + +**Date:** 2026-03-10 +**Issue:** Partial implementation of #10 (External Integration Contracts) +**Problem:** Claude bypasses the coordinator prompt and calls raw pmmcp tools directly, missing the investigation workflow and Grafana visualisation step. + +## Context + +The investigation prompt hierarchy is: + +``` +session_init → coordinate_investigation → specialist_investigate (×6) +``` + +In practice, Claude skips the entire funnel and calls `pcp_quick_investigate`, `pcp_fetch_timeseries`, etc. directly. This means: + +1. No parallel specialist dispatch (slower, less thorough) +2. No structured synthesis with cross-subsystem correlation +3. No Grafana dashboard creation (the "paint a picture" gap) +4. No fallback cascade (Grafana → HTML → text) + +## Changes + +### A. Tool descriptions — coordinator breadcrumbs + +Add a one-liner to tool docstrings that the LLM reads at decision time: + +| Tool | Added line | +|------|-----------| +| `pcp_quick_investigate` | "For broad investigations, use the `coordinate_investigation` prompt instead — it dispatches 6 specialists in parallel." | +| `pcp_fetch_timeseries` | "For broad investigations, start with `coordinate_investigation` rather than fetching metrics directly." | +| `pcp_detect_anomalies` | Same pattern | +| `pcp_scan_changes` | Same pattern | + +These go at the end of the existing docstring, clearly separated. + +### B. `session_init` — assertive coordinator guidance + Grafana preflight + +Two additions: + +1. **Coordinator guidance** moves from a polite "Next Step" afterthought at the bottom to an **IMPORTANT** block near the top. Language: "ALWAYS use `coordinate_investigation` for broad performance investigations. Do NOT call individual tools or specialist prompts directly unless you have a specific, targeted question about a single metric." + +2. **Grafana preflight discovery** (per issue #10): + - Call `mcp-grafana.list_datasources` + - Find datasource of type `performancecopilot-*` + - Validate shared pmproxy URL + - Cache datasource UID in conversation context + - Log result (available / unavailable) + +### C. Specialist prompt descriptions — hierarchy context + +Add to `specialist_investigate` docstring: + +> "Typically dispatched by `coordinate_investigation` as part of a parallel 6-specialist sweep. For broad 'something is wrong' investigations, start with the coordinator instead." + +### D. `coordinate_investigation` — Phase 3 visualisation + +Add Phase 3 after synthesis (per issue #10): + +> "After synthesis, if mcp-grafana is available, create a triage dashboard in the configured folder (default `pmmcp-triage`), named `YYYY-MM-DD `, tagged `pmmcp-generated`, and return the deeplink to the user." + +Includes the fallback cascade: +1. Grafana available → dashboard + deeplink +2. Grafana unavailable → offer HTML report to `report_dir` +3. User declines both → text/table output (existing behaviour) + +Also includes the auto-trigger heuristic: "If your investigation has surfaced findings across 3+ metrics or 2+ subsystems, and you haven't already created a visualisation, consider offering one." + +### E. Config additions + +Add to `ServerConfig` (env prefix `PMMCP_`, not `PmproxyConfig` which is for pmproxy connection settings): + +| Setting | Env var | Default | Purpose | +|---------|---------|---------|---------| +| `grafana_folder` | `PMMCP_GRAFANA_FOLDER` | `pmmcp-triage` | Grafana folder for generated dashboards | +| `report_dir` | `PMMCP_REPORT_DIR` | `~/.pmmcp/reports/` | Output directory for HTML fallback reports | + +## Files Modified + +| File | Change | +|------|--------| +| `src/pmmcp/tools/timeseries.py` | Docstring additions to `pcp_fetch_timeseries` | +| `src/pmmcp/tools/discovery.py` | Docstring additions to `pcp_quick_investigate` | +| `src/pmmcp/tools/comparison.py` | Docstring additions to `pcp_detect_anomalies`, `pcp_scan_changes` | +| `src/pmmcp/prompts/session_init.py` | Assertive coordinator guidance + Grafana preflight | +| `src/pmmcp/prompts/specialist.py` | Hierarchy context in docstring | +| `src/pmmcp/prompts/coordinator.py` | Phase 3 visualisation + fallback cascade | +| `src/pmmcp/config.py` | `grafana_folder`, `report_dir` settings | +| `CLAUDE.md` | Document new conventions | + +## Files NOT Modified + +- No new tools (zero Python tool code beyond config) +- No Grafana JSON templates +- No new dependencies +- Specialist prompt `_SPECIALIST_KNOWLEDGE` content unchanged — only the `@mcp.prompt()` docstring and `specialist_investigate` description change + +## Testing Strategy + +- Unit tests for new config fields (defaults, env var override) +- Contract tests confirming prompt output includes new guidance text +- Existing tests remain unchanged (no behavioural changes to tools) +- Manual validation: run an investigation session and verify Claude follows the hierarchy diff --git a/grafana/provisioning/datasources/pcp.yaml b/grafana/provisioning/datasources/pcp.yaml new file mode 100644 index 0000000..6306bb9 --- /dev/null +++ b/grafana/provisioning/datasources/pcp.yaml @@ -0,0 +1,14 @@ +apiVersion: 1 +datasources: + - name: PCP Valkey + type: performancecopilot-valkey-datasource + access: proxy + url: http://pcp:44322 + isDefault: true + editable: true + + - name: PCP Vector + type: performancecopilot-vector-datasource + access: proxy + url: http://pcp:44322 + editable: true diff --git a/pre-commit.sh b/pre-commit.sh index 17db5f8..063441a 100755 --- a/pre-commit.sh +++ b/pre-commit.sh @@ -5,11 +5,6 @@ SCRIPT_DIR=$(cd "$(dirname "$0")" && pwd) cd "$SCRIPT_DIR" just ci - -if [ -n "${PMPROXY_URL:-}" ]; then - just e2e -else - echo " - e2e skipped (set PMPROXY_URL to run, or: just e2e)" -fi +just e2e echo "pre-commit passed" diff --git a/specs/012-grafana-compose/checklists/requirements.md b/specs/012-grafana-compose/checklists/requirements.md new file mode 100644 index 0000000..c2c538a --- /dev/null +++ b/specs/012-grafana-compose/checklists/requirements.md @@ -0,0 +1,36 @@ +# Specification Quality Checklist: Grafana & mcp-grafana Compose Integration + +**Purpose**: Validate specification completeness and quality before proceeding to planning +**Created**: 2026-03-10 +**Feature**: [spec.md](../spec.md) + +## Content Quality + +- [x] No implementation details (languages, frameworks, APIs) +- [x] Focused on user value and business needs +- [x] Written for non-technical stakeholders +- [x] All mandatory sections completed + +## Requirement Completeness + +- [x] No [NEEDS CLARIFICATION] markers remain +- [x] Requirements are testable and unambiguous +- [x] Success criteria are measurable +- [x] Success criteria are technology-agnostic (no implementation details) +- [x] All acceptance scenarios are defined +- [x] Edge cases are identified +- [x] Scope is clearly bounded +- [x] Dependencies and assumptions identified + +## Feature Readiness + +- [x] All functional requirements have clear acceptance criteria +- [x] User scenarios cover primary flows +- [x] Feature meets measurable outcomes defined in Success Criteria +- [x] No implementation details leak into specification + +## Notes + +- FR-003 and FR-008 mention specific Grafana configuration mechanisms (provisioning YAML, `GF_INSTALL_PLUGINS`) — these are borderline implementation detail but are domain-standard terminology that any stakeholder familiar with Grafana would recognise. Kept for clarity. +- The spec deliberately references `docker-compose.yml` and port numbers as these are part of the project's existing conventions, not new implementation choices. +- All items pass validation. Spec is ready for `/speckit.clarify` or `/speckit.plan`. diff --git a/specs/012-grafana-compose/contracts/compose-services.md b/specs/012-grafana-compose/contracts/compose-services.md new file mode 100644 index 0000000..2dbc8fc --- /dev/null +++ b/specs/012-grafana-compose/contracts/compose-services.md @@ -0,0 +1,78 @@ +# Compose Service Contracts: Grafana Integration + +**Date**: 2026-03-10 + +## New Services + +### grafana + +| Property | Value | +|----------|-------| +| Image | `grafana/grafana:latest` | +| Port | `3000:3000` | +| Depends on | `pcp` (service_started) | +| Healthcheck | `curl -sf http://localhost:3000/api/health` | +| Volumes | `./grafana/provisioning:/etc/grafana/provisioning:ro` | + +**Environment**: +``` +GF_INSTALL_PLUGINS=https://github.com/performancecopilot/grafana-pcp/releases/download/v5.3.0/performancecopilot-pcp-app-5.3.0.zip;performancecopilot-pcp-app +GF_PLUGINS_ALLOW_LOADING_UNSIGNED_PLUGINS=performancecopilot-pcp-app,performancecopilot-valkey-datasource,performancecopilot-vector-datasource,performancecopilot-bpftrace-datasource,performancecopilot-flamegraph-panel,performancecopilot-breadcrumbs-panel,performancecopilot-troubleshooting-panel +GF_AUTH_ANONYMOUS_ENABLED=true +GF_AUTH_ANONYMOUS_ORG_ROLE=Admin +GF_SECURITY_ADMIN_USER=admin +GF_SECURITY_ADMIN_PASSWORD=admin +``` + +### mcp-grafana + +| Property | Value | +|----------|-------| +| Image | `mcp/grafana` | +| Port | `8000:8000` | +| Depends on | `grafana` (service_healthy) | +| Transport | SSE (default for Docker image) | +| Endpoint | `/sse` on port 8000 | + +**Environment**: +``` +GRAFANA_URL=http://grafana:3000 +GRAFANA_USERNAME=admin +GRAFANA_PASSWORD=admin +``` + +## Provisioned Datasources + +File: `grafana/provisioning/datasources/pcp.yaml` + +```yaml +apiVersion: 1 +datasources: + - name: PCP Valkey + type: performancecopilot-valkey-datasource + access: proxy + url: http://pcp:44322 + isDefault: true + editable: true + + - name: PCP Vector + type: performancecopilot-vector-datasource + access: proxy + url: http://pcp:44322 + editable: true +``` + +## Service Dependency Chain + +``` +redis-stack (healthy) + ↓ +pmlogsynth-generator (completed) → pmlogsynth-seeder (completed) + ↓ +pcp (started) + ↓ +├── pmmcp (started) +└── grafana (healthy) + ↓ + mcp-grafana (started) +``` diff --git a/specs/012-grafana-compose/data-model.md b/specs/012-grafana-compose/data-model.md new file mode 100644 index 0000000..6b7a3cd --- /dev/null +++ b/specs/012-grafana-compose/data-model.md @@ -0,0 +1,56 @@ +# Data Model: Grafana & mcp-grafana Compose Integration + +**Date**: 2026-03-10 +**Branch**: `012-grafana-compose` + +## Overview + +This feature is infrastructure-only — no new application data entities are introduced. The "data model" here describes the compose service topology and configuration relationships. + +## Service Topology + +``` +┌─────────────────────┐ ┌──────────────┐ +│ pmlogsynth-generator│────▶│ pmmcp-archives│ (named volume) +└─────────────────────┘ └──────┬───────┘ + │ +┌──────────────┐ ┌──────▼───────┐ +│ redis-stack │◀───────────│pmlogsynth- │ +│ (6379) │ │seeder │ +└──────┬───────┘ └──────────────┘ + │ +┌──────▼───────┐ +│ pcp │◀─── KEY_SERVERS=redis-stack:6379 +│ (44322) │ +└──┬───────┬───┘ + │ │ + │ │ ┌──────────────┐ + │ └──▶│ grafana │◀── PCP datasource provisioned + │ │ (3000) │ via /etc/grafana/provisioning/ + │ └──────┬───────┘ + │ │ + │ ┌──────▼───────┐ + │ │ mcp-grafana │◀── GRAFANA_URL=http://grafana:3000 + │ │ (8000/sse) │ GRAFANA_USERNAME=admin + │ └──────────────┘ GRAFANA_PASSWORD=admin + │ +┌──▼───────────┐ +│ pmmcp │◀── PMPROXY_URL=http://pcp:44322 +│ (8080) │ +└──────────────┘ +``` + +## Configuration Relationships + +| Setting | Source | Consumers | Must Match | +|---------|--------|-----------|------------| +| pmproxy URL (`http://pcp:44322`) | `pcp` service | pmmcp, grafana PCP datasource | Yes — single source of truth | +| Redis/Valkey (`redis-stack:6379`) | `redis-stack` service | pcp, pmlogsynth-seeder | Yes | +| Grafana URL (`http://grafana:3000`) | `grafana` service | mcp-grafana | Yes | +| Grafana credentials (`admin/admin`) | Grafana env vars | mcp-grafana env vars | Yes | + +## New Files + +| File | Purpose | +|------|---------| +| `grafana/provisioning/datasources/pcp.yaml` | Grafana datasource provisioning — auto-configures PCP Valkey + Vector datasources | diff --git a/specs/012-grafana-compose/plan.md b/specs/012-grafana-compose/plan.md new file mode 100644 index 0000000..9330581 --- /dev/null +++ b/specs/012-grafana-compose/plan.md @@ -0,0 +1,62 @@ +# Implementation Plan: Grafana & mcp-grafana Compose Integration + +**Branch**: `012-grafana-compose` | **Date**: 2026-03-10 | **Spec**: [spec.md](spec.md) +**Input**: Feature specification from `/specs/012-grafana-compose/spec.md` + +## Summary + +Add Grafana (with auto-provisioned PCP datasources) and mcp-grafana to the existing docker-compose stack, enabling local visualisation and the foundation for Issue #10's prompt-driven Grafana integration. No new Python code — purely infrastructure (compose services, provisioning files, CI updates). + +## Technical Context + +**Language/Version**: N/A (infrastructure-only; compose YAML, Grafana provisioning YAML) +**Primary Dependencies**: `grafana/grafana:latest`, `mcp/grafana` (Docker Hub), `performancecopilot-pcp-app` plugin v5.3.0 +**Storage**: Ephemeral — no persistent volumes for Grafana +**Testing**: E2E smoke tests (curl-based healthchecks + datasource verification); existing pytest E2E suite must not regress +**Target Platform**: podman compose (local) + docker compose (CI, GitHub Actions) +**Project Type**: single (infrastructure addition to existing compose stack) +**Performance Goals**: All services healthy within 90 seconds of `compose up` +**Constraints**: Plugin is unsigned (requires `GF_PLUGINS_ALLOW_LOADING_UNSIGNED_PLUGINS`); mcp-grafana requires auth (basic auth with `admin/admin`) +**Scale/Scope**: 2 new compose services, 1 provisioning file, CI workflow update + +## Constitution Check + +*GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.* + +| Principle | Status | Notes / Justification | +|-----------|--------|----------------------| +| I. Code Quality — linting enforced, single-responsibility, complexity ≤ 10, peer review, documentation impact reviewed | PASS | No new Python code. Compose YAML and provisioning YAML are declarative. Documentation (README, CLAUDE.md, docker-compose.yml comments) will be updated in the same PR. | +| II. Testing Standards — TDD cycle, ≥ 80% unit coverage, contract tests on interface changes | PASS | No new Python code means no unit coverage impact. E2E smoke tests verify service health. Existing E2E tests must not regress (SC-005). TDD applies to any new pytest fixtures if added. | +| III. UX Consistency — design system adherence, WCAG 2.1 AA, actionable error messages | N/A | No user-facing UI changes. Grafana provides its own UX. | +| IV. Performance — latency SLA defined, performance budget in CI, profiling before optimization | PASS | SC-001 defines 90-second startup SLA. No runtime performance impact on pmmcp — Grafana and mcp-grafana are independent services. | +| V. Simplicity — YAGNI posture, no speculative abstractions, complexity justified below | PASS | Minimal additions: 2 services in compose, 1 provisioning file. No custom images, no init containers, no token bootstrapping — basic auth is the simplest working path. | + +## Project Structure + +### Documentation (this feature) + +```text +specs/012-grafana-compose/ +├── plan.md # This file +├── research.md # Phase 0: mcp-grafana, PCP plugin, auth, CI research +├── data-model.md # Service topology and configuration relationships +├── quickstart.md # Developer quickstart for the Grafana stack +├── contracts/ +│ └── compose-services.md # Service definitions, env vars, dependency chain +├── checklists/ +│ └── requirements.md # Spec quality checklist +└── tasks.md # Phase 2 output (created by /speckit.tasks) +``` + +### Source Code (repository root) + +```text +docker-compose.yml # Modified: add grafana + mcp-grafana services +grafana/ +└── provisioning/ + └── datasources/ + └── pcp.yaml # New: auto-provision PCP Valkey + Vector datasources +.github/workflows/ci.yml # Modified: add Grafana wait step in E2E job +``` + +**Structure Decision**: No new Python source files. The feature is entirely compose infrastructure + Grafana provisioning configuration. The only code changes are to `docker-compose.yml` and `.github/workflows/ci.yml`. diff --git a/specs/012-grafana-compose/quickstart.md b/specs/012-grafana-compose/quickstart.md new file mode 100644 index 0000000..e6249ec --- /dev/null +++ b/specs/012-grafana-compose/quickstart.md @@ -0,0 +1,64 @@ +# Quickstart: Grafana & mcp-grafana Compose Integration + +**Branch**: `012-grafana-compose` + +## Prerequisites + +- podman + podman-compose installed +- Existing pmmcp compose stack working (`podman compose up -d` passes healthchecks) + +## Start the Stack + +```bash +podman compose up -d --wait --wait-timeout 120 +``` + +This starts all services including Grafana and mcp-grafana. + +## Verify Grafana + +1. Open http://localhost:3000 in a browser (no login required — anonymous admin enabled) +2. Navigate to Connections → Data sources +3. Confirm "PCP Valkey" and "PCP Vector" datasources are listed and healthy + +Or via API: +```bash +curl -s http://localhost:3000/api/datasources | python3 -m json.tool +``` + +## Verify mcp-grafana + +mcp-grafana runs in SSE mode on port 8000: + +```bash +# Check mcp-grafana is responding +curl -s http://localhost:8000/sse +``` + +## Verify Shared Backend + +The PCP datasources in Grafana and pmmcp both target the same pmproxy (`http://pcp:44322`). To confirm: + +```bash +# Query a metric via pmproxy directly +curl -s "http://localhost:44322/pmapi/metric?names=kernel.all.load" + +# Same metric should be explorable in Grafana via PCP Vector datasource +``` + +## Teardown + +```bash +podman compose down --volumes +``` + +All Grafana state is ephemeral — dashboards and preferences are destroyed on teardown. + +## Troubleshooting + +| Symptom | Cause | Fix | +|---------|-------|-----| +| Grafana shows "plugin not found" | PCP plugin failed to download | Check container logs: `podman compose logs grafana` — network issue or wrong ZIP URL | +| PCP datasource unhealthy | pmproxy not ready when Grafana started | Wait and retry — Grafana auto-retries datasource connections | +| mcp-grafana "connection refused" | Grafana not healthy yet | mcp-grafana depends on Grafana healthcheck — check `podman compose ps` | +| Unsigned plugin error | Missing `GF_PLUGINS_ALLOW_LOADING_UNSIGNED_PLUGINS` | Verify the env var lists all PCP plugin IDs | diff --git a/specs/012-grafana-compose/research.md b/specs/012-grafana-compose/research.md new file mode 100644 index 0000000..4e39fdd --- /dev/null +++ b/specs/012-grafana-compose/research.md @@ -0,0 +1,105 @@ +# Research: Grafana & mcp-grafana Compose Integration + +**Date**: 2026-03-10 +**Branch**: `012-grafana-compose` + +## R1: mcp-grafana Container Image & Transport + +**Decision**: Use `mcp/grafana` image from Docker Hub in SSE mode (port 8000, path `/sse`). + +**Rationale**: The Docker image defaults to SSE transport, which is the natural fit for a compose service (unlike stdio which requires a parent process). SSE on port 8000 is the documented default. + +**Alternatives considered**: +- `stdio` transport — requires wrapping in a parent process, not suitable for compose service +- `streamable-http` — newer, production-oriented, but SSE is simpler and sufficient for dev/test +- Build from source — unnecessary, official image is available and maintained + +## R2: mcp-grafana Authentication + +**Decision**: Use basic auth (`GRAFANA_USERNAME=admin`, `GRAFANA_PASSWORD=admin`) for the dev/test compose stack. + +**Rationale**: mcp-grafana does **not** support anonymous access — it always requires one of: service account token, legacy API key, or basic auth. Basic auth with Grafana's default `admin/admin` is the simplest approach for a dev stack. No init container or API call needed to bootstrap a service account token. + +**Alternatives considered**: +- Service account token (`GRAFANA_SERVICE_ACCOUNT_TOKEN`) — requires creating the token after Grafana starts (init container or API call), adds complexity for no dev/test benefit +- Legacy API key (`GRAFANA_API_KEY`) — deprecated by Grafana, not recommended +- Anonymous admin with no auth — not supported by mcp-grafana + +**mcp-grafana environment variables**: +- `GRAFANA_URL` — Grafana instance URL (e.g., `http://grafana:3000`) +- `GRAFANA_USERNAME` — basic auth username +- `GRAFANA_PASSWORD` — basic auth password + +## R3: PCP Grafana Plugin Installation + +**Decision**: Install `performancecopilot-pcp-app` via `GF_INSTALL_PLUGINS` with direct ZIP URL from GitHub releases. Allow unsigned plugins via `GF_PLUGINS_ALLOW_LOADING_UNSIGNED_PLUGINS`. + +**Rationale**: The plugin is unsigned (community-maintained), so it's not available via the standard Grafana plugin catalog shorthand. The GitHub releases ZIP URL is the documented installation method. + +**Alternatives considered**: +- Custom Grafana image with plugin pre-installed — adds a build step and image maintenance burden +- Grafana plugin catalog shorthand — doesn't work for unsigned plugins + +**Configuration**: +``` +GF_INSTALL_PLUGINS=https://github.com/performancecopilot/grafana-pcp/releases/download/v5.3.0/performancecopilot-pcp-app-5.3.0.zip;performancecopilot-pcp-app +GF_PLUGINS_ALLOW_LOADING_UNSIGNED_PLUGINS=performancecopilot-pcp-app,performancecopilot-valkey-datasource,performancecopilot-vector-datasource,performancecopilot-bpftrace-datasource,performancecopilot-flamegraph-panel,performancecopilot-breadcrumbs-panel,performancecopilot-troubleshooting-panel +``` + +## R4: PCP Datasource Types & Provisioning + +**Decision**: Provision two datasources — `performancecopilot-valkey-datasource` (historical/timeseries) and `performancecopilot-vector-datasource` (live metrics). Both point at `http://pcp:44322`. + +**Rationale**: The Valkey datasource uses pmproxy's `/series/*` endpoints (backed by Redis/Valkey — already in our compose stack). The Vector datasource uses `/pmapi/*` for live metrics. Both are useful for the Issue #10 visualisation workflows. + +**Alternatives considered**: +- Valkey datasource only — misses live metric visualisation +- bpftrace datasource — requires bpftrace PMDA, out of scope + +**Provisioning YAML** (`grafana/provisioning/datasources/pcp.yaml`): +```yaml +apiVersion: 1 +datasources: + - name: PCP Valkey + type: performancecopilot-valkey-datasource + access: proxy + url: http://pcp:44322 + isDefault: true + editable: true + + - name: PCP Vector + type: performancecopilot-vector-datasource + access: proxy + url: http://pcp:44322 + editable: true +``` + +## R5: Grafana Version & Configuration + +**Decision**: Use `grafana/grafana:latest` with anonymous auth enabled for browser access, plus basic auth credentials for mcp-grafana API access. + +**Rationale**: Plugin requires Grafana >=9.0.9. Using `latest` keeps us current. Anonymous auth for browser means no login prompt for developers, while mcp-grafana uses basic auth for API calls. Grafana supports both simultaneously. + +**Alternatives considered**: +- Pin specific version (e.g., `grafana/grafana:11.x`) — better for reproducibility but adds maintenance; can pin later if instability appears +- Disable anonymous auth — forces login, friction for dev workflow + +**Grafana env vars**: +``` +GF_AUTH_ANONYMOUS_ENABLED=true +GF_AUTH_ANONYMOUS_ORG_ROLE=Admin +GF_SECURITY_ADMIN_USER=admin +GF_SECURITY_ADMIN_PASSWORD=admin +``` + +## R6: CI Parity Approach + +**Decision**: Add Grafana and mcp-grafana to the CI E2E job using the same `docker compose up -d` approach already used for pcp/redis. No separate service containers needed. + +**Rationale**: The CI workflow already uses `docker compose up -d --wait` to start the full compose stack. Adding Grafana and mcp-grafana to `docker-compose.yml` means CI automatically picks them up. This maintains the parity convention. + +**Alternatives considered**: +- GitHub Actions service containers — would diverge from compose topology, exactly the anti-pattern documented in CLAUDE.md +- Separate Grafana compose file — unnecessary fragmentation + +**CI changes**: Minimal — add a "Wait for Grafana" step after "Wait for pmproxy" to confirm Grafana is healthy before E2E tests run. Also set `GRAFANA_URL=http://localhost:3000` env var for any Grafana-specific E2E tests. diff --git a/specs/012-grafana-compose/spec.md b/specs/012-grafana-compose/spec.md new file mode 100644 index 0000000..cc75b68 --- /dev/null +++ b/specs/012-grafana-compose/spec.md @@ -0,0 +1,101 @@ +# Feature Specification: Grafana & mcp-grafana Compose Integration + +**Feature Branch**: `012-grafana-compose` +**Created**: 2026-03-10 +**Status**: Draft +**Input**: User description: "Issue #10 - we'll definitely need to add in Grafana and mcp-grafana into our docker-compose setup." +**Related**: [Issue #10 — External Integration Contracts (Grafana MCP, reporting strategy)](https://github.com/tallpsmith/pmmcp/issues/10) + +## User Scenarios & Testing *(mandatory)* + +### User Story 1 — Grafana with PCP Datasource Available Out of the Box (Priority: P1) + +A developer runs `podman compose up -d` and gets a fully wired Grafana instance with the PCP datasource plugin pre-configured and pointing at the same pmproxy that pmmcp uses. No manual Grafana setup required — the datasource is provisioned automatically. + +**Why this priority**: Without a pre-configured Grafana instance, nothing in Issue #10's prompt-driven visualisation workflow can function. This is the foundation. + +**Independent Test**: After `podman compose up -d`, navigate to Grafana's datasource API and confirm the PCP datasource exists, is healthy, and targets the correct pmproxy URL. + +**Acceptance Scenarios**: + +1. **Given** the compose stack is running, **When** a user opens Grafana in a browser (`http://localhost:3000`), **Then** Grafana is accessible without additional setup and the PCP datasource is listed and healthy. +2. **Given** the compose stack is running, **When** a user queries Grafana's datasource health endpoint for the PCP datasource, **Then** it reports success (pmproxy is reachable). +3. **Given** the compose stack is running, **When** a user explores metrics via the PCP datasource in Grafana, **Then** the same metrics visible through pmmcp's tools are available (they share the same pmproxy backend). + +--- + +### User Story 2 — mcp-grafana Available as a Compose Service (Priority: P2) + +A developer can start mcp-grafana alongside pmmcp so that Claude (or integration tests) can orchestrate between both MCP servers. mcp-grafana connects to the same Grafana instance from Story 1. + +**Why this priority**: mcp-grafana is the mechanism Claude uses to create dashboards. Without it in the compose stack, the prompt-driven visualisation from Issue #10 cannot be tested end-to-end locally. + +**Independent Test**: After `podman compose up -d`, confirm mcp-grafana is running and can communicate with the Grafana instance (e.g., list datasources via mcp-grafana's tools). + +**Acceptance Scenarios**: + +1. **Given** the compose stack is running, **When** a client connects to the mcp-grafana service, **Then** mcp-grafana responds and can list the Grafana datasources (including the PCP datasource). +2. **Given** the compose stack is running, **When** mcp-grafana is used to create a dashboard, **Then** the dashboard appears in Grafana and can query PCP metrics via the provisioned datasource. + +--- + +### User Story 3 — CI Parity for Grafana Services (Priority: P3) + +The GitHub Actions CI workflow is updated so that the Grafana + mcp-grafana services are available during E2E tests, maintaining the local/CI parity rule documented in CLAUDE.md. + +**Why this priority**: Grafana integration tests that pass locally but fail in CI are worse than no tests. Parity is a project convention and must be maintained, but it can follow the initial compose work. + +**Independent Test**: CI E2E job starts the Grafana services and a smoke test confirms the PCP datasource is healthy before Grafana-specific E2E tests run. + +**Acceptance Scenarios**: + +1. **Given** a CI run triggers the E2E job, **When** the Grafana-related services start, **Then** the PCP datasource health check passes before E2E tests execute. +2. **Given** the compose file adds new services, **When** a developer compares `docker-compose.yml` with the CI workflow, **Then** the same service topology is reflected in both. + +--- + +### Edge Cases + +- What happens when Grafana starts before pmproxy is ready? The PCP datasource health check should fail initially and recover once pmproxy is available — Grafana retries datasource connections. +- What happens when the PCP datasource plugin is not available in the Grafana image? The compose configuration must ensure the plugin is installed (via `GF_INSTALL_PLUGINS` environment variable or a custom image). +- What happens when mcp-grafana cannot reach Grafana? mcp-grafana should report connection errors clearly; compose `depends_on` with a Grafana healthcheck ensures ordering. +- What happens when `podman compose down --volumes` is run? All Grafana state (dashboards, preferences) is ephemeral and destroyed — this is expected for a dev/test stack. + +## Requirements *(mandatory)* + +### Functional Requirements + +- **FR-001**: The compose stack MUST include a Grafana service accessible on a well-known port (`3000`). +- **FR-002**: The Grafana service MUST have the `performancecopilot-pcp-app` datasource plugin installed and enabled. +- **FR-003**: The PCP datasource MUST be auto-provisioned (via Grafana provisioning files) pointing at the compose stack's pmproxy instance (`http://pcp:44322`). +- **FR-004**: The Grafana service MUST have a healthcheck that confirms Grafana is ready to serve requests. +- **FR-005**: The Grafana service MUST depend on the `pcp` service so the datasource has a backend to connect to. +- **FR-006**: The compose stack MUST include an mcp-grafana service configured to connect to the Grafana instance. +- **FR-007**: The mcp-grafana service MUST depend on the Grafana service being healthy before starting. +- **FR-008**: Grafana MUST use anonymous authentication with admin privileges for the dev/test stack (no login required). +- **FR-009**: The PCP datasource provisioning MUST use the same pmproxy URL that pmmcp is configured with, ensuring a single source of truth. +- **FR-010**: The CI workflow MUST be updated to include Grafana and mcp-grafana services matching the compose topology. + +### Key Entities + +- **Grafana Service**: The visualisation platform, pre-configured with the PCP datasource plugin and provisioned datasource. Ephemeral state (no persistent volumes for dev/test). +- **PCP Datasource**: A Grafana datasource of type `performancecopilot-pcp-app` that connects to pmproxy, auto-provisioned via Grafana's file-based provisioning. +- **mcp-grafana Service**: The Grafana MCP server ([grafana/mcp-grafana](https://github.com/grafana/mcp-grafana)) providing Claude with tools for dashboard CRUD, queries, annotations, and deeplinks. + +## Success Criteria *(mandatory)* + +### Measurable Outcomes + +- **SC-001**: A fresh `podman compose up -d` results in all services healthy within 90 seconds, including Grafana with a working PCP datasource. +- **SC-002**: The PCP datasource in Grafana can query at least one metric that is also queryable via pmmcp tools (proving shared backend). +- **SC-003**: mcp-grafana can list datasources and create a test dashboard programmatically after compose stack startup. +- **SC-004**: CI E2E tests that exercise Grafana services pass with the same success rate as local runs. +- **SC-005**: No existing E2E tests break as a result of adding the new services. + +## Assumptions + +- The `performancecopilot-pcp-app` Grafana plugin is available for installation via `GF_INSTALL_PLUGINS` or is bundled in a suitable Grafana image. +- The `grafana/mcp-grafana` project publishes a container image (or can be built from source) suitable for inclusion in the compose stack. +- Anonymous admin access is acceptable for a local dev/test compose stack (not a production configuration). +- Grafana state is ephemeral — no persistent volumes needed; dashboards are recreated as needed. +- The mcp-grafana service uses SSE or stdio transport; compose configuration will expose the appropriate port/protocol. diff --git a/specs/012-grafana-compose/tasks.md b/specs/012-grafana-compose/tasks.md new file mode 100644 index 0000000..179538e --- /dev/null +++ b/specs/012-grafana-compose/tasks.md @@ -0,0 +1,187 @@ +# Tasks: Grafana & mcp-grafana Compose Integration + +**Input**: Design documents from `/specs/012-grafana-compose/` +**Prerequisites**: plan.md (required), spec.md (required for user stories), research.md, data-model.md, contracts/ + +**Tests**: Per the project constitution (Principle II — Testing Standards, NON-NEGOTIABLE), tests +are mandatory. This feature has no new Python application code, but E2E tests MUST verify that the +compose infrastructure works correctly. E2E pytest tests will use `httpx` to verify Grafana and +mcp-grafana service health after compose stack startup. + +**Organization**: Tasks are grouped by user story to enable independent implementation and testing of each story. + +## Format: `[ID] [P?] [Story] Description` + +- **[P]**: Can run in parallel (different files, no dependencies) +- **[Story]**: Which user story this task belongs to (e.g., US1, US2, US3) +- Include exact file paths in descriptions + +## Phase 1: Setup (Shared Infrastructure) + +**Purpose**: Create provisioning directory structure and files needed by Grafana + +- [x] T001 Create provisioning directory at grafana/provisioning/datasources/ +- [x] T002 Create PCP datasource provisioning file at grafana/provisioning/datasources/pcp.yaml with Valkey + Vector datasources pointing at http://pcp:44322 (per research.md R4) + +--- + +## Phase 2: Foundational (Blocking Prerequisites) + +**Purpose**: No blocking prerequisites — provisioning file is created in Phase 1, compose changes are per-story + +**⚠️ CRITICAL**: Phase 1 must be complete before user stories begin (provisioning file is mounted by the Grafana service) + +**Checkpoint**: Provisioning files ready — user story implementation can now begin + +--- + +## Phase 3: User Story 1 — Grafana with PCP Datasource (Priority: P1) 🎯 MVP + +**Goal**: `podman compose up -d` gives a fully wired Grafana instance with PCP datasources auto-provisioned + +**Independent Test**: After compose up, Grafana API at `http://localhost:3000/api/datasources` returns the PCP Valkey and Vector datasources with healthy status + +### Tests for User Story 1 *(required per Principle II — Testing Standards)* + +> **NOTE: Write these tests FIRST, ensure they FAIL before implementation (Red-Green-Refactor)** + +- [x] T003 [P] [US1] E2E test: Grafana accessible and healthy — verify GET http://localhost:3000/api/health returns 200 in tests/e2e/test_grafana.py +- [x] T004 [P] [US1] E2E test: PCP datasources provisioned — verify GET http://localhost:3000/api/datasources returns both "PCP Valkey" and "PCP Vector" entries in tests/e2e/test_grafana.py +- [x] T005 [P] [US1] E2E test: PCP Valkey datasource healthy — verify datasource health check endpoint confirms pmproxy connectivity in tests/e2e/test_grafana.py + +### Implementation for User Story 1 + +- [x] T006 [US1] Add `grafana` service to docker-compose.yml with image `grafana/grafana:latest`, port 3000, GF_INSTALL_PLUGINS (ZIP URL per research.md R3), GF_PLUGINS_ALLOW_LOADING_UNSIGNED_PLUGINS, anonymous auth env vars (per research.md R5), healthcheck (`curl -sf http://localhost:3000/api/health`), volume mount `./grafana/provisioning:/etc/grafana/provisioning:ro`, depends_on `pcp` service_started +- [x] T007 [US1] Verify: run `podman compose up -d --wait`, confirm Grafana container is healthy, run E2E tests from T003-T005 + +**Checkpoint**: Grafana is accessible with working PCP datasources — MVP complete + +--- + +## Phase 4: User Story 2 — mcp-grafana Service (Priority: P2) + +**Goal**: mcp-grafana runs alongside Grafana, connected via basic auth, accessible on port 8000/sse + +**Independent Test**: After compose up, mcp-grafana SSE endpoint at `http://localhost:8000/sse` responds + +### Tests for User Story 2 *(required per Principle II — Testing Standards)* + +> **NOTE: Write these tests FIRST, ensure they FAIL before implementation (Red-Green-Refactor)** + +- [x] T008 [P] [US2] E2E test: mcp-grafana SSE endpoint responds — verify GET http://localhost:8000/sse returns a valid SSE connection in tests/e2e/test_grafana.py + +### Implementation for User Story 2 + +- [x] T009 [US2] Add `mcp-grafana` service to docker-compose.yml with image `mcp/grafana`, port 8000, GRAFANA_URL=http://grafana:3000, GRAFANA_USERNAME=admin, GRAFANA_PASSWORD=admin, depends_on `grafana` service_healthy +- [x] T010 [US2] Verify: run `podman compose up -d --wait`, confirm mcp-grafana container is running, run E2E test from T008 + +**Checkpoint**: Both Grafana and mcp-grafana are running and wired together + +--- + +## Phase 5: User Story 3 — CI Parity (Priority: P3) + +**Goal**: GitHub Actions E2E job starts Grafana services and verifies them before running E2E tests + +**Independent Test**: Push branch, CI E2E job starts compose stack with Grafana services, existing E2E tests pass plus new Grafana smoke tests pass + +### Tests for User Story 3 *(required per Principle II — Testing Standards)* + +> **NOTE: CI parity is validated by the CI run itself — no additional pytest tests needed beyond those already written in US1/US2. The "test" here is that CI passes.** + +### Implementation for User Story 3 + +- [x] T011 [US3] Add "Wait for Grafana" step to .github/workflows/ci.yml E2E job after the "Wait for pmproxy" step — poll `http://localhost:3000/api/health` until healthy (same pattern as pmproxy wait) +- [x] T012 [US3] Add `GRAFANA_URL: http://localhost:3000` to the env section of the E2E job in .github/workflows/ci.yml +- [x] T013 [US3] Verify: run existing E2E tests locally with compose stack to confirm no regressions (SC-005) + +**Checkpoint**: CI workflow includes Grafana services and passes all tests + +--- + +## Phase 6: Polish & Cross-Cutting Concerns + +**Purpose**: Documentation updates and final validation + +- [x] T014 [P] Update README.md — add Grafana and mcp-grafana to the services table, document ports (3000, 8000), add Grafana-specific quickstart section +- [x] T015 [P] Update CLAUDE.md — add Grafana compose gotchas (unsigned plugin, basic auth requirement, GF_INSTALL_PLUGINS ZIP URL pattern) to "E2E Container Gotchas" section +- [x] T016 [P] Add inline comments to docker-compose.yml for the new grafana and mcp-grafana services explaining key env vars and the auth decision +- [x] T017 Run quickstart.md validation — execute all steps from specs/012-grafana-compose/quickstart.md against a fresh compose stack and confirm they work + +--- + +## Dependencies & Execution Order + +### Phase Dependencies + +- **Setup (Phase 1)**: No dependencies — can start immediately +- **Foundational (Phase 2)**: N/A — Phase 1 covers setup +- **User Story 1 (Phase 3)**: Depends on Phase 1 (provisioning files exist) +- **User Story 2 (Phase 4)**: Depends on US1 (mcp-grafana needs the grafana service to exist in compose) +- **User Story 3 (Phase 5)**: Depends on US1 + US2 (CI must test the full compose topology) +- **Polish (Phase 6)**: Depends on all user stories being complete + +### User Story Dependencies + +- **User Story 1 (P1)**: Can start after Phase 1 — no dependencies on other stories +- **User Story 2 (P2)**: Depends on US1 — mcp-grafana `depends_on` the grafana service +- **User Story 3 (P3)**: Depends on US1 + US2 — CI must reflect the complete compose topology + +### Within Each User Story + +- Tests MUST be written and FAIL before implementation +- Write compose service definition +- Verify with `podman compose up -d --wait` + run E2E tests +- Commit after each task or logical group + +### Parallel Opportunities + +- T003, T004, T005 (US1 tests) can run in parallel — different test functions, same file +- T014, T015, T016 (Polish docs) can run in parallel — different files +- US1 and US2 tests (T003-T005, T008) can be written in parallel before any implementation + +--- + +## Parallel Example: User Story 1 + +```bash +# Write all US1 tests in parallel (they go in the same file but are independent functions): +Task: "E2E test: Grafana accessible and healthy in tests/e2e/test_grafana.py" +Task: "E2E test: PCP datasources provisioned in tests/e2e/test_grafana.py" +Task: "E2E test: PCP Valkey datasource healthy in tests/e2e/test_grafana.py" + +# Then implement (sequential — single file): +Task: "Add grafana service to docker-compose.yml" +Task: "Verify: compose up + run E2E tests" +``` + +--- + +## Implementation Strategy + +### MVP First (User Story 1 Only) + +1. Complete Phase 1: Create provisioning files +2. Complete Phase 3: Grafana + PCP datasources in compose +3. **STOP and VALIDATE**: `podman compose up -d`, open http://localhost:3000, verify datasources +4. Commit and push — MVP is usable + +### Incremental Delivery + +1. Phase 1 (Setup) → provisioning files committed +2. US1 (Grafana + PCP datasources) → compose up gives working Grafana → commit + push +3. US2 (mcp-grafana) → compose up gives both services → commit + push +4. US3 (CI parity) → CI passes with full topology → commit + push +5. Polish → docs updated, quickstart validated → commit + push + +--- + +## Notes + +- [P] tasks = different files, no dependencies +- [Story] label maps task to specific user story for traceability +- This feature has **no new Python application code** — all tasks are infrastructure (YAML, CI workflow) +- E2E tests use `httpx` to hit Grafana/mcp-grafana REST APIs from within the existing E2E test framework +- E2E tests are gated by `PMPROXY_URL` env var (same as existing E2E tests) plus new `GRAFANA_URL` +- The PCP plugin is **unsigned** — requires both `GF_INSTALL_PLUGINS` with ZIP URL and `GF_PLUGINS_ALLOW_LOADING_UNSIGNED_PLUGINS` +- Commit after each task or logical group; follow story-by-story development loop from CLAUDE.md diff --git a/src/pmmcp/config.py b/src/pmmcp/config.py index bc561f5..c7fd98a 100644 --- a/src/pmmcp/config.py +++ b/src/pmmcp/config.py @@ -26,5 +26,7 @@ class ServerConfig(BaseSettings): transport: Literal["stdio", "streamable-http"] = "stdio" host: str = "127.0.0.1" port: int = 8080 + grafana_folder: str = "pmmcp-triage" + report_dir: Path = Path("~/.pmmcp/reports") model_config = {"env_prefix": "PMMCP_"} diff --git a/src/pmmcp/prompts/coordinator.py b/src/pmmcp/prompts/coordinator.py index 69d9161..64f00be 100644 --- a/src/pmmcp/prompts/coordinator.py +++ b/src/pmmcp/prompts/coordinator.py @@ -149,6 +149,41 @@ def _coordinate_investigation_impl( - Process: ... - Cross-cutting: ... ``` + +## Phase 3 — Visualisation + +After synthesis, create a visual record of the investigation: + +### Grafana Dashboard (preferred) + +If `mcp-grafana` tools are available in this session: + +1. Create (or find) a folder named **`pmmcp-triage`** using `mcp-grafana.search_folders` / \ +`mcp-grafana.create_folder`. +2. Create a dashboard using `mcp-grafana.update_dashboard`: + - **Title**: `YYYY-MM-DD ` (e.g., `2026-03-10 memory cascade saas-prod-01`) + - **Tags**: always include `pmmcp-generated` + - **Folder**: `pmmcp-triage` + - **Panels**: one panel per key finding — memory, swap, CPU, disk, network as relevant. \ +Use the PCP datasource UID discovered during session preflight. +3. Call `mcp-grafana.generate_deeplink` and return the dashboard URL to the user. + +### HTML Fallback + +If mcp-grafana is unavailable, offer to generate a self-contained HTML report: +- Save to the configured report directory (default `~/.pmmcp/reports/`) +- Name: `YYYY-MM-DD-.html` +- Include investigation summary, data tables, and narrative + +### Text Fallback + +If the user declines both, the synthesised text output from Phase 2 stands on its own. + +### Auto-Trigger Heuristic + +If your investigation has surfaced findings across 3+ metrics or 2+ subsystems, and you \ +have not already created a visualisation, proactively offer to create a dashboard — don't \ +wait to be asked. """ return [{"role": "user", "content": content}] diff --git a/src/pmmcp/prompts/session_init.py b/src/pmmcp/prompts/session_init.py index 2802c0b..9ff73fe 100644 --- a/src/pmmcp/prompts/session_init.py +++ b/src/pmmcp/prompts/session_init.py @@ -51,9 +51,27 @@ def _session_init_impl( content = f"""\ You are initialising a PCP monitoring session{host_clause}{timerange_clause}. -Register the following canonical derived metrics using `pcp_derive_metric`, \ -then verify each is available with `pcp_fetch_live`. Report results without aborting \ -if any metric fails — note failures and continue. +## IMPORTANT — Investigation Entry Point + +**ALWAYS** use the `coordinate_investigation` prompt for broad performance investigations. \ +It dispatches 6 specialist sub-agents (cpu, memory, disk, network, process, crosscutting) \ +in parallel, then synthesises findings into a unified root-cause narrative with \ +cross-subsystem correlation. **Do NOT** call individual tools (`pcp_fetch_timeseries`, \ +`pcp_detect_anomalies`, etc.) or specialist prompts directly unless you have a specific, \ +targeted question about a single known metric. + +## Grafana Preflight — Datasource Discovery + +Before any investigation, check whether Grafana visualisation is available: + +1. Call `mcp-grafana.list_datasources` to enumerate configured datasources. +2. Look for a datasource of type `performancecopilot-valkey-datasource` or \ +`performancecopilot-vector-datasource`. +3. If found, note its **UID** and **URL** for later dashboard creation. +4. **Match**: Grafana features enabled — investigation prompts will create dashboards \ +in the `pmmcp-triage` folder, named `YYYY-MM-DD `, tagged `pmmcp-generated`. +5. **No match / no mcp-grafana**: Grafana unavailable — fall back to text output or \ +offer an HTML report. This is not an error; pmmcp works fully standalone. ## Step 1 — Register Derived Metrics @@ -87,13 +105,6 @@ def _session_init_impl( **Do not abort if one or more verifications fail.** Report which metrics are available \ and which are not, so downstream investigations know which derived metrics can be used. - -## Next Step — Investigation - -For broad performance investigations, use the `coordinate_investigation` prompt. \ -It dispatches 6 specialist sub-agents (cpu, memory, disk, network, process, crosscutting) \ -in parallel, then synthesises findings into a unified root-cause narrative. \ -This is the recommended entry point for any "something is wrong" investigation. """ return [{"role": "user", "content": content}] diff --git a/src/pmmcp/prompts/specialist.py b/src/pmmcp/prompts/specialist.py index cc9edac..a7262cd 100644 --- a/src/pmmcp/prompts/specialist.py +++ b/src/pmmcp/prompts/specialist.py @@ -267,6 +267,10 @@ def specialist_investigate( ) -> list[dict]: """Deep domain-expert investigation for a specific subsystem. + Typically dispatched by ``coordinate_investigation`` as part of a parallel + 6-specialist sweep. For broad 'something is wrong' investigations, start + with the coordinator prompt instead of calling this directly. + Each subsystem (cpu, memory, disk, network, process, crosscutting) carries concrete investigation heuristics, metric relationships, and interpretation guidance from an experienced performance engineer. diff --git a/src/pmmcp/tools/anomaly.py b/src/pmmcp/tools/anomaly.py index 48993fe..6c00ab6 100644 --- a/src/pmmcp/tools/anomaly.py +++ b/src/pmmcp/tools/anomaly.py @@ -140,6 +140,9 @@ async def pcp_detect_anomalies( z_score_threshold: Z-score above which a deviation is anomalous (default 2.0) host: Target hostname (empty means all hosts) interval: Sampling interval ('auto' selects based on baseline window duration) + + Note: For broad investigations, start with the ``coordinate_investigation`` prompt + rather than running anomaly detection directly — it orchestrates a full multi-subsystem sweep. """ return await _detect_anomalies_impl( get_client(), diff --git a/src/pmmcp/tools/investigate.py b/src/pmmcp/tools/investigate.py index 4e7e701..e2cab8f 100644 --- a/src/pmmcp/tools/investigate.py +++ b/src/pmmcp/tools/investigate.py @@ -204,6 +204,10 @@ async def pcp_quick_investigate( baseline_days: Number of days before the comparison window to use as the baseline for anomaly detection. host: Target host. Empty = default pmproxy host. + + Note: For broad 'something is wrong' investigations spanning multiple subsystems, + prefer the ``coordinate_investigation`` prompt — it dispatches 6 specialist + sub-agents in parallel for comprehensive coverage. """ return await _quick_investigate_impl( get_client(), diff --git a/src/pmmcp/tools/scanning.py b/src/pmmcp/tools/scanning.py index a66ba8a..e727227 100644 --- a/src/pmmcp/tools/scanning.py +++ b/src/pmmcp/tools/scanning.py @@ -144,6 +144,9 @@ async def pcp_scan_changes( max_metrics: Maximum number of changed metrics to return (default 50). For exploration use 50; increase to 200+ for full scan. interval: Sampling interval ('auto' selects based on baseline window duration) + + Note: For broad investigations, start with the ``coordinate_investigation`` prompt + rather than scanning changes directly — it orchestrates a full multi-subsystem sweep. """ return await _scan_changes_impl( get_client(), diff --git a/src/pmmcp/tools/timeseries.py b/src/pmmcp/tools/timeseries.py index b35348c..4b6aa92 100644 --- a/src/pmmcp/tools/timeseries.py +++ b/src/pmmcp/tools/timeseries.py @@ -154,6 +154,9 @@ async def pcp_fetch_timeseries( limit: Maximum data points per metric/instance (default 500) offset: Pagination offset expr: Raw PCP series expression (overrides names if provided) + + Note: For broad investigations, start with the ``coordinate_investigation`` prompt + rather than fetching metrics directly — it orchestrates a full multi-subsystem sweep. """ return await _fetch_timeseries_impl( get_client(), diff --git a/tests/e2e/test_grafana.py b/tests/e2e/test_grafana.py new file mode 100644 index 0000000..9228508 --- /dev/null +++ b/tests/e2e/test_grafana.py @@ -0,0 +1,90 @@ +"""E2E tests for Grafana and mcp-grafana compose services. + +Requires: GRAFANA_URL set and a running Grafana + PCP stack. +Gating: tests skip when GRAFANA_URL is not set. + +Run locally: + podman compose up -d --wait + PMPROXY_URL=http://localhost:44322 GRAFANA_URL=http://localhost:3000 \ + uv run pytest tests/e2e/test_grafana.py -m e2e +""" + +from __future__ import annotations + +import os + +import httpx +import pytest + +GRAFANA_URL = os.environ.get("GRAFANA_URL") + +pytestmark = [ + pytest.mark.e2e, + pytest.mark.skipif(not GRAFANA_URL, reason="GRAFANA_URL not set"), +] + + +# --------------------------------------------------------------------------- +# US1: Grafana with PCP datasources +# --------------------------------------------------------------------------- + + +def test_grafana_health(): + """T003 — Grafana is accessible and healthy.""" + resp = httpx.get(f"{GRAFANA_URL}/api/health", timeout=10) + assert resp.status_code == 200 + body = resp.json() + assert body.get("database") == "ok" + + +def test_pcp_datasources_provisioned(): + """T004 — PCP Valkey and PCP Vector datasources are provisioned.""" + resp = httpx.get(f"{GRAFANA_URL}/api/datasources", timeout=10) + assert resp.status_code == 200 + datasources = resp.json() + names = {ds["name"] for ds in datasources} + assert "PCP Valkey" in names, f"Missing PCP Valkey; got {names}" + assert "PCP Vector" in names, f"Missing PCP Vector; got {names}" + + +def test_pcp_valkey_datasource_healthy(): + """T005 — PCP Valkey datasource can reach pmproxy.""" + # First get the datasource ID + resp = httpx.get(f"{GRAFANA_URL}/api/datasources", timeout=10) + assert resp.status_code == 200 + datasources = resp.json() + valkey_ds = next((ds for ds in datasources if ds["name"] == "PCP Valkey"), None) + assert valkey_ds is not None, "PCP Valkey datasource not found" + + # Health check via datasource proxy — tests pmproxy connectivity + ds_id = valkey_ds["id"] + health_resp = httpx.get( + f"{GRAFANA_URL}/api/datasources/{ds_id}/health", + timeout=15, + ) + # 200 = healthy, but even a non-500 confirms Grafana can talk to the datasource + assert health_resp.status_code == 200, ( + f"Datasource health check failed: {health_resp.status_code} {health_resp.text}" + ) + + +# --------------------------------------------------------------------------- +# US2: mcp-grafana SSE endpoint +# --------------------------------------------------------------------------- + +MCP_GRAFANA_URL = os.environ.get("MCP_GRAFANA_URL", "http://localhost:8000") + + +@pytest.mark.skipif( + not os.environ.get("MCP_GRAFANA_URL") and not GRAFANA_URL, + reason="MCP_GRAFANA_URL not set", +) +def test_mcp_grafana_sse_responds(): + """T008 — mcp-grafana SSE endpoint accepts connections.""" + # SSE endpoint should return a streaming response; we just verify it connects + with httpx.stream("GET", f"{MCP_GRAFANA_URL}/sse", timeout=10) as resp: + assert resp.status_code == 200 + content_type = resp.headers.get("content-type", "") + assert "text/event-stream" in content_type, ( + f"Expected text/event-stream, got {content_type}" + ) diff --git a/tests/unit/test_config.py b/tests/unit/test_config.py new file mode 100644 index 0000000..5dca3b7 --- /dev/null +++ b/tests/unit/test_config.py @@ -0,0 +1,33 @@ +"""Unit tests for config fields.""" + +from __future__ import annotations + +from pathlib import Path + +from pmmcp.config import ServerConfig + + +def test_server_config_grafana_folder_default(): + """PMMCP_GRAFANA_FOLDER defaults to 'pmmcp-triage'.""" + cfg = ServerConfig() + assert cfg.grafana_folder == "pmmcp-triage" + + +def test_server_config_grafana_folder_env_override(monkeypatch): + """PMMCP_GRAFANA_FOLDER can be overridden via env var.""" + monkeypatch.setenv("PMMCP_GRAFANA_FOLDER", "my-triage") + cfg = ServerConfig() + assert cfg.grafana_folder == "my-triage" + + +def test_server_config_report_dir_default(): + """PMMCP_REPORT_DIR defaults to ~/.pmmcp/reports/.""" + cfg = ServerConfig() + assert cfg.report_dir == Path("~/.pmmcp/reports") + + +def test_server_config_report_dir_env_override(monkeypatch): + """PMMCP_REPORT_DIR can be overridden via env var.""" + monkeypatch.setenv("PMMCP_REPORT_DIR", "/tmp/reports") + cfg = ServerConfig() + assert cfg.report_dir == Path("/tmp/reports") diff --git a/tests/unit/test_prompts_coordinator.py b/tests/unit/test_prompts_coordinator.py index 58d7dbf..670c704 100644 --- a/tests/unit/test_prompts_coordinator.py +++ b/tests/unit/test_prompts_coordinator.py @@ -165,3 +165,47 @@ def test_coordinator_recurring_pattern_highlight(): assert "RECURRING" in text or "recurring" in text, ( "Coordinator missing recurring pattern reference" ) + + +# --------------------------------------------------------------------------- +# Phase 3 visualisation tests +# --------------------------------------------------------------------------- + + +def test_coordinator_phase3_visualisation(): + """Coordinator includes Phase 3 visualisation guidance.""" + from pmmcp.prompts.coordinator import _coordinate_investigation_impl + + text = _coordinate_investigation_impl(request="app is slow")[0]["content"] + assert "Phase 3" in text or "phase 3" in text.lower(), "Coordinator missing Phase 3" + assert "dashboard" in text.lower(), "Coordinator Phase 3 must reference dashboard creation" + + +def test_coordinator_grafana_conventions(): + """Coordinator includes Grafana dashboard conventions from issue #10.""" + from pmmcp.prompts.coordinator import _coordinate_investigation_impl + + text = _coordinate_investigation_impl(request="app is slow")[0]["content"] + assert "pmmcp-triage" in text, "Missing folder convention" + assert "pmmcp-generated" in text, "Missing tag convention" + assert "YYYY-MM-DD" in text, "Missing naming convention" + + +def test_coordinator_visualisation_fallback_cascade(): + """Coordinator includes fallback cascade (Grafana -> HTML -> text).""" + from pmmcp.prompts.coordinator import _coordinate_investigation_impl + + text = _coordinate_investigation_impl(request="app is slow")[0]["content"].lower() + assert "fallback" in text or "unavailable" in text, ( + "Coordinator missing visualisation fallback cascade" + ) + + +def test_coordinator_deeplink_guidance(): + """Coordinator instructs returning a deeplink after dashboard creation.""" + from pmmcp.prompts.coordinator import _coordinate_investigation_impl + + text = _coordinate_investigation_impl(request="app is slow")[0]["content"].lower() + assert "deeplink" in text or "deep link" in text or "url" in text, ( + "Coordinator must instruct returning dashboard URL/deeplink" + ) diff --git a/tests/unit/test_prompts_session_init.py b/tests/unit/test_prompts_session_init.py index e8d6677..ab56438 100644 --- a/tests/unit/test_prompts_session_init.py +++ b/tests/unit/test_prompts_session_init.py @@ -99,3 +99,45 @@ def test_session_init_references_coordinate_investigation(): assert "coordinate_investigation" in full_text, ( "session_init must reference coordinate_investigation as investigation entry point" ) + + +def test_session_init_assertive_coordinator_language(): + """session_init uses assertive language (ALWAYS/DO NOT) for coordinator guidance.""" + result = _session_init_impl() + full_text = " ".join(msg["content"] for msg in result) + upper_text = full_text.upper() + assert "ALWAYS" in upper_text or "DO NOT" in upper_text or "MUST" in upper_text, ( + "session_init must use assertive language directing to coordinate_investigation" + ) + + +def test_session_init_coordinator_before_derived_metrics(): + """Coordinator guidance appears before the derived metrics registration steps.""" + result = _session_init_impl() + full_text = " ".join(msg["content"] for msg in result) + coord_pos = full_text.find("coordinate_investigation") + derive_pos = full_text.find("Step 1") + assert coord_pos < derive_pos, ( + "Coordinator guidance must appear before Step 1 (derived metrics)" + ) + + +def test_session_init_grafana_preflight_references(): + """session_init includes Grafana preflight discovery workflow.""" + result = _session_init_impl() + full_text = " ".join(msg["content"] for msg in result) + assert "list_datasources" in full_text, ( + "session_init must reference list_datasources for Grafana preflight" + ) + assert "performancecopilot" in full_text.lower(), ( + "session_init must reference PCP datasource type for validation" + ) + + +def test_session_init_grafana_fallback_cascade(): + """session_init includes fallback cascade when Grafana is unavailable.""" + result = _session_init_impl() + full_text = " ".join(msg["content"] for msg in result).lower() + assert "fallback" in full_text or "unavailable" in full_text, ( + "session_init must describe fallback when Grafana is unavailable" + ) diff --git a/tests/unit/test_prompts_specialist.py b/tests/unit/test_prompts_specialist.py index 6712c4f..65599a1 100644 --- a/tests/unit/test_prompts_specialist.py +++ b/tests/unit/test_prompts_specialist.py @@ -415,3 +415,13 @@ def test_graceful_degradation_report_limitation(): assert "insufficient baseline" in text, ( f"{sub}: missing 'insufficient baseline' limitation note" ) + + +def test_specialist_docstring_references_coordinator(): + """specialist_investigate docstring references coordinate_investigation as parent.""" + from pmmcp.prompts.specialist import specialist_investigate + + docstring = specialist_investigate.__doc__ + assert "coordinate_investigation" in docstring, ( + "specialist_investigate docstring must reference coordinate_investigation" + )