Merged
21 commits
d497407 feat: add Grafana PCP datasource provisioning files (tallpsmith, Mar 10, 2026)
92a0073 test: add Grafana and mcp-grafana E2E tests (RED) (tallpsmith, Mar 10, 2026)
19f03f6 feat: add grafana + mcp-grafana services to compose stack (tallpsmith, Mar 10, 2026)
b2e2765 feat: add Grafana wait step and GRAFANA_URL to CI E2E job (tallpsmith, Mar 10, 2026)
2456f89 docs: add Grafana services to README and CLAUDE.md gotchas (tallpsmith, Mar 10, 2026)
fcdc516 docs: add 012-grafana-compose spec artifacts (tallpsmith, Mar 10, 2026)
5445063 fix: pmmcp healthcheck — use CMD-SHELL to avoid semicolon splitting (tallpsmith, Mar 10, 2026)
83d5982 docs: add podman CMD-SHELL semicolon gotcha to CLAUDE.md (tallpsmith, Mar 10, 2026)
d650ade docs: mark all 012-grafana-compose tasks complete (tallpsmith, Mar 10, 2026)
df6717c fix: make e2e mandatory in pre-commit and add Grafana env vars (tallpsmith, Mar 10, 2026)
450b2ec docs: add VM-aware pre-push guidance to CLAUDE.md (tallpsmith, Mar 10, 2026)
d1375fa docs: investigation hierarchy guardrails design spec (tallpsmith, Mar 10, 2026)
5ccc7aa docs: fix reviewer issues in design spec and implementation plan (tallpsmith, Mar 10, 2026)
d0e36ef feat: add grafana_folder and report_dir config fields (tallpsmith, Mar 10, 2026)
97043eb feat: add coordinator breadcrumbs to tool docstrings (tallpsmith, Mar 10, 2026)
0f632bb feat: assertive coordinator guidance + Grafana preflight in session_init (tallpsmith, Mar 10, 2026)
d40b3e5 feat: add hierarchy context to specialist_investigate docstring (tallpsmith, Mar 10, 2026)
134815e feat: add Phase 3 visualisation to coordinator prompt (tallpsmith, Mar 10, 2026)
125d5eb docs: add investigation hierarchy, Grafana conventions, and new env vars (tallpsmith, Mar 10, 2026)
755699f style: auto-format test_prompts_coordinator.py (tallpsmith, Mar 10, 2026)
46daca6 Add teardown and `doit` to Justfile (tallpsmith, Mar 10, 2026)
9 changes: 9 additions & 0 deletions .github/workflows/ci.yml
@@ -102,6 +102,7 @@ jobs:
     timeout-minutes: 10
     env:
       PMPROXY_URL: http://localhost:44322
+      GRAFANA_URL: http://localhost:3000
 
     steps:
       - uses: actions/checkout@v4
@@ -125,6 +126,14 @@ jobs:
             sleep 2
           done
 
+      - name: Wait for Grafana
+        run: |
+          echo "Waiting for Grafana..."
+          for i in $(seq 1 30); do
+            curl -sf http://localhost:3000/api/health && break
+            sleep 2
+          done
+
       - name: Test (E2E)
         run: uv run pytest -m e2e --junitxml=results-e2e.xml
 
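Review note on the `Wait for Grafana` step: the loop polls `/api/health` up to 30 times, but as written it appears to exit 0 even when Grafana never answers (the last command run is `sleep 2`), so a dead Grafana would only surface later as E2E failures. A minimal Python sketch of the same bounded polling that reports the timeout explicitly; `grafana_is_healthy` and `wait_until` are hypothetical helper names, not part of this PR.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def grafana_is_healthy(url: str = "http://localhost:3000/api/health") -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until(
    probe: Callable[[], bool],
    attempts: int = 30,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` up to `attempts` times; return False if it never succeeds."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:  # no point sleeping after the final try
            sleep(delay)
    return False
```

A caller would typically fail the step on timeout, for example `if not wait_until(grafana_is_healthy): raise SystemExit("Grafana did not become healthy")`. In the shell version, a similar effect could be had by checking the endpoint once more after the loop and exiting non-zero if it still fails.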
52 changes: 49 additions & 3 deletions CLAUDE.md
@@ -96,6 +96,39 @@ podman compose down
 
 - PCP image **requires `privileged: true`** — it uses systemd as PID 1; without it the container exits immediately (code 255)
 - Redis host env var is **`KEY_SERVERS: redis-stack:6379`** (NOT `PCP_REDIS_HOST`) — that's what the container entrypoint reads; wrong value causes pmproxy to hang on all series/search calls
+- **Podman splits `CMD` array args on semicolons** — Python one-liners with `;` get mangled. Always use `CMD-SHELL` for healthchecks containing semicolons: `["CMD-SHELL", "python -c 'import foo; foo.bar()'"]`
+
+## Grafana Compose Gotchas
+
+- PCP plugin is **unsigned** — must use `GF_INSTALL_PLUGINS` with the GitHub release ZIP URL, not the Grafana catalog shorthand
+- All PCP sub-plugins must be listed in `GF_PLUGINS_ALLOW_LOADING_UNSIGNED_PLUGINS` (app, valkey-datasource, vector-datasource, bpftrace-datasource, flamegraph-panel, breadcrumbs-panel, troubleshooting-panel)
+- mcp-grafana **requires authentication** — it doesn't support anonymous access. Basic auth (`admin/admin`) is simplest for the dev stack
+- Grafana healthcheck uses `curl -sf http://localhost:3000/api/health` — the container must have curl installed (official image does)
+- Datasources are auto-provisioned from `grafana/provisioning/datasources/pcp.yaml` — mounted read-only into the container
+
+## Grafana Dashboard Conventions (Investigation Output)
+
+When creating dashboards as part of an investigation:
+
+| Convention | Value |
+|-----------|-------|
+| Folder | `pmmcp-triage` (configurable via `PMMCP_GRAFANA_FOLDER`) |
+| Naming | `YYYY-MM-DD <short summary>` (e.g., `2026-03-10 memory cascade saas-prod-01`) |
+| Tagging | Always include `pmmcp-generated` |
+| Deeplink | After creation, call `generate_deeplink` and return URL to user |
+| Auto-trigger | Offer visualisation when findings span 3+ metrics or 2+ subsystems |
+
+## Investigation Prompt Hierarchy
+
+The investigation prompt hierarchy is:
+
+```
+session_init → coordinate_investigation → specialist_investigate (×6)
+```
+
+- **ALWAYS** start broad investigations with `coordinate_investigation`
+- **DO NOT** call raw tools (`pcp_fetch_timeseries`, `pcp_detect_anomalies`) directly for open-ended investigations
+- Specialist prompts are dispatched by the coordinator — don't call them directly unless targeting a specific subsystem
 
 ## CI / Local E2E Parity — CRITICAL
 
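Review note on the "Grafana Dashboard Conventions" table added in this hunk: the naming, tagging, and auto-trigger rules are simple enough to express as code. The sketch below is an illustration under assumptions, not the PR's implementation; the function names `dashboard_title`, `dashboard_tags`, and `should_offer_visualisation` are hypothetical, and only the title format, the `pmmcp-generated` tag, and the 3+ metrics / 2+ subsystems thresholds come from the table.

```python
from __future__ import annotations

from datetime import date

GENERATED_TAG = "pmmcp-generated"  # tag the conventions table says to always include


def dashboard_title(day: date, summary: str) -> str:
    """Build a title in the `YYYY-MM-DD <short summary>` form."""
    return f"{day.isoformat()} {summary.strip()}"


def dashboard_tags(extra: list[str] | None = None) -> list[str]:
    """Put the generated tag first, then any caller-supplied tags, without duplicates."""
    tags = [GENERATED_TAG]
    for tag in extra or []:
        if tag not in tags:
            tags.append(tag)
    return tags


def should_offer_visualisation(metrics: set[str], subsystems: set[str]) -> bool:
    """Auto-trigger rule: findings span 3+ metrics or 2+ subsystems."""
    return len(metrics) >= 3 or len(subsystems) >= 2
```

For example, `dashboard_title(date(2026, 3, 10), "memory cascade saas-prod-01")` yields the title used as the example in the table.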
@@ -211,16 +244,27 @@ Rules:
 
 **Mandatory before any `git push`** (required by Constitution v1.2.0, Principle II).
 
-Run either:
+The full check runs: lint → format → unit+integration tests (≥80% coverage) → E2E tests (compose stack + container healthchecks).
+
+**If you are Claude running in a VM** (no podman/docker available):
+- Run `just ci` as a minimum — this covers lint, format, and unit+integration tests
+- Do **not** attempt `pre-push-sanity.sh`, `just e2e`, or any `podman compose` commands — they will fail without a container runtime
+- Prompt the user to run the full suite on their host before pushing:
+```
+⚠️ I've run `just ci` (lint + tests) — all green.
+Please run `./pre-commit.sh` or `just e2e` on your host to complete E2E validation before pushing.
+```
+
+**If you have container access** (or the user is running directly):
 ```bash
-scripts/pre-push-sanity.sh
+./pre-commit.sh
 ```
 or invoke the Claude skill:
 ```
 /pre-push-sanity
 ```
 
-The check runs in order: lint → format → unit+integration tests (≥80% coverage) → E2E tests (starts compose stack automatically via `just e2e`). E2E is **never skipped** — the compose stack must be buildable and all containers must pass healthchecks before tests run.
+E2E is **never skipped** by humans — the compose stack must be buildable and all containers must pass healthchecks before tests run.
 <!-- MANUAL ADDITIONS END -->
 
 ## Active Technologies
@@ -232,6 +276,8 @@ The check runs in order: lint → format → unit+integration tests (≥80% cove
 - Python 3.11+ + `mcp[cli]` ≥1.2.0 (FastMCP), `pydantic` v2.x — no new dependencies (010-specialist-agents)
 - N/A — prompts are stateless text generators (010-specialist-agents)
 - Python 3.11+ + `mcp[cli]` ≥1.2.0 (FastMCP), `pydantic` v2.x — no new dependencies (011-specialist-baselining)
+- N/A (infrastructure-only; compose YAML, Grafana provisioning YAML) + `grafana/grafana:latest`, `mcp/grafana` (Docker Hub), `performancecopilot-pcp-app` plugin v5.3.0 (012-grafana-compose)
+- Ephemeral — no persistent volumes for Grafana (012-grafana-compose)
 
 ## Recent Changes
 - 002-add-integration-e2e-tests: Added Python 3.11+ + `mcp[cli]` ≥1.26.0 (FastMCP + ClientSession), `anyio` (memory streams), `respx` (already present — mocks httpx for integration tير), `pytest-asyncio` (already present)
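Review note on the VM-aware pre-push guidance in CLAUDE.md: the choice between `just ci` and the full check hinges on whether a container runtime is available. The guidance does not say how to detect that, so the sketch below assumes a simple PATH lookup for `podman` or `docker`, and assumes `./pre-commit.sh` is the full check. `container_runtime` and `pre_push_command` are hypothetical names.

```python
import shutil
from typing import Callable, List, Optional

Which = Callable[[str], Optional[str]]


def container_runtime(which: Which = shutil.which) -> Optional[str]:
    """Return the first container runtime found on PATH, or None."""
    for name in ("podman", "docker"):
        if which(name):
            return name
    return None


def pre_push_command(which: Which = shutil.which) -> List[str]:
    """Pick the full check when a runtime exists, else the `just ci` minimum."""
    if container_runtime(which) is None:
        return ["just", "ci"]  # lint, format, unit+integration tests only
    return ["./pre-commit.sh"]  # full check, including E2E
```

Passing `which` as a parameter keeps the decision testable without a real runtime installed. A binary on PATH does not prove the runtime is usable, so a stricter check might also run something like `podman info`.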
10 changes: 9 additions & 1 deletion Justfile
@@ -35,5 +35,13 @@ ci: check test
 # Uses --wait to match CI behaviour — all containers must be healthy before tests run
 e2e:
     PROFILES_DIR=./profiles/e2e podman compose up -d --wait --wait-timeout 120
-    PMPROXY_URL=http://localhost:44322 uv run python -m pytest -m e2e -q
+    PMPROXY_URL=http://localhost:44322 GRAFANA_URL=http://localhost:3000 MCP_GRAFANA_URL=http://localhost:8000 uv run python -m pytest -m e2e -q
     @echo "Stack still running — run 'podman compose down --volumes' to purge seeded data before next run"
+
+# Brings up the full stack, seeded (not e2e)
+doit:
+    podman compose up -d --wait --wait-timeout 120
+
+# Removes all containers and their volumes for a clean state
+teardown:
+    podman compose down --volumes
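Review note on the `e2e` recipe: it now passes three endpoint variables (`PMPROXY_URL`, `GRAFANA_URL`, `MCP_GRAFANA_URL`) to pytest. How the E2E tests read them is not shown in this diff; the sketch below is one plausible shape, with the recipe's URLs used as assumed defaults. `DEFAULTS` and `e2e_endpoints` are hypothetical names.

```python
import os
from typing import Dict, Mapping

# Assumed defaults: the URLs the `e2e` recipe passes on the command line.
DEFAULTS: Dict[str, str] = {
    "PMPROXY_URL": "http://localhost:44322",
    "GRAFANA_URL": "http://localhost:3000",
    "MCP_GRAFANA_URL": "http://localhost:8000",
}


def e2e_endpoints(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Resolve each endpoint from the environment, falling back to its default.

    Trailing slashes are stripped so callers can append paths uniformly.
    """
    return {name: env.get(name, default).rstrip("/") for name, default in DEFAULTS.items()}
```

A test fixture could call `e2e_endpoints()` once per session and build URLs such as `endpoints["GRAFANA_URL"] + "/api/health"`.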
18 changes: 14 additions & 4 deletions README.md
@@ -19,12 +19,14 @@ pmproxy's time-series backend, and has everything ready for Claude to analyse.
 podman compose up -d
 ```
 
-This runs four services in order:
+This runs six services in order:
 
 1. **pmlogsynth-generator** — generates PCP archives from `profiles/scenarios/saas-diurnal-week.yml`
 2. **redis-stack** — time-series backend (Valkey/Redis, port 6379)
 3. **pmlogsynth-seeder** — loads the archives into the time-series store
 4. **pcp** — pmcd + pmproxy, ready to serve queries (port 44322)
+5. **grafana** — Grafana with PCP plugin and auto-provisioned datasources (port 3000)
+6. **mcp-grafana** — MCP server for Grafana, SSE transport (port 8000)
 
 The generator and seeder are one-shot jobs; allow ~30–60 seconds for them to complete.
 Check progress with:
@@ -39,7 +41,13 @@ Once seeded, verify data is queryable:
 curl -s "http://localhost:44322/series/query?expr=kernel.all.cpu.user" | head -c 200
 ```
 
-### 2. Connect pmmcp to Claude Code
+### 2. Browse Grafana dashboards
+
+Open http://localhost:3000 — no login required (anonymous admin is enabled). Navigate to **Connections → Data sources** to see the auto-provisioned PCP Valkey (historical) and PCP Vector (live) datasources.
+
+The `mcp-grafana` service exposes a Grafana MCP server at http://localhost:8000/sse for AI agents that need to create dashboards or query Grafana programmatically.
+
+### 3. Connect pmmcp to Claude Code
 
 ```bash
 git clone <repository-url>
@@ -62,7 +70,7 @@ Add to `.mcp.json` in your project root (or `~/.claude/mcp.json` for global conf
 
 Restart Claude Code (or `/mcp` to reload) and confirm **pmmcp** appears in the connected servers list.
 
-### 3. Ask Claude to investigate
+### 4. Ask Claude to investigate
 
 The seeded dataset is `saas-prod-01` — a simulated production host with a week of
 realistic diurnal traffic. Try these to get a feel for what pmmcp can do:
@@ -93,7 +101,7 @@ Compare the morning peak to the overnight baseline on saas-prod-01 across CPU, m
 /investigate_subsystem subsystem=cpu host=saas-prod-01
 ```
 
-### 4. Tear down when done
+### 5. Tear down when done
 
 ```bash
 podman compose down --volumes
@@ -189,6 +197,8 @@ See **Running pmmcp** below for all CLI flags and environment variables.
 | `PMMCP_TRANSPORT` | `stdio` | MCP transport mode |
 | `PMMCP_HOST` | `127.0.0.1` | Bind host for HTTP transport |
 | `PMMCP_PORT` | `8080` | Bind port for HTTP transport |
+| `PMMCP_GRAFANA_FOLDER` | `pmmcp-triage` | Grafana folder for investigation dashboards |
+| `PMMCP_REPORT_DIR` | `~/.pmmcp/reports` | Output directory for HTML fallback reports |
 
 **Precedence:** CLI flag > environment variable > default.
 
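Review note on the README settings table: the stated precedence (CLI flag > environment variable > default) applies to the two new variables as well. How pmmcp implements it is not shown in this diff, so the following is only a sketch of the rule itself; `resolve_setting` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    cli_value: Optional[str],
    env_name: str,
    default: str,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Apply the documented precedence: CLI flag > environment variable > default."""
    if cli_value is not None:
        return cli_value
    if env_name in env:
        return env[env_name]
    return default
```

For `PMMCP_REPORT_DIR`, whose default is `~/.pmmcp/reports`, a caller would presumably expand the tilde afterwards, for example with `pathlib.Path(value).expanduser()`.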
44 changes: 42 additions & 2 deletions docker-compose.yml
@@ -78,7 +78,47 @@ services:
       pcp:
         condition: service_started
     healthcheck:
-      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8080/healthcheck')"]
+      # Liveness only — confirms HTTP server accepts connections (ignores pmproxy status)
+      test: ["CMD-SHELL", "python -c 'import socket; socket.create_connection((\"localhost\",8080),2).close()'"]
       interval: 10s
       timeout: 5s
-      retries: 3
+      retries: 6
+      start_period: 10s
+
+  # Grafana with PCP plugin — browse http://localhost:3000 (anonymous admin, no login)
+  grafana:
+    image: grafana/grafana:latest
+    ports:
+      - "3000:3000"
+    environment:
+      # PCP plugin is unsigned — install from GitHub release ZIP and allow all its sub-plugins
+      GF_INSTALL_PLUGINS: "https://github.com/performancecopilot/grafana-pcp/releases/download/v5.3.0/performancecopilot-pcp-app-5.3.0.zip;performancecopilot-pcp-app"
+      GF_PLUGINS_ALLOW_LOADING_UNSIGNED_PLUGINS: "performancecopilot-pcp-app,performancecopilot-valkey-datasource,performancecopilot-vector-datasource,performancecopilot-bpftrace-datasource,performancecopilot-flamegraph-panel,performancecopilot-breadcrumbs-panel,performancecopilot-troubleshooting-panel"
+      # Anonymous admin for browser access; basic auth creds for mcp-grafana API access
+      GF_AUTH_ANONYMOUS_ENABLED: "true"
+      GF_AUTH_ANONYMOUS_ORG_ROLE: Admin
+      GF_SECURITY_ADMIN_USER: admin
+      GF_SECURITY_ADMIN_PASSWORD: admin
+    volumes:
+      - ./grafana/provisioning:/etc/grafana/provisioning:ro
+    depends_on:
+      pcp:
+        condition: service_started
+    healthcheck:
+      test: ["CMD", "curl", "-sf", "http://localhost:3000/api/health"]
+      interval: 10s
+      timeout: 5s
+      retries: 12
+
+  # mcp-grafana — MCP server for Grafana, SSE transport on http://localhost:8000/sse
+  mcp-grafana:
+    image: mcp/grafana
+    ports:
+      - "8000:8000"
+    environment:
+      GRAFANA_URL: http://grafana:3000
+      GRAFANA_USERNAME: admin
+      GRAFANA_PASSWORD: admin
+    depends_on:
+      grafana:
+        condition: service_healthy
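Review note on the pmmcp healthcheck change: the old check issued an HTTP request to `/healthcheck`, while the new one only opens a TCP connection to port 8080, which matches the "liveness only" comment. Expanded from the one-liner into a readable function, the probe looks roughly like this; `port_is_open` is a hypothetical name, and the host, port, and 2-second timeout are taken from the compose file.

```python
import socket


def port_is_open(host: str = "localhost", port: int = 8080, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        socket.create_connection((host, port), timeout).close()
        return True
    except OSError:  # refused, unreachable, or timed out
        return False
```

The trade-off is that a TCP connect only proves the server is accepting connections; it says nothing about whether the application behind it can reach pmproxy, which the diff comment acknowledges.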