feat(memory): scrub high-entropy secrets before persisting extracted memories + indexing VectorDB #2899

## What happened

While debugging an OV (OpenViking memory server) configuration, an API key was pasted into a Claude Code conversation that OV's auto-capture hook records. OV captured the full message stream into the session's raw `messages.jsonl`, which is expected. The memory extraction pipeline then **copied the key verbatim into a durable structured memory** under `data/viking/.../memories/events/<date>/<name>.md` and indexed it into the vector store.

Result: the secret became **retrievable across future sessions** via `find` / `search` — a vector recall for an unrelated token surfaced the memory and returned the plaintext key into the LLM context.

## Expected

The extraction pipeline's quality gates (the read-before-write + 6-gate filter that decides what becomes a durable memory) should prevent secrets from surviving into curated memories and the vector index.

Raw session capture containing a secret is expected — that's raw conversation, and redacting there would alter the raw-capture contract. But a **curated, durable, vector-indexed memory** containing a secret contradicts the gates' purpose: the whole point of the filter is to discard what should not be persisted.

## Repro

1. Run a session where one message contains a test-shaped secret, e.g. `sk-test-FAKE0123456789abcdef`.
2. Trigger extraction (session commit / `extract` / `SessionEnd`).
3. Inspect `data/viking/.../memories/events/<date>/` — an extracted memory contains the secret verbatim.
4. `search` / `find` for a token that appears near the secret — the vector store returns that memory, surfacing the secret into context.
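
Step 3 can be checked mechanically. The helper below is a minimal sketch, not part of OV: `find_leaked_memories` is a hypothetical name, and the memories root is passed in as a parameter because the real path is elided above.

```python
from pathlib import Path


def find_leaked_memories(memories_root, secret):
    """Return sorted paths of .md files under memories_root containing `secret` verbatim.

    Hypothetical helper for the repro; it only reads files and never modifies them.
    """
    hits = []
    for path in Path(memories_root).rglob("*.md"):
        # errors="replace" keeps the scan going if a file has invalid UTF-8.
        text = path.read_text(encoding="utf-8", errors="replace")
        if secret in text:
            hits.append(path)
    return sorted(hits)
```

After step 2, calling it with the events directory of your deployment and the test secret from step 1 should return at least one path if the leak reproduces.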

## Proposal

Add a **secret-scrub gate between LLM extraction and persistence**:

- Regex patterns for common secret shapes:
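  - `sk-[A-Za-z0-9_-]{16,}` (OpenAI-style; the class includes `-` and `_` so that prefixed keys such as the repro's `sk-test-...` value match, which a strictly alphanumeric class would miss)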
  - `AQ[A-Za-z0-9_-]{20,}` (Gemini-style)
  - `xox[baprs]-[A-Za-z0-9-]+` (Slack)
  - `Bearer\s+[A-Za-z0-9._-]+`
  - high-entropy hex / base64 of length ≥ 32
- On match in an extracted memory body: replace with `REDACTED_SECRET` (or skip vector-indexing that memory, or flag it for review).
- Configurable via env / `ov.conf`: pattern list + on/off, so users can extend for internal key shapes; default conservative to avoid false-positive scrubbing of legitimate tokens / UUIDs / commit SHAs.
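
A minimal sketch of what such a gate could look like, assuming it runs on the extracted memory body as a plain string. `scrub_secrets` and `SECRET_PATTERNS` are hypothetical names, not existing OV APIs. The `sk-` pattern here accepts `-` and `_` in the key body and is anchored with `\b`; both are assumptions, chosen so that the repro's test value matches and words such as `task-...` do not. The entropy-based rule is left out of this sketch.

```python
import re

# Regex shapes taken from the list above (illustrative defaults only).
SECRET_PATTERNS = [
    r"\bsk-[A-Za-z0-9_-]{16,}",      # OpenAI-style (assumption: "-" and "_" allowed)
    r"AQ[A-Za-z0-9_-]{20,}",         # Gemini-style
    r"xox[baprs]-[A-Za-z0-9-]+",     # Slack
    r"Bearer\s+[A-Za-z0-9._-]+",     # bearer tokens
]

PLACEHOLDER = "REDACTED_SECRET"


def compile_patterns(patterns):
    """Join all patterns into one alternation so the text is scanned in a single pass."""
    return re.compile("|".join(f"(?:{p})" for p in patterns))


def scrub_secrets(body, patterns=SECRET_PATTERNS):
    """Return (scrubbed_body, match_count) for an extracted memory body."""
    combined = compile_patterns(patterns)
    return combined.subn(PLACEHOLDER, body)
```

A caller could use the returned count to decide between persisting the scrubbed body, skipping vector indexing, or flagging the memory for review. Scanning with a single combined pattern avoids a later pattern re-matching text that an earlier one already replaced.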

## Scope

**Extraction layer only.** Session-layer raw capture (`messages.jsonl`) intentionally unchanged — redacting there would break the raw-capture contract and make raw logs useless for debugging.

## Caveat

Secret detection has false-positive risk (commit SHAs, content hashes, long UUIDs can look high-entropy). Suggest opt-in initially + an allowlist for known-safe patterns, so legitimate identifiers are not mangled.
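
To make the trade-off concrete, here is a rough sketch of an entropy check with an allowlist. Everything in it is an assumption for illustration: the function names, the 3.5 bits-per-character threshold, and the allowlisted shapes (git SHA-1, SHA-256 digest, UUID) are not taken from OV.

```python
import math
import re
from collections import Counter

# Candidate tokens: runs of hex/base64-like characters of length >= 32.
CANDIDATE = re.compile(r"[A-Za-z0-9+/=_-]{32,}")

# Hypothetical allowlist of known-safe identifier shapes.
ALLOWLIST = [
    re.compile(r"[0-9a-f]{40}"),  # git SHA-1
    re.compile(r"[0-9a-f]{64}"),  # SHA-256 digest
    re.compile(
        r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
        r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
    ),  # UUID
]


def shannon_entropy(token):
    """Shannon entropy of the token's character distribution, in bits per character."""
    counts = Counter(token)
    n = len(token)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def find_high_entropy_tokens(text, threshold=3.5):
    """Return candidate tokens that are not allowlisted and meet the entropy threshold."""
    flagged = []
    for match in CANDIDATE.finditer(text):
        token = match.group(0)
        if any(p.fullmatch(token) for p in ALLOWLIST):
            continue  # known-safe shape, leave it untouched
        if shannon_entropy(token) >= threshold:
            flagged.append(token)
    return flagged
```

The allowlist cuts both ways: a 64-character lowercase hex secret has the same shape as a SHA-256 digest and would pass through unflagged, which is one more reason to keep this rule opt-in and configurable.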

## Why I'm filing

I have two open PRs in the auto-capture / compaction direction (#2874, #2853). This issue is the privacy complement to that storage work — reducing what gets captured/stored is half the story; ensuring secrets don't survive *curation* into vector-retrievable memory is the other half. Happy to contribute a PR if there's appetite for the approach above.

