feat(memory): scrub high-entropy secrets before persisting extracted memories + indexing VectorDB #2899

Description

@lg320531124

What happened

While debugging an OV (OpenViking memory server) configuration, an API key was pasted into a Claude Code conversation that OV's auto-capture hook records. OV captured the full message stream into a session (raw messages.jsonl, expected). The memory extraction pipeline then copied the key verbatim into a durable structured memory under data/viking/.../memories/events/<date>/<name>.md and indexed it into the vector store.

Result: the secret became retrievable across future sessions via find / search — a vector recall for an unrelated token surfaced the memory and returned the plaintext key into the LLM context.

Expected

The extraction pipeline's quality gates (the read-before-write + 6-gate filter that decides what becomes a durable memory) should prevent secrets from surviving into curated memories and the vector index.

Raw session capture containing a secret is expected — that's raw conversation, and redacting there would alter the raw-capture contract. But a curated, durable, vector-indexed memory containing a secret contradicts the gates' purpose: the whole point of the filter is to discard what should not be persisted.

Repro

  1. Run a session where one message contains a test-shaped secret, e.g. sk-test-FAKE0123456789abcdef.
  2. Trigger extraction (session commit / extract / SessionEnd).
  3. Inspect data/viking/.../memories/events/<date>/ — an extracted memory contains the secret verbatim.
  4. search / find for a token that appears near the secret — the vector store returns that memory, surfacing the secret into context.
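
Step 3 of the repro can be automated with a small check like the one below. This is only a sketch: `find_leaks` is a hypothetical helper, not part of OV, and the memory directory is passed in by the caller because the exact path depends on the local setup.

```python
from pathlib import Path


def find_leaks(memory_root, needle):
    """Return the .md files under memory_root whose text contains needle."""
    hits = []
    for path in sorted(Path(memory_root).rglob("*.md")):
        # errors="replace" keeps the scan going if a file is not valid UTF-8.
        if needle in path.read_text(encoding="utf-8", errors="replace"):
            hits.append(path)
    return hits
```

Calling it with the local memories directory and the test secret from step 1 should return a non-empty list while the bug is present, and an empty list once a scrub gate is in place.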

Proposal

Add a secret-scrub gate between LLM extraction and persistence:

  • Regex patterns for common secret shapes:
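    • sk-[A-Za-z0-9_-]{16,} (OpenAI-style; the class includes - and _ so hyphenated keys such as the repro's sk-test-FAKE0123456789abcdef also match)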
    • AQ[A-Za-z0-9_-]{20,} (Gemini-style)
    • xox[baprs]-[A-Za-z0-9-]+ (Slack)
    • Bearer\s+[A-Za-z0-9._-]+
    • high-entropy hex / base64 of length ≥ 32
  • On match in an extracted memory body: replace with REDACTED_SECRET (or skip vector-indexing that memory, or flag it for review).
  • Configurable via env / ov.conf: pattern list + on/off, so users can extend for internal key shapes; default conservative to avoid false-positive scrubbing of legitimate tokens / UUIDs / commit SHAs.
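
A minimal sketch of the regex part of the gate, to make the proposal concrete. The names `SECRET_PATTERNS`, `PLACEHOLDER` and `scrub_secrets` are illustrative assumptions, not existing OV APIs. The sk- pattern here allows - and _ in the key body and adds a leading word boundary, since the stricter sk-[A-Za-z0-9]{16,} would not match the repro's sk-test-FAKE0123456789abcdef; the high-entropy rule is left out of this sketch.

```python
import re

# Patterns taken from the proposal, except where noted.
SECRET_PATTERNS = [
    # Assumption: \b avoids matching inside words like "task-...",
    # and - / _ are allowed so hyphenated keys are caught.
    re.compile(r"\bsk-[A-Za-z0-9_-]{16,}"),
    re.compile(r"AQ[A-Za-z0-9_-]{20,}"),
    re.compile(r"xox[baprs]-[A-Za-z0-9-]+"),
    re.compile(r"Bearer\s+[A-Za-z0-9._-]+"),
]

PLACEHOLDER = "REDACTED_SECRET"


def scrub_secrets(body, patterns=SECRET_PATTERNS):
    """Replace every pattern match in body; return (scrubbed_body, hit_count)."""
    hits = 0
    for pattern in patterns:
        body, n = pattern.subn(PLACEHOLDER, body)
        hits += n
    return body, hits
```

The hit count lets the caller choose between the options above: persist the scrubbed body, skip vector-indexing when `hits > 0`, or flag the memory for review.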

Scope

Extraction layer only. Session-layer raw capture (messages.jsonl) intentionally unchanged — redacting there would break the raw-capture contract and make raw logs useless for debugging.

Caveat

Secret detection has false-positive risk (commit SHAs, content hashes, long UUIDs can look high-entropy). Suggest opt-in initially + an allowlist for known-safe patterns, so legitimate identifiers are not mangled.
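
One way the high-entropy rule and the allowlist could fit together is sketched below. Everything here is an assumption for illustration: the candidate character class, the allowlist entries, and the 4.0 bits-per-character threshold would all need tuning against real memories.

```python
import math
import re
from collections import Counter

# Candidate tokens: runs of base64 / hex / URL-safe characters, length >= 32.
CANDIDATE = re.compile(r"[A-Za-z0-9+/=_-]{32,}")

# Known-safe shapes that should never be scrubbed (hypothetical defaults).
ALLOWLIST = [
    re.compile(r"[0-9a-f]{40}"),  # git SHA-1
    re.compile(r"[0-9a-f]{64}"),  # SHA-256 content hash
    re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"),  # UUID
]


def shannon_entropy(token):
    """Empirical Shannon entropy of token, in bits per character."""
    counts = Counter(token)
    n = len(token)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def is_probable_secret(token, threshold=4.0):
    """True if token is not allowlisted and its entropy meets the threshold."""
    if any(p.fullmatch(token) for p in ALLOWLIST):
        return False
    return shannon_entropy(token) >= threshold


def find_high_entropy(body, threshold=4.0):
    """Return the candidate tokens in body that look like secrets."""
    return [t for t in CANDIDATE.findall(body) if is_probable_secret(t, threshold)]
```

The trade-off is visible in the allowlist itself: exempting every 40-character lowercase hex string protects commit SHAs but also lets a hex-encoded secret of the same shape through, which is one more reason to keep the feature opt-in at first.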

Why I'm filing

I have two open PRs in the auto-capture / compaction direction (#2874, #2853). This issue is the privacy complement to that storage work — reducing what gets captured/stored is half the story; ensuring secrets don't survive curation into vector-retrievable memory is the other half. Happy to contribute a PR if there's appetite for the approach above.
