[Bug]: LanceDB index grows unbounded and fills the disk when cascade compaction/prune silently fails #315

Description

@dolphinsue319

Area

src/everos

What happened?

The on-disk LanceDB index under ~/.everos/.index/lancedb can grow without bound
until it fills the entire disk, while the Markdown source of truth stays tiny.

On a self-hosted single-user instance I have hit this twice:

  • atomic_fact.lance bloated to 318 GB (8,974 data fragments, 1,759 versions)
    while the live data was only a few MB and the Markdown truth was ~26 MB.
  • It recurred and reached 583 GB, filling the volume to 0 bytes free (the shell
    itself started returning ENOSPC).

The root problem is that the cascade maintenance worker swallows every
compaction/prune failure and keeps running, so stale LanceDB versions accumulate
forever with no cap, no metric, and no user-visible signal:

  • memory/cascade/worker.py _run_optimize_once wraps optimize()/cleanup in a
    broad except Exception: # never crash the daemon, logs a
    cascade_lancedb_optimize_failed warning, and continues.
  • There is no max-version / max-size cap on the index, no index-size health metric,
    and no prune/maintenance CLI to recover.

Once compaction is broken, prune never runs → versions pile up → disk fills →
death spiral: a full disk means compaction can't even write its temp scratch,
so it can never prune itself back down.
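The failure chain above can be sketched in a few lines. This is an illustration only, not the code in memory/cascade/worker.py: `compact` and `cleanup` are hypothetical callables standing in for the two phases that optimize() runs in order, and both function names are made up for this example. The first function mimics the reported behaviour (one broad except around both phases); the second shows one possible alternative where a compaction failure does not prevent pruning.

```python
from typing import Callable, Dict


def run_maintenance_coupled(
    compact: Callable[[], None], cleanup: Callable[[], None]
) -> bool:
    """Mimic the reported behaviour: both phases share one broad except."""
    try:
        compact()  # if this raises ...
        cleanup()  # ... this line is never reached, so nothing is reclaimed
        return True
    except Exception:  # "never crash the daemon"
        return False


def run_maintenance_decoupled(
    compact: Callable[[], None], cleanup: Callable[[], None]
) -> Dict[str, bool]:
    """Alternative sketch: each phase fails on its own, so prune still runs."""
    results: Dict[str, bool] = {}
    for name, phase in (("compact", compact), ("cleanup", cleanup)):
        try:
            phase()
            results[name] = True
        except Exception:
            # Still swallowed, but the caller now sees which phase failed.
            results[name] = False
    return results
```

With a `compact` that always raises, the coupled variant never calls `cleanup`, while the decoupled variant reports the compaction failure and still prunes.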

Two failure modes that break compaction

  1. FD exhaustion (EMFILE / os error 24). LanceDB maintenance needs ~290 FDs
    (per docs/cascade_runbook.md), but a daemon launched under macOS's default soft
    limit of 256 (launchctl limit maxfiles) hits EMFILE on every cleanup cycle.
    Logs showed thousands of os error 24 / "Too many open files". Raising the
    launcher's NumberOfFiles soft limit to 8192 fixed this mode.
  2. lance list-encoding corruption (persists even after the FD fix).
    optimize() dies with a lance 7.0.0 error like
    Max offset of 648640 exceeds length of values 466149 on an atomic_fact
    list<...> column (list.rs). Because optimize() runs compaction before
    cleanup, it never reaches the cleanup step → reclaims nothing → unbounded growth.
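For the first failure mode, a daemon could also guard against a low FD limit at startup rather than relying on the launcher. The following is a minimal sketch using Python's standard `resource` module (Unix only); the function name `ensure_fd_soft_limit` is hypothetical and nothing like it is claimed to exist in EverOS.

```python
import resource


def ensure_fd_soft_limit(minimum: int) -> int:
    """Raise the soft RLIMIT_NOFILE to at least `minimum`, capped by the hard limit.

    Never lowers an existing limit. Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != resource.RLIM_INFINITY and soft < minimum:
        # An unprivileged process may raise its soft limit only up to the hard limit.
        target = minimum if hard == resource.RLIM_INFINITY else min(minimum, hard)
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        soft = target
    return soft
```

Calling `ensure_fd_soft_limit(8192)` early in daemon startup would match the value that fixed this mode above. `setrlimit` can still raise `ValueError` or `OSError` if the OS rejects the requested value, so a caller would want to log that rather than ignore it.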

Steps to reproduce

  1. Run the EverOS daemon continuously and keep adding memories so the cascade worker
    compacts/prunes on its normal schedule.
  2. Cause compaction to fail — easiest is to launch the daemon under a low FD soft
    limit (macOS default 256), or let the lance list-encoding error above occur on
    atomic_fact.
  3. Watch du -sh ~/.everos/.index/lancedb climb into the tens/hundreds of GB while
    ~/.everos/evermem/**.md stays a few tens of MB.
  4. grep for cascade_lancedb_optimize_failed in the logs. Failures are logged as
    warnings only; the daemon keeps serving and never surfaces the bloat.

Environment

  • OS: macOS (Darwin), single-user self-host, LaunchAgent
  • EverOS: 1.0.0 and 1.1.0 (reproduced on both)
  • lance / lancedb: 7.0.0
  • Markdown truth ~34 MB; index bloated to 318 GB then 583 GB

Workaround

  • When compaction is broken but the disk still has room: stop the daemon and call
    lance's cleanup_old_versions(older_than=timedelta(0), delete_unverified=True)
    directly on each *.lance dir. This bypasses the broken compaction step that
    Table.optimize() runs first and reclaims the stale versions (row counts
    unchanged).
  • On a full disk even that direct cleanup is impossible (no scratch space). Recovery:
    stop daemon → rm -rf ~/.everos/.index/{lancedb,sqlite} (the .index is 100%
    rebuildable; the Markdown at ~/.everos/evermem is the truth) → restart → re-embed
    from Markdown (slow).
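The first workaround can be scripted roughly as follows. This is a sketch, assuming the `lance` Python package (pylance) is installed and that every table lives in a `*.lance` directory directly under the index root; the helper names `find_lance_dirs` and `prune_all` are made up for this example. Run it only with the daemon stopped.

```python
from datetime import timedelta
from pathlib import Path
from typing import List


def find_lance_dirs(index_root: Path) -> List[Path]:
    """Return every *.lance dataset directory directly under index_root."""
    return sorted(p for p in index_root.glob("*.lance") if p.is_dir())


def prune_all(index_root: Path) -> None:
    """Drop old versions of each dataset without running compaction."""
    import lance  # imported lazily so directory discovery works without it

    for path in find_lance_dirs(index_root):
        ds = lance.dataset(str(path))
        rows_before = ds.count_rows()
        # timedelta(0) removes every version except the latest one;
        # delete_unverified=True also removes files not yet old enough to be
        # verified as orphaned, which is only safe while nothing is writing.
        stats = ds.cleanup_old_versions(
            older_than=timedelta(0), delete_unverified=True
        )
        rows_after = lance.dataset(str(path)).count_rows()
        print(path.name, stats, "rows:", rows_before, "->", rows_after)
```

A typical invocation would be `prune_all(Path.home() / ".everos" / ".index" / "lancedb")`. Printing the row counts before and after gives a quick check that only stale versions were removed.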

Suggested fixes

  • Add a hard cap on index version count / size, or a watchdog that prunes when the
    index greatly exceeds the Markdown footprint.
  • Surface a health signal / metric when optimize() fails repeatedly instead of only
    a swallowed warning (e.g. expose index size + last-successful-compaction in
    /health or a status command).
  • Ship a first-class everos index prune / maintenance CLI that calls
    cleanup_old_versions directly (works even when optimize() compaction is broken).
  • Raise the FD soft limit in the bundled launchers/docs so EMFILE can't silently
    break maintenance out of the box.
  • Fix or work around the underlying lance list-encoding compaction bug
    (pin/upgrade lance, or rewrite the affected list<...> column).
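The watchdog and health-signal ideas above could be combined in a small check like the one below. It is a sketch under stated assumptions: the class `IndexHealth`, the helper `dir_size_bytes`, and both thresholds (a 50x size ratio, 3 consecutive failures) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List


def dir_size_bytes(root: Path) -> int:
    """Total size of regular files under root (symlinks are skipped)."""
    return sum(
        p.stat().st_size
        for p in root.rglob("*")
        if p.is_file() and not p.is_symlink()
    )


@dataclass
class IndexHealth:
    max_ratio: float = 50.0            # hypothetical: index size vs Markdown size
    max_consecutive_failures: int = 3  # hypothetical threshold
    consecutive_failures: int = 0

    def record_optimize(self, ok: bool) -> None:
        """Call after each maintenance cycle instead of only logging a warning."""
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1

    def problems(self, index_bytes: int, markdown_bytes: int) -> List[str]:
        """Return human-readable problems; an empty list means healthy."""
        found: List[str] = []
        if self.consecutive_failures >= self.max_consecutive_failures:
            found.append(
                f"optimize failed {self.consecutive_failures} times in a row"
            )
        if markdown_bytes > 0 and index_bytes > self.max_ratio * markdown_bytes:
            ratio = index_bytes / markdown_bytes
            found.append(f"index is {ratio:.0f}x the Markdown footprint")
        return found
```

The result of `problems(...)` could feed /health or a status command, and a non-empty list could trigger a direct cleanup_old_versions pass. With the figures from this report (a 318 GB index against tens of MB of Markdown), the ratio check would fire long before the disk fills.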
