fix(integration-vercel): drain dense burst seconds via sub-second slicing#629

Merged
arberx merged 2 commits into main from feat/vercel-dense-slice-subdivision on May 25, 2026

@arberx arberx commented May 24, 2026

Summary

Vercel sync was silently losing traffic data on busy projects. Real-world request-logs minutes routinely hold more than 1000 pages of events. The drain bottomed out at a one-minute floor: any minute denser than pagesPerSubWindow escalated to the floor-budget re-pull, and any minute denser than FLOOR_SLICE_MAX_PAGES = 1000 failed the whole sync.

This lowers MIN_SUB_WINDOW_MS from 60_000 to 1_000 so the drain bisects time all the way down to one-second slices. A dense minute now drains via 60 ordinary one-second slices instead of escalating. Only a pathologically dense single second (1000+ pages in one second) still genuinely fails.

Observed impact (gjelina-hotel)

  • 44 of the last 50 syncs failed with the error "Vercel 1-minute slice holds more than 1000 pages and cannot be drained further".
  • Each failure left lastSyncedAt un-advanced, so retries hit the same dense minute, producing rolling gaps up to 56 hours wide.
  • The doctor's traffic.source.recent-data check stayed green throughout (recent data did exist — it was just an old, partial snapshot), so the failure was invisible at the dashboard level.
  • Gaps are recoverable. Vercel retention on this project is 14 days (the past2Weeks filter in the Vercel UI is available), so the worst observed 56-hour gap sits inside retention with ~11.7 days of headroom. Backfill over the past 14 days after deploy will close them.
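
The headroom figure follows from simple arithmetic on the numbers quoted above. A minimal check, where the variable names are illustrative and only the 14-day retention and 56-hour gap come from this PR:

```typescript
// Worked arithmetic for the retention headroom quoted in the bullet above.
// Variable names are illustrative; the inputs are the figures from this PR.
const HOUR_MS = 3_600_000;
const DAY_MS = 24 * HOUR_MS;

const retentionMs = 14 * DAY_MS; // Vercel retention on this project
const worstGapMs = 56 * HOUR_MS; // widest observed gap

// (336h - 56h) / 24h = 11.67 days of headroom
const headroomDays = (retentionMs - worstGapMs) / DAY_MS;
console.log(headroomDays.toFixed(1)); // prints "11.7"
```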

Mechanism the fix unblocks

Before:                                                          After:
[minute, 50 pages cap] -> overflow                               [minute, 50 pages cap] -> overflow
  floor (60s = 1 minute) reached, no further bisection             bisect to 30s -> overflow
  re-pull at 1000 page budget -> overflow -> THROW                   bisect to 15s -> overflow
                                                                       ... bisect to 1s -> 50 pages -> drain ok
                                                                   60 clean one-second pulls cover the minute
                                                                   full minute drained, cursor advances
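
The bisection described above can be sketched roughly as follows. This is a minimal illustration, not the code in drain.ts: `drainWindow`, the `Fetcher` type, the floor-aligned midpoint, and the error text are all assumptions made for the example. Only MIN_SUB_WINDOW_MS, FLOOR_SLICE_MAX_PAGES, and pagesPerSubWindow are names taken from this PR.

```typescript
// Hypothetical sketch of a bisecting drain with a one-second floor.
const MIN_SUB_WINDOW_MS = 1_000;
const FLOOR_SLICE_MAX_PAGES = 1_000;

type Pull = { events: string[]; overflowed: boolean };
// Stand-in for the paginated request-logs fetch over [startMs, endMs).
type Fetcher = (startMs: number, endMs: number, maxPages: number) => Pull;

function drainWindow(
  startMs: number,
  endMs: number,
  pagesPerSubWindow: number,
  fetchSlice: Fetcher,
): string[] {
  const first = fetchSlice(startMs, endMs, pagesPerSubWindow);
  if (!first.overflowed) return first.events;

  const width = endMs - startMs;
  if (width <= MIN_SUB_WINDOW_MS) {
    // At the floor: one re-pull with the larger floor budget, then give up.
    const retry = fetchSlice(startMs, endMs, FLOOR_SLICE_MAX_PAGES);
    if (retry.overflowed) {
      throw new Error(
        `slice of ${width / 1000}s holds more than ${FLOOR_SLICE_MAX_PAGES} pages`,
      );
    }
    return retry.events;
  }

  // Split on a floor-aligned midpoint so the left half is never
  // narrower than the floor.
  const half = Math.max(
    MIN_SUB_WINDOW_MS,
    Math.floor(width / 2 / MIN_SUB_WINDOW_MS) * MIN_SUB_WINDOW_MS,
  );
  const mid = startMs + half;
  return [
    ...drainWindow(startMs, mid, pagesPerSubWindow, fetchSlice),
    ...drainWindow(mid, endMs, pagesPerSubWindow, fetchSlice),
  ];
}
```

With a 50-page budget and a minute whose seconds each hold fewer than 50 pages, this sketch reaches one-second slices and never issues a floor-budget re-pull, which is the behavior the new regression test asserts.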

Decoupled retention probe width

resolveRetainedStart used MIN_SUB_WINDOW_MS for its retention-probe tail window. Lowering the floor to 1s would have narrowed that probe to a sliver. The fix introduces RETENTION_PROBE_WINDOW_MS = 60_000 and uses it for the probe — orthogonal to the drain floor, kept at one minute so a successful probe reliably means "Vercel will serve this range."
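
The decoupling amounts to giving the probe its own constant. A minimal sketch, assuming a hypothetical helper `retentionProbeWindow` (the real resolveRetainedStart signature and the exact placement of the probe window are not shown in this PR):

```typescript
// Sketch only: the two constants are named in this PR; the helper is hypothetical.
const MIN_SUB_WINDOW_MS = 1_000; // drain floor after this change
const RETENTION_PROBE_WINDOW_MS = 60_000; // probe width, kept at one minute

// Hypothetical helper: the window a retention probe would request for a
// candidate start. Its width no longer depends on the drain floor.
function retentionProbeWindow(candidateStartMs: number): { startMs: number; endMs: number } {
  return {
    startMs: candidateStartMs,
    endMs: candidateStartMs + RETENTION_PROBE_WINDOW_MS,
  };
}

// The probe stays 60x wider than the smallest drain slice.
const probeToFloorRatio = RETENTION_PROBE_WINDOW_MS / MIN_SUB_WINDOW_MS;
console.log(probeToFloorRatio); // prints 60
```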

Test coverage

  • Updated two existing tests for the one-second floor (drains a dense one-second slice with the large floor page budget, throws only when a one-second slice overflows even the floor budget).
  • New regression test asserts that a dense minute drains via one-second slicing without touching the floor-budget re-pull. The test fails if the drain ever escalates a floor-width slice for a minute that one-second slicing should have handled.

No callers changed

packages/api-routes/src/traffic.ts still passes pagesPerSubWindow: 50 (sync) and pagesPerSubWindow: 1000 (backfill). Fix is entirely inside the drain.

Test plan

  • pnpm typecheck && pnpm lint && pnpm test clean (3279/3279 tests pass)
  • integration-vercel suite specifically: 21/21 tests pass
  • After merge + fresh build + redeploy, monitor gjelina-hotel for "Vercel ... slice holds more than ..." errors over a 24h window
  • Confirm lastSyncedAt on gjelina_hotel:vercel:prj_RIyGZN0PsR5SuMhMQUMVc2nrwf6E starts advancing
  • Run a 14-day backfill (cnry traffic backfill gjelina_hotel --source <id> --start <14d-ago> --end <now>) to close the historical gaps that the sync left behind. Backfill uses BACKFILL_MAX_PAGES = 1000 per slice (20x the sync budget) plus the new 1s floor, so even the densest historical minute drains via ordinary slicing.

🤖 Generated with Claude Code

arberx and others added 2 commits May 24, 2026 20:42
…cing

Real-world Vercel projects (gjelina-hotel) routinely hit 1000+
`request-logs` pages in a single minute. The previous drain bottomed
out at a one-minute floor: any minute denser than `pagesPerSubWindow`
escalated to the floor-budget re-pull, and minutes denser than
`FLOOR_SLICE_MAX_PAGES = 1000` failed the whole sync.

Symptom on gjelina-hotel: 44 of the last 50 syncs failed with
`Vercel 1-minute slice holds more than 1000 pages`. Each failure left
`lastSyncedAt` un-advanced, so the next sync retried the same dense
minute, producing rolling gaps up to 56 hours wide. Left unfixed, a
failing window would eventually age past Vercel's plan retention and
the lost data would become unrecoverable.

Fix: lower `MIN_SUB_WINDOW_MS` from 60_000 to 1_000 so the drain can
bisect into one-second slices. A dense minute now drains via 60
ordinary one-second slices instead of escalating to the floor-budget
re-pull. Only a pathologically dense single second (1000+ pages of logs
in one second) still genuinely fails.

Decouples the retention probe width: `RETENTION_PROBE_WINDOW_MS` is now
its own constant (60_000) so reducing the drain floor does not narrow
the probe to a sliver.

- `packages/integration-vercel/src/drain.ts`: new constants, error
  message in seconds, comments rewritten for the one-second floor
- `packages/integration-vercel/test/drain.test.ts`: existing tests
  updated for the one-second floor; new regression test asserts a dense
  minute drains via one-second slicing without touching the floor budget
- `packages/integration-vercel/AGENTS.md`: documented the one-second
  floor and the gjelina-class burst minute it unblocks

No DB changes. Both callers (`syncTrafficSource`, `runBackfill`)
continue to pass `pagesPerSubWindow: 50` and `pagesPerSubWindow: 1000`
respectively — the fix is entirely inside the drain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@arberx arberx merged commit 9d670c6 into main May 25, 2026
12 checks passed
@arberx arberx deleted the feat/vercel-dense-slice-subdivision branch May 25, 2026 20:24