fix(integration-vercel): drain dense burst seconds via sub-second slicing #629
Merged
…cing

Real-world Vercel projects (gjelina-hotel) routinely hit 1000+ `request-logs` pages in a single minute. The previous drain bottomed out at a one-minute floor: any minute denser than `pagesPerSubWindow` escalated to the floor-budget re-pull, and minutes denser than `FLOOR_SLICE_MAX_PAGES = 1000` failed the whole sync.

Symptom on gjelina-hotel: 44 of the last 50 syncs failed with `Vercel 1-minute slice holds more than 1000 pages`. Each failure left `lastSyncedAt` un-advanced, so the next sync retried the same dense minute. Once a 56-hour gap formed, the failing window aged past Vercel's plan retention and the lost data became unrecoverable.

Fix: lower `MIN_SUB_WINDOW_MS` from 60_000 to 1_000 so the drain can bisect into one-second slices. A dense minute now drains via 60 ordinary one-second slices instead of escalating to the floor-budget re-pull. Only a pathologically dense single second (1000+ pages of logs in one second) still genuinely fails.

Decouples the retention probe width: `RETENTION_PROBE_WINDOW_MS` is now its own constant (60_000) so reducing the drain floor does not narrow the probe to a sliver.

- `packages/integration-vercel/src/drain.ts`: new constants, error message in seconds, comments rewritten for the one-second floor
- `packages/integration-vercel/test/drain.test.ts`: existing tests updated for the second floor; new regression test asserts a dense minute drains via sub-second slicing without touching the floor budget
- `packages/integration-vercel/AGENTS.md`: documented the one-second floor and the gjelina-class burst minute it unblocks

No DB changes. Both callers (`syncTrafficSource`, `runBackfill`) continue to pass `pagesPerSubWindow: 50` and `pagesPerSubWindow: 1000` respectively; the fix is entirely inside the drain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
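The retention-probe decoupling in the commit message can be pictured with a minimal sketch. Only the two constant names and their values come from this PR; the helper `retentionProbeWindow`, the `ProbeWindow` type, and the choice to anchor the probe at a candidate start are assumptions made for illustration, and the real `resolveRetainedStart` may place its probe differently.

```typescript
// Values quoted in the PR description.
const MIN_SUB_WINDOW_MS = 1_000;          // drain floor, lowered from 60_000
const RETENTION_PROBE_WINDOW_MS = 60_000; // probe width, no longer tied to the floor

// Hypothetical shape for a time window in epoch milliseconds.
interface ProbeWindow {
  startMs: number;
  endMs: number;
}

// Hypothetical helper: builds the window used to test whether a candidate
// start is still inside retention. Its width depends only on
// RETENTION_PROBE_WINDOW_MS, so changing the drain floor cannot shrink it.
function retentionProbeWindow(candidateStartMs: number, rangeEndMs: number): ProbeWindow {
  const endMs = Math.min(candidateStartMs + RETENTION_PROBE_WINDOW_MS, rangeEndMs);
  return { startMs: candidateStartMs, endMs };
}

// With the values above, one probe spans as much time as 60 floor-width slices.
const FLOOR_SLICES_PER_PROBE = RETENTION_PROBE_WINDOW_MS / MIN_SUB_WINDOW_MS;
```

Had the probe kept reading `MIN_SUB_WINDOW_MS`, its width would have dropped to one second along with the floor, which is the narrowing this PR avoids.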
Summary
Vercel sync was silently losing traffic data on busy projects. Real-world `request-logs` minutes routinely hold more than 1000 pages of events. The drain bottomed out at a one-minute floor: any minute denser than `pagesPerSubWindow` escalated to the floor-budget re-pull, and any minute denser than `FLOOR_SLICE_MAX_PAGES = 1000` failed the whole sync.

This lowers `MIN_SUB_WINDOW_MS` from `60_000` to `1_000` so the drain bisects time all the way down to one-second slices. A dense minute now drains via 60 ordinary one-second slices instead of escalating. Only a pathologically dense single second (1000+ pages in one second) still genuinely fails.

Observed impact (gjelina-hotel)
- 44 of the last 50 syncs failed with `Vercel 1-minute slice holds more than 1000 pages and cannot be drained further.`
- Each failure left `lastSyncedAt` un-advanced, so retries hit the same dense minute, producing rolling gaps up to 56 hours wide.
- The `traffic.source.recent-data` check stayed green throughout (recent data did exist; it was just an old, partial snapshot), so the failure was invisible at the dashboard level.
- … `past2Weeks` filter in the Vercel UI is available), so the worst observed 56-hour gap sits inside retention with ~11.7 days of headroom. Backfill over the past 14 days after deploy will close the gaps.

Mechanism the fix unblocks
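A minimal sketch of the bisecting drain, assuming the behavior described in the summary: pull a window with the caller's page budget, split it in half when the budget overflows, escalate to the floor budget only at the one-second floor, and fail only when even that overflows. The constant names and values come from this PR; `drainWindow`, `PullPages`, `TimeWindow`, the second-aligned midpoint, and the error wording are assumptions for illustration, not the code in `drain.ts`.

```typescript
const MIN_SUB_WINDOW_MS = 1_000;     // drain floor after this PR (was 60_000)
const FLOOR_SLICE_MAX_PAGES = 1_000; // page budget for a re-pull at the floor

interface TimeWindow {
  startMs: number;
  endMs: number;
}

// Hypothetical stand-in for the paginated request-logs pull: returns the page
// count of the window, or null when it holds more than maxPages.
type PullPages = (w: TimeWindow, maxPages: number) => number | null;

// Returns the total number of pages drained from the window.
function drainWindow(w: TimeWindow, pagesPerSubWindow: number, pull: PullPages): number {
  const widthMs = w.endMs - w.startMs;

  // Ordinary pull with the caller's budget (50 for sync, 1000 for backfill).
  const pages = pull(w, pagesPerSubWindow);
  if (pages !== null) return pages;

  if (widthMs <= MIN_SUB_WINDOW_MS) {
    // At the floor: escalate to the floor budget, then give up.
    const floorPages = pull(w, FLOOR_SLICE_MAX_PAGES);
    if (floorPages !== null) return floorPages;
    // Hypothetical message; the PR only says the real one is phrased in seconds.
    throw new Error(
      `Vercel ${widthMs / 1000}-second slice holds more than ${FLOOR_SLICE_MAX_PAGES} pages`,
    );
  }

  // Bisect, keeping the midpoint on a whole multiple of the floor so that a
  // dense minute bottoms out in exactly 60 one-second slices.
  const halfMs = Math.floor(widthMs / 2 / MIN_SUB_WINDOW_MS) * MIN_SUB_WINDOW_MS;
  const midMs = w.startMs + Math.max(MIN_SUB_WINDOW_MS, halfMs);
  return (
    drainWindow({ startMs: w.startMs, endMs: midMs }, pagesPerSubWindow, pull) +
    drainWindow({ startMs: midMs, endMs: w.endMs }, pagesPerSubWindow, pull)
  );
}
```

Under this sketch, a minute holding 40 pages in every second drains with a sync budget of 50 through 60 ordinary one-second pulls and never reaches the floor-budget branch. With the old 60_000 floor the same minute would have gone straight to the floor-budget re-pull.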
Decoupled retention probe width
`resolveRetainedStart` used `MIN_SUB_WINDOW_MS` for its retention-probe tail window. Lowering the floor to 1s would have narrowed that probe to a sliver. The fix introduces `RETENTION_PROBE_WINDOW_MS = 60_000` and uses it for the probe: orthogonal to the drain floor, kept at one minute so a successful probe reliably means "Vercel will serve this range."

Test coverage
… `drains a dense one-second slice with the large floor page budget`, `throws only when a one-second slice overflows even the floor budget`).

No callers changed
`packages/api-routes/src/traffic.ts` still passes `pagesPerSubWindow: 50` (sync) and `pagesPerSubWindow: 1000` (backfill). Fix is entirely inside the drain.

Test plan
- `pnpm typecheck && pnpm lint && pnpm test` clean (3279/3279 tests pass)
- `integration-vercel` suite specifically: 21/21 tests pass
- … `Vercel ... slice holds more than ...` errors over a 24h window
- `lastSyncedAt` on `gjelina_hotel:vercel:prj_RIyGZN0PsR5SuMhMQUMVc2nrwf6E` starts advancing
- … `cnry traffic backfill gjelina_hotel --source <id> --start <14d-ago> --end <now>`) to close the historical gaps that the sync left behind. Backfill uses `BACKFILL_MAX_PAGES = 1000` per slice (20x the sync budget) plus the new 1s floor, so even the densest historical minute drains via ordinary slicing.

🤖 Generated with Claude Code