fix(scrapfly-webhooks): handle binary screenshot bodies and correct extraction payload path#66

Merged
leggetter merged 2 commits into main from fix/scrapfly-screenshot-and-extraction
May 15, 2026

@leggetter
Collaborator

Summary

Closes #65. Fixes two real bugs in scrapfly-webhooks caught while capturing live Scrapfly deliveries for the webhook-samples repo:

  1. Screenshot handler crashed on binary bodies. Dispatch reordered to happen before JSON parsing; screenshot path now treats the body as raw image bytes (it is — Scrapfly sends JPEG/PNG/WebP/GIF with a lying Content-Type: application/json).
  2. Extraction field path was wrong. payload.result.data → payload.data. The example used to silently log undefined.

What changed

Handler dispatch reordered (Express, Next.js, FastAPI + SKILL.md inline):

+ // Dispatch BEFORE JSON parse — screenshot deliveries are raw image bytes
+ if (resourceType === 'screenshot') {
+   console.log(`Screenshot received: ${req.body.length} bytes (binary)`);
+   return res.status(200).send('OK');
+ }
+
  const payload = JSON.parse(req.body.toString());
  switch (resourceType) {
    case 'scrape':       /* result.url / result.status_code (unchanged) */
-   case 'extraction':   console.log(payload.result?.data);
+   case 'extraction':   console.log({ content_type: payload.content_type, data: payload.data });
-   case 'screenshot':   console.log(payload.result?.screenshot_url);
+   // (no longer reached — handled above)
  }
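
The reordered dispatch can be sketched as a standalone function, independent of any framework. This is a minimal illustration rather than the code in the PR: `routeDelivery` and its return shape are hypothetical, and it assumes signature verification has already succeeded on the same raw bytes.

```javascript
// Sketch: route on the resource type before attempting any JSON parsing.
// `routeDelivery` and its return shape are hypothetical, for illustration only.
function routeDelivery(resourceType, rawBody) {
  // Screenshot deliveries are raw image bytes, so never JSON-parse them.
  if (resourceType === 'screenshot') {
    return { kind: 'screenshot', bytes: rawBody.length };
  }

  // Every other resource type is assumed to carry a JSON body here.
  const payload = JSON.parse(rawBody.toString('utf8'));

  switch (resourceType) {
    case 'scrape':
      return { kind: 'scrape', url: payload.result?.url, status: payload.result?.status_code };
    case 'extraction':
      // Extraction fields live at the top level, not under `result`.
      return { kind: 'extraction', contentType: payload.content_type, data: payload.data };
    default:
      return { kind: 'unhandled', resourceType };
  }
}
```

Returning a plain object instead of writing to the response keeps the routing logic testable without spinning up Express, Next.js, or FastAPI.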

Next.js framework fix: body reader switched from await request.text() to Buffer.from(await request.arrayBuffer()). request.text() UTF-8-decodes the bytes and corrupts binary screenshot bodies; arrayBuffer() preserves them so the HMAC verifies correctly.
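
The corruption can be reproduced outside Next.js with plain Node buffers. The sketch below simulates the two read paths instead of calling the real `request` object; the secret and the six sample bytes are placeholders, and `sign` is a hypothetical helper that follows the uppercase-hex HMAC-SHA256 scheme described in this PR.

```javascript
const crypto = require('node:crypto');

// HMAC-SHA256 as uppercase hex; the secret used below is a placeholder.
function sign(body, secret) {
  return crypto.createHmac('sha256', secret).update(body).digest('hex').toUpperCase();
}

// Leading bytes of a JPEG file. 0xff is never valid in UTF-8.
const original = Buffer.from([0xff, 0xd8, 0xff, 0xe0, 0x00, 0x10]);

// What a text-based reader effectively does: decode as UTF-8, then re-encode.
const viaText = Buffer.from(original.toString('utf8'), 'utf8');

// What an arrayBuffer-based reader does: hand over the bytes untouched.
const arrayBuffer = original.buffer.slice(
  original.byteOffset,
  original.byteOffset + original.byteLength
);
const viaArrayBuffer = Buffer.from(arrayBuffer);

const secret = 'example-secret';
console.log(viaText.equals(original));                          // false
console.log(viaArrayBuffer.equals(original));                   // true
console.log(sign(viaText, secret) === sign(original, secret));  // false
```

Invalid byte sequences are replaced with U+FFFD during decoding, so the re-encoded body no longer matches what Scrapfly signed.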

SKILL.md additions:

  • New Prerequisites section flagging the paid-plan requirement (was only in references/setup.md; pulled forward for visibility — FREE-tier accounts hide the webhook UI entirely and have a queue size of 0).
  • New "Screenshot is binary, not JSON" bullet in the key facts list, explaining the upstream Content-Type quirk.
  • New "Hookdeck Event Gateway alternative" bullet pointing to the built-in SCRAPFLY source type for edge verification.

references/verification.md additions:

  • "Screenshot deliveries are binary, not JSON" callout in the common gotchas, with the dispatch-before-parse fix.
  • "Alternative: Verify at the Gateway with Hookdeck" section covering the SCRAPFLY source-type plus the known May 2026 caveat (the Content-Type / binary-body mismatch causes Hookdeck to reject screenshot deliveries with UNPARSABLE_JSON when the preset is enabled; route screenshots directly until upstream resolves the Content-Type).
  • "Parsed JSON breaks signatures" entry updated to mention the Next.js arrayBuffer() fix.

Tests:

  • Two new tests per framework: a screenshot delivery with a binary body (12-byte minimal JPEG header) that asserts the handler returns 200 without crashing, and an invalid-signature variant on the binary body that asserts 401.
  • Extraction test rewritten to use the real captured shape ({ content_type, data: {...} }) instead of the bogus { result: { data: {...} } }.
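
A framework-free sketch of what the two screenshot tests check is shown below. It is not the test code from the PR: `sign` and `statusFor` are hypothetical helpers, the secret is a placeholder, and the 12 bytes are an illustrative JPEG/JFIF header that may differ from the fixture the suites actually use.

```javascript
const crypto = require('node:crypto');

// Uppercase-hex HMAC-SHA256 over the raw body bytes.
function sign(rawBody, secret) {
  return crypto.createHmac('sha256', secret).update(rawBody).digest('hex').toUpperCase();
}

// Returns the HTTP status a handler would send (hypothetical helper).
function statusFor(rawBody, signature, secret) {
  const expected = Buffer.from(sign(rawBody, secret), 'utf8');
  const received = Buffer.from(String(signature ?? ''), 'utf8');
  // timingSafeEqual throws on a length mismatch, so compare lengths first.
  const ok = expected.length === received.length &&
    crypto.timingSafeEqual(expected, received);
  return ok ? 200 : 401;
}

// 12-byte JPEG/JFIF header standing in for a screenshot body (illustrative bytes).
const jpegHeader = Buffer.from([
  0xff, 0xd8, 0xff, 0xe0, 0x00, 0x10, 0x4a, 0x46, 0x49, 0x46, 0x00, 0x01,
]);
const secret = 'test-secret';

console.log(statusFor(jpegHeader, sign(jpegHeader, secret), secret)); // 200
console.log(statusFor(jpegHeader, 'DEADBEEF', secret));               // 401
```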

Test plan

  • cd skills/scrapfly-webhooks/examples/express && npm test — 24 passed, 0 failed
  • cd skills/scrapfly-webhooks/examples/nextjs && npm test — 17 passed, 0 failed
  • cd skills/scrapfly-webhooks/examples/fastapi && pytest test_webhook.py -v — 19 passed, 0 failed
  • Real-world smoke-test against a live Scrapfly screenshot delivery (out of scope for the PR; relies on a paid Scrapfly account)

Not addressed in this PR

From issue #65, point 3.c (npx install path verification): we already use npx hookdeck-cli listen as the canonical pattern repo-wide (per AGENTS.md, set in PR #40 + #54) and have verified it works against the published hookdeck-cli npm package. Leaving as-is.

Unverified strengths preserved

The signature verification code (JS + Python), the full header list, timing-safe comparison, raw-body capture with express.raw({ type: '*/*' }) / request.arrayBuffer() / await request.body(), the no-replay-window note, and routing by X-Scrapfly-Webhook-Resource-Type all remain unchanged — they matched live behaviour and weren't part of the bug report.

https://claude.ai/code/session_01NNTgQRJss1V7gyzzJ9rjnB


Generated by Claude Code

claude added 2 commits May 15, 2026 14:51
…xtraction payload path

Fixes the two bugs reported in #65 (caught while capturing real Scrapfly
deliveries for the webhook-samples repo):

1. **Screenshot handler crashed on binary bodies.** Scrapfly's Screenshot
   API delivers raw image bytes (JPEG / PNG / WebP / GIF) but sets
   `Content-Type: application/json` — an upstream quirk that makes the
   header lie about the body. The previous example always ran
   `JSON.parse(req.body)` after signature verification, so screenshot
   deliveries verified successfully and then 500'd on parse. Fixed by
   dispatching on `X-Scrapfly-Webhook-Resource-Type` BEFORE JSON parsing
   in all three example handlers (Express / Next.js / FastAPI) and the
   SKILL.md inline code. Screenshot is now handled as a binary path that
   logs the byte count and exits without parsing; the Next.js handler
   also switched from `request.text()` to `request.arrayBuffer()` so
   binary bytes survive intact through the framework.

2. **Extraction field path was wrong.** Real extraction bodies expose
   the fields at `payload.data`, not `payload.result.data`. Fixed in
   SKILL.md and all three example handlers. Tests now use the real
   shape `{ content_type, data: {...} }` captured from live deliveries.

Also:

- Added a prominent **Prerequisites** section to SKILL.md noting the
  paid-plan requirement (FREE-tier accounts hide the webhook UI and
  have a queue size of 0, so no deliveries ever fire). Was already in
  `references/setup.md`; pulled it to SKILL.md for visibility.
- Added an **Alternative: Verify at the Gateway with Hookdeck**
  section in `references/verification.md`, calling out Hookdeck's
  built-in `SCRAPFLY` source-type that does edge verification, plus
  the known May 2026 caveat that the Content-Type / binary-body
  mismatch causes Hookdeck to reject screenshot deliveries with
  UNPARSABLE_JSON when the preset is enabled.
- Added two tests per framework covering the binary screenshot path
  (verifies HMAC over raw image bytes + handler doesn't try to JSON-
  parse) and the corrected extraction shape. All three example suites
  pass: 24 Express, 17 Next.js, 19 FastAPI.

Verification (HMAC-SHA256, uppercase hex, dual-case headers, raw-body
capture, no-replay-window, dispatch by resource type, crawler events)
is unchanged — those parts of the skill already matched live behaviour.

Closes #65.

https://claude.ai/code/session_01NNTgQRJss1V7gyzzJ9rjnB
…board setting, not "always JSON"

Follow-up to the previous commit on this PR. The earlier framing said
"Scrapfly nevertheless sets Content-Type: application/json on the
request" which is technically wrong — Scrapfly's webhook config exposes
a Content-Type dropdown with `application/json` (default) and
`application/msgpack` options, and the configured value is what gets
sent verbatim on every delivery. The actual upstream quirk is narrower:
Scrapfly Screenshot deliveries are raw image bytes regardless of which
Content-Type you configured (the configured value is sent in the header
but doesn't change the body for image deliveries).

Updated five places to reframe accordingly:

- SKILL.md "Screenshot is binary" bullet and the inline-handler comment
- references/verification.md "Screenshot deliveries are binary" gotcha
  and the Hookdeck-Gateway caveat
- references/setup.md gains a Content-Type bullet in the dashboard
  config steps, calling out that JSON is the default and that msgpack
  users need to swap the parser in handler scrape/extraction branches
- Express, Next.js, FastAPI handler comments

Added a new gotcha entry in references/verification.md and a comment in
each handler example explaining how to handle msgpack-configured
webhooks (the dispatch-before-parse pattern is unchanged; only the
parse step needs to swap to a msgpack decoder).
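
One way to sketch that parser swap, without claiming it matches the comments added to the handlers: `parseBody` is a hypothetical helper, and the msgpack decoder is passed in by the caller (for example `decode` from the `@msgpack/msgpack` package) so that the snippet has no dependencies. It is meant to run only after the screenshot branch has returned.

```javascript
// Sketch: choose the body parser from the Content-Type configured on the webhook.
// `decodeMsgpack` is supplied by the caller; this snippet bundles no decoder.
function parseBody(rawBody, contentType, decodeMsgpack) {
  // Ignore parameters such as "; charset=utf-8" and normalise case.
  const type = String(contentType || '').split(';')[0].trim().toLowerCase();

  if (type === 'application/msgpack') {
    if (typeof decodeMsgpack !== 'function') {
      throw new Error('msgpack webhook configured but no decoder supplied');
    }
    return decodeMsgpack(rawBody);
  }

  // Default dashboard setting: application/json.
  return JSON.parse(rawBody.toString('utf8'));
}
```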

No code behaviour changes — only doc/comment text. All three example
suites still pass: 24 Express, 17 Next.js, 19 FastAPI.

https://claude.ai/code/session_01NNTgQRJss1V7gyzzJ9rjnB
leggetter merged commit 0b7bf3f into main on May 15, 2026
6 checks passed
leggetter deleted the fix/scrapfly-screenshot-and-extraction branch on May 15, 2026 15:00

Development

Successfully merging this pull request may close these issues.

scrapfly-webhooks: handler example breaks for screenshot; extraction field path is wrong

2 participants