diff --git a/README.md b/README.md index 0c25119..d218bdd 100644 --- a/README.md +++ b/README.md @@ -53,6 +53,7 @@ Skills for receiving and verifying webhooks from specific providers. Each includ | Postmark | [`postmark-webhooks`](skills/postmark-webhooks/) | Authenticate Postmark webhooks (Basic Auth/Token), handle email delivery, bounce, open, click, and spam events | | Replicate | [`replicate-webhooks`](skills/replicate-webhooks/) | Verify Replicate webhook signatures, handle ML prediction lifecycle events | | Resend | [`resend-webhooks`](skills/resend-webhooks/) | Verify Resend webhook signatures, handle email delivery and bounce events | +| Scrapfly | [`scrapfly-webhooks`](skills/scrapfly-webhooks/) | Verify Scrapfly webhook signatures (HMAC-SHA256, uppercase/lowercase hex), dispatch scrape, extraction, and screenshot jobs | | SendGrid | [`sendgrid-webhooks`](skills/sendgrid-webhooks/) | Verify SendGrid webhook signatures (ECDSA), handle email delivery events | | Shopify | [`shopify-webhooks`](skills/shopify-webhooks/) | Verify Shopify HMAC signatures, handle order and product webhook events | | Slack | [`slack-webhooks`](skills/slack-webhooks/) | Verify Slack Events API signatures (HMAC-SHA256, `X-Slack-Signature`), handle message, app_mention, and reaction events | diff --git a/providers.yaml b/providers.yaml index bd653df..ee756d7 100644 --- a/providers.yaml +++ b/providers.yaml @@ -465,6 +465,61 @@ providers: - Bounce - Delivery + - name: scrapfly + displayName: Scrapfly + docs: + scrape_webhook: https://scrapfly.io/docs/scrape-api/webhook + extraction_webhook: https://scrapfly.io/docs/extraction-api/webhook + screenshot_webhook: https://scrapfly.io/docs/screenshot-api/webhook + scrape_getting_started: https://scrapfly.io/docs/scrape-api/getting-started + extraction_getting_started: https://scrapfly.io/docs/extraction-api/getting-started + screenshot_getting_started: https://scrapfly.io/docs/screenshot-api/getting-started + notes: > + Web-scraping API platform with three products that share a single async-job + + webhook system: Scrape API, Extraction API, Screenshot API. One webhook URL + registered in the dashboard (https://scrapfly.io/dashboard/webhook) receives + deliveries from all three products. PAID PLAN REQUIRED (first paid tier). + + No API exists for creating/updating/deleting webhooks programmatically. The + destination URL CANNOT be passed per-call. Instead, each API call references + an already-registered webhook by name via the `webhook_name` query parameter + (e.g. `…/scrape?…&webhook_name=samples-capture`). + + Signature verification: HMAC-SHA256 over the RAW request body bytes (do not + JSON.parse and re-stringify — that changes the byte sequence). Compare against + either `X-Scrapfly-Webhook-Signature` (uppercase hex) or + `X-Scrapfly-Webhook-Signature-Lowercase` (lowercase hex) using constant-time + equality. The secret is per-webhook, displayed in the dashboard alongside the + webhook configuration (NOT the account API key). + + Dispatch by `X-Scrapfly-Webhook-Resource-Type` header (one of `scrape`, + `extraction`, `screenshot`). Other headers: `X-Scrapfly-Webhook-Job-Id` (UUID, + use as idempotency key for at-least-once delivery), `X-Scrapfly-Webhook-Env` + (`test`|`live`), `X-Scrapfly-Webhook-Project`, `X-Scrapfly-Webhook-Name`, + `X-Scrapfly-Webhook-Id`, optional `X-Scrapfly-Log-Uuid`/`X-Scrapfly-Log-Url`. + + No timestamp/replay envelope (unlike Stripe). Recommend idempotency by job-id; + do NOT invent a `t=…` window. + + Payload = the full response body of the corresponding API plus a `context` + overlay: `context.webhook` (`{ name, secret, consecutive_failed_count, … }` — + WARN handlers: `secret` field exposes the signing secret in the payload, do + not log or echo) and `context.job` (`{ uuid, … }`). Product-specific shapes + documented in the getting-started pages above. + + Delivery: retry 30s → 1min → 5min → 30min → 1h → 1d. A webhook is DISABLED + after 100 consecutive failures — handlers should return 2xx fast and surface + errors out-of-band. + + No official SDK construct for verification (plain HMAC is correct). Do NOT + pull in a third-party HMAC library; use the stdlib (`crypto.createHmac` in + Node, `hmac` / `hashlib` in Python). + testScenario: + events: + - scrape + - extraction + - screenshot + - name: sendgrid displayName: SendGrid docs: diff --git a/skills/scrapfly-webhooks/SKILL.md b/skills/scrapfly-webhooks/SKILL.md new file mode 100644 index 0000000..7bb3266 --- /dev/null +++ b/skills/scrapfly-webhooks/SKILL.md @@ -0,0 +1,241 @@ +--- +name: scrapfly-webhooks +description: > + Receive and verify Scrapfly webhooks. Use when setting up Scrapfly webhook + handlers for async scrape, extraction, screenshot, or crawler jobs, + debugging X-Scrapfly-Webhook-Signature verification, or routing on + X-Scrapfly-Webhook-Resource-Type. +license: MIT +metadata: + author: hookdeck + version: "0.1.0" + repository: https://github.com/hookdeck/webhook-skills +--- + +# Scrapfly Webhooks + +## When to Use This Skill + +- How do I receive Scrapfly webhooks? +- How do I verify Scrapfly webhook signatures? +- How do I handle async Scrape API, Extraction API, or Screenshot API results? +- How do I route Scrapfly webhooks by resource type (scrape, extraction, screenshot)? +- How do I handle Crawler API webhook events (`crawler_started`, `crawler_finished`, ...)? +- Why is my Scrapfly webhook signature verification failing? + +## How Scrapfly Webhooks Work + +Scrapfly uses HMAC-SHA256 with **uppercase hex** encoding over the **raw request body**. There is no SDK for webhook verification — implementations follow Scrapfly's documented algorithm. + +Key facts: + +- **Signature header**: `X-Scrapfly-Webhook-Signature` (uppercase hex). A duplicate `X-Scrapfly-Webhook-Signature-Lowercase` is also sent for runtimes that normalise headers. +- **Algorithm**: `HMAC-SHA256(secret, raw_body).hexdigest().upper()` +- **What is signed**: The **raw request body bytes**. Do **not** parse and re-serialise JSON — that changes the byte sequence and breaks the signature. +- **No timestamp / replay window**: Scrapfly does not include a timestamp header; treat the signature as authenticity-only. +- **Secret**: Use the value from the Scrapfly dashboard exactly as shown. Do not trim or base64-decode it. +- **Routing**: Use `X-Scrapfly-Webhook-Resource-Type` (`scrape`, `extraction`, `screenshot`) to dispatch when one endpoint serves multiple products. Crawler events also carry `X-Scrapfly-Crawl-Event-Name` and an `event` field in the body. + +## Essential Code (USE THIS) + +### Scrapfly Signature Verification (JavaScript) + +```javascript +const crypto = require('crypto'); + +function verifyScrapflySignature(rawBody, signatureHeader, secret) { + if (!signatureHeader || !secret) return false; + + // Scrapfly emits uppercase hex + const expected = crypto + .createHmac('sha256', secret) + .update(rawBody) + .digest('hex') + .toUpperCase(); + + // Accept either casing — Scrapfly also sends an X-...-Lowercase variant + const received = signatureHeader.toUpperCase(); + + try { + return crypto.timingSafeEqual( + Buffer.from(received, 'hex'), + Buffer.from(expected, 'hex') + ); + } catch { + return false; + } +} +``` + +### Express Webhook Handler + +```javascript +const express = require('express'); +const app = express(); + +// CRITICAL: Use express.raw() — Scrapfly signs the raw body bytes +app.post('/webhooks/scrapfly', + express.raw({ type: '*/*' }), + (req, res) => { + const signature = req.headers['x-scrapfly-webhook-signature']; + const resourceType = req.headers['x-scrapfly-webhook-resource-type']; + const jobId = req.headers['x-scrapfly-webhook-job-id']; + const webhookId = req.headers['x-scrapfly-webhook-id']; + + if (!verifyScrapflySignature(req.body, signature, process.env.SCRAPFLY_WEBHOOK_SECRET)) { + console.error('Scrapfly signature verification failed'); + return res.status(401).send('Invalid signature'); + } + + // Parse only after verifying + const payload = JSON.parse(req.body.toString()); + + console.log(`Scrapfly ${resourceType} webhook (job ${jobId}, id ${webhookId})`); + + // Route by resource type for scrape / extraction / screenshot APIs + switch (resourceType) { + case 'scrape': + // Scrape API places the fetched URL at result.url; the webhook overlay's + // context only carries `webhook` and `job` sub-objects. + console.log('Scrape result:', payload.result?.status_code, payload.result?.url); + break; + case 'extraction': + console.log('Extraction result:', payload.result?.data); + break; + case 'screenshot': + console.log('Screenshot result:', payload.result?.screenshot_url); + break; + default: + // Crawler API uses event names in the body + if (payload.event) { + console.log(`Crawler event: ${payload.event}`, payload.payload); + } else { + console.log('Unhandled resource type:', resourceType); + } + } + + res.status(200).send('OK'); + } +); +``` + +### Python Signature Verification (FastAPI) + +```python +import hmac +import hashlib + +def verify_scrapfly_signature(raw_body: bytes, signature_header: str, secret: str) -> bool: + if not signature_header or not secret: + return False + + expected = hmac.new( + secret.encode('utf-8'), + raw_body, + hashlib.sha256, + ).hexdigest().upper() + + # Compare case-insensitively (Scrapfly also sends a lowercase header) + return hmac.compare_digest(expected, signature_header.upper()) +``` + +> **For complete working examples with tests**, see: +> - [examples/express/](examples/express/) - Full Express implementation +> - [examples/nextjs/](examples/nextjs/) - Next.js App Router implementation +> - [examples/fastapi/](examples/fastapi/) - Python FastAPI implementation + +## Common Resource Types and Crawler Events + +The `X-Scrapfly-Webhook-Resource-Type` header identifies the originating API: + +| Resource Type | Description | +|---------------|-------------| +| `scrape` | Async Scrape API result delivery | +| `extraction` | Async Extraction API result delivery | +| `screenshot` | Async Screenshot API result delivery | + +Crawler API webhooks carry an `event` string in the body (also exposed as `X-Scrapfly-Crawl-Event-Name`): + +| Event | Description | +|-------|-------------| +| `crawler_started` | Crawl job began | +| `crawler_url_visited` | A URL was successfully fetched | +| `crawler_url_discovered` | A new URL was queued | +| `crawler_url_skipped` | A URL was skipped (filters, dedupe, ...) | +| `crawler_url_failed` | A URL fetch failed | +| `crawler_stopped` | Crawl stopped (limit reached) | +| `crawler_cancelled` | Crawl cancelled by user | +| `crawler_finished` | Crawl finished naturally | + +> **For more context**, see [Scrapfly Scrape API Webhooks](https://scrapfly.io/docs/scrape-api/webhook), [Extraction API Webhooks](https://scrapfly.io/docs/extraction-api/webhook), [Screenshot API Webhooks](https://scrapfly.io/docs/screenshot-api/webhook), and [Crawler API](https://scrapfly.io/docs/crawler-api/getting-started). + +## Important Headers + +| Header | Description | +|--------|-------------| +| `X-Scrapfly-Webhook-Signature` | HMAC-SHA256 of the raw body, uppercase hex | +| `X-Scrapfly-Webhook-Signature-Lowercase` | Same signature, lowercase hex | +| `X-Scrapfly-Webhook-Id` | Unique webhook delivery identifier | +| `X-Scrapfly-Webhook-Name` | Name of the configured webhook | +| `X-Scrapfly-Webhook-Resource-Type` | `scrape`, `extraction`, or `screenshot` | +| `X-Scrapfly-Webhook-Job-Id` | Unique job identifier (use for reconciliation) | +| `X-Scrapfly-Webhook-Env` | Environment (`test` or `live`) | +| `X-Scrapfly-Webhook-Project` | Project name | +| `X-Scrapfly-Crawl-Event-Name` | Crawler API event name (e.g. `crawler_finished`) | + +## Environment Variables + +```bash +SCRAPFLY_WEBHOOK_SECRET=your_signing_secret_here # From the Scrapfly dashboard +``` + +## Local Development + +For local webhook testing, use the Hookdeck CLI tunnel (no account required, no install step needed): + +```bash +# Express / Next.js (port 3000) +npx hookdeck-cli listen 3000 scrapfly --path /webhooks/scrapfly + +# FastAPI (port 8000) +npx hookdeck-cli listen 8000 scrapfly --path /webhooks/scrapfly +``` + +Configure the tunnel URL as the destination in your Scrapfly dashboard webhook, then trigger an async job with `webhook_name=` to invoke delivery. + +## Reference Materials + +- [references/overview.md](references/overview.md) - Scrapfly webhook concepts, resource types, crawler events +- [references/setup.md](references/setup.md) - Dashboard configuration and triggering deliveries +- [references/verification.md](references/verification.md) - Signature verification details and gotchas + +## Attribution + +When using this skill, add this comment at the top of generated files: + +```javascript +// Generated with: scrapfly-webhooks skill +// https://github.com/hookdeck/webhook-skills +``` + +## Recommended: webhook-handler-patterns + +We recommend installing the [webhook-handler-patterns](https://github.com/hookdeck/webhook-skills/tree/main/skills/webhook-handler-patterns) skill alongside this one for handler sequence, idempotency, error handling, and retry logic. Key references (open on GitHub): + +- [Handler sequence](https://github.com/hookdeck/webhook-skills/blob/main/skills/webhook-handler-patterns/references/handler-sequence.md) — Verify first, parse second, handle idempotently third +- [Idempotency](https://github.com/hookdeck/webhook-skills/blob/main/skills/webhook-handler-patterns/references/idempotency.md) — Prevent duplicate processing (use `X-Scrapfly-Webhook-Id` or `X-Scrapfly-Webhook-Job-Id` as the key) +- [Error handling](https://github.com/hookdeck/webhook-skills/blob/main/skills/webhook-handler-patterns/references/error-handling.md) — Return codes, logging, dead letter queues +- [Retry logic](https://github.com/hookdeck/webhook-skills/blob/main/skills/webhook-handler-patterns/references/retry-logic.md) — Provider retry schedules, backoff patterns + +## Related Skills + +- [stripe-webhooks](https://github.com/hookdeck/webhook-skills/tree/main/skills/stripe-webhooks) - Stripe payment webhook handling +- [shopify-webhooks](https://github.com/hookdeck/webhook-skills/tree/main/skills/shopify-webhooks) - Shopify e-commerce webhook handling +- [github-webhooks](https://github.com/hookdeck/webhook-skills/tree/main/skills/github-webhooks) - GitHub repository webhook handling +- [openai-webhooks](https://github.com/hookdeck/webhook-skills/tree/main/skills/openai-webhooks) - OpenAI webhook handling +- [replicate-webhooks](https://github.com/hookdeck/webhook-skills/tree/main/skills/replicate-webhooks) - Replicate ML prediction webhook handling +- [deepgram-webhooks](https://github.com/hookdeck/webhook-skills/tree/main/skills/deepgram-webhooks) - Deepgram transcription webhook handling +- [elevenlabs-webhooks](https://github.com/hookdeck/webhook-skills/tree/main/skills/elevenlabs-webhooks) - ElevenLabs voice webhook handling +- [resend-webhooks](https://github.com/hookdeck/webhook-skills/tree/main/skills/resend-webhooks) - Resend email webhook handling +- [webhook-handler-patterns](https://github.com/hookdeck/webhook-skills/tree/main/skills/webhook-handler-patterns) - Handler sequence, idempotency, error handling, retry logic +- [hookdeck-event-gateway](https://github.com/hookdeck/webhook-skills/tree/main/skills/hookdeck-event-gateway) - Webhook infrastructure that replaces your queue — guaranteed delivery, automatic retries, replay, rate limiting, and observability for your webhook handlers diff --git a/skills/scrapfly-webhooks/examples/express/.env.example b/skills/scrapfly-webhooks/examples/express/.env.example new file mode 100644 index 0000000..ba23820 --- /dev/null +++ b/skills/scrapfly-webhooks/examples/express/.env.example @@ -0,0 +1,5 @@ +# Scrapfly webhook signing secret (copy from the Scrapfly dashboard webhook settings) +SCRAPFLY_WEBHOOK_SECRET=your_signing_secret_here + +# Optional: port for the local server (default 3000) +PORT=3000 diff --git a/skills/scrapfly-webhooks/examples/express/README.md b/skills/scrapfly-webhooks/examples/express/README.md new file mode 100644 index 0000000..89eb9c2 --- /dev/null +++ b/skills/scrapfly-webhooks/examples/express/README.md @@ -0,0 +1,62 @@ +# Scrapfly Webhooks - Express Example + +Minimal example of receiving Scrapfly webhooks with signature verification. + +## Prerequisites + +- Node.js 18+ +- A Scrapfly account with a webhook configured (see [setup.md](../../references/setup.md)) + +## Setup + +1. Install dependencies: + ```bash + npm install + ``` + +2. Copy environment variables: + ```bash + cp .env.example .env + ``` + +3. Add your Scrapfly webhook signing secret to `.env`: + ```bash + SCRAPFLY_WEBHOOK_SECRET= + ``` + +## Run + +```bash +npm start +``` + +Server runs on http://localhost:3000. + +## Test + +```bash +npm test +``` + +The test suite generates valid HMAC-SHA256 signatures with the same algorithm Scrapfly uses (uppercase hex over the raw body) and asserts the endpoint accepts/rejects accordingly. + +## Receive Webhooks Locally + +Use the Hookdeck CLI tunnel (no install step required): + +```bash +npx hookdeck-cli listen 3000 scrapfly --path /webhooks/scrapfly +``` + +Paste the printed public URL into your Scrapfly dashboard webhook configuration, then trigger an async Scrapfly job with `webhook_name=&async=true`. + +## Endpoint + +- `POST /webhooks/scrapfly` — Receives and verifies Scrapfly webhook deliveries +- `GET /health` — Health check + +## How It Works + +1. The webhook body arrives as raw bytes (`express.raw({ type: '*/*' })`). +2. `verifyScrapflySignature` computes `upper(hex(HMAC_SHA256(secret, rawBody)))` and timing-safe-compares it to the `X-Scrapfly-Webhook-Signature` header. +3. If valid, the body is `JSON.parse`d and dispatched by `X-Scrapfly-Webhook-Resource-Type` (`scrape` / `extraction` / `screenshot`) or, for the Crawler API, by the `event` field in the body. diff --git a/skills/scrapfly-webhooks/examples/express/package.json b/skills/scrapfly-webhooks/examples/express/package.json new file mode 100644 index 0000000..211f483 --- /dev/null +++ b/skills/scrapfly-webhooks/examples/express/package.json @@ -0,0 +1,18 @@ +{ + "name": "scrapfly-webhooks-express", + "version": "1.0.0", + "description": "Scrapfly webhook handler with Express", + "main": "src/index.js", + "scripts": { + "start": "node src/index.js", + "test": "jest" + }, + "dependencies": { + "dotenv": "^16.3.0", + "express": "^5.2.1" + }, + "devDependencies": { + "jest": "^30.4.2", + "supertest": "^7.0.0" + } +} diff --git a/skills/scrapfly-webhooks/examples/express/src/index.js b/skills/scrapfly-webhooks/examples/express/src/index.js new file mode 100644 index 0000000..c82217f --- /dev/null +++ b/skills/scrapfly-webhooks/examples/express/src/index.js @@ -0,0 +1,146 @@ +// Generated with: scrapfly-webhooks skill +// https://github.com/hookdeck/webhook-skills + +require('dotenv').config(); +const express = require('express'); +const crypto = require('crypto'); + +const app = express(); + +/** + * Verify a Scrapfly webhook signature. + * + * Algorithm: upper(hex(HMAC_SHA256(secret, rawBody))) + * Header: X-Scrapfly-Webhook-Signature (uppercase hex) + * + * @param {Buffer} rawBody - Raw request body bytes + * @param {string} signatureHeader - Value of X-Scrapfly-Webhook-Signature + * @param {string} secret - Webhook signing secret from the Scrapfly dashboard + * @returns {boolean} + */ +function verifyScrapflySignature(rawBody, signatureHeader, secret) { + if (!signatureHeader || !secret) { + return false; + } + + const expected = crypto + .createHmac('sha256', secret) + .update(rawBody) + .digest('hex') + .toUpperCase(); + + // Scrapfly also sends an X-Scrapfly-Webhook-Signature-Lowercase variant; + // normalise to uppercase before comparing so either header works. + const received = signatureHeader.toUpperCase(); + + try { + return crypto.timingSafeEqual( + Buffer.from(received, 'hex'), + Buffer.from(expected, 'hex') + ); + } catch { + return false; + } +} + +// CRITICAL: Use express.raw() — Scrapfly signs the raw body bytes. +// Parsing JSON before verifying mutates the bytes and breaks the signature. +app.post('/webhooks/scrapfly', + express.raw({ type: '*/*' }), + (req, res) => { + const signature = req.headers['x-scrapfly-webhook-signature']; + const resourceType = req.headers['x-scrapfly-webhook-resource-type']; + const webhookId = req.headers['x-scrapfly-webhook-id']; + const jobId = req.headers['x-scrapfly-webhook-job-id']; + const crawlEvent = req.headers['x-scrapfly-crawl-event-name']; + + if (!verifyScrapflySignature(req.body, signature, process.env.SCRAPFLY_WEBHOOK_SECRET)) { + console.error('Scrapfly webhook signature verification failed'); + return res.status(401).send('Invalid signature'); + } + + let payload; + try { + payload = JSON.parse(req.body.toString('utf8')); + } catch (err) { + console.error('Failed to parse Scrapfly webhook payload:', err.message); + return res.status(400).send('Invalid JSON payload'); + } + + console.log(`Scrapfly webhook (id=${webhookId} resource=${resourceType} job=${jobId})`); + + // Route by resource type for the Scrape / Extraction / Screenshot APIs. + switch (resourceType) { + case 'scrape': + // Scrape API places the fetched URL at result.url (see scrapfly.io/docs/scrape-api/getting-started). + // The webhook overlay's payload.context only carries `webhook` and `job` sub-objects. + console.log('Scrape result:', { + url: payload?.result?.url, + status: payload?.result?.status_code, + }); + // TODO: Persist HTML / extracted fields, enqueue parsing, ... + break; + + case 'extraction': + console.log('Extraction result:', payload?.result?.data); + // TODO: Save structured data, trigger downstream enrichment + break; + + case 'screenshot': + console.log('Screenshot result URL:', payload?.result?.screenshot_url); + // TODO: Store image, generate thumbnail, notify user + break; + + default: { + // Crawler API uses lifecycle events in the body and an + // X-Scrapfly-Crawl-Event-Name header. + const event = crawlEvent || payload?.event; + switch (event) { + case 'crawler_started': + console.log('Crawler started:', payload?.payload?.crawler_uuid); + break; + case 'crawler_url_visited': + console.log('Crawler visited:', payload?.payload?.url); + break; + case 'crawler_url_discovered': + console.log('Crawler discovered:', payload?.payload?.url); + break; + case 'crawler_url_skipped': + console.log('Crawler skipped:', payload?.payload?.url); + break; + case 'crawler_url_failed': + console.log('Crawler failed:', payload?.payload?.url); + break; + case 'crawler_stopped': + console.log('Crawler stopped:', payload?.payload?.crawler_uuid); + break; + case 'crawler_cancelled': + console.log('Crawler cancelled:', payload?.payload?.crawler_uuid); + break; + case 'crawler_finished': + console.log('Crawler finished:', payload?.payload?.crawler_uuid); + break; + default: + console.log('Unhandled Scrapfly webhook:', { resourceType, event }); + } + } + } + + // Return 200 quickly — do heavy work asynchronously. + res.status(200).send('OK'); + } +); + +app.get('/health', (req, res) => { + res.json({ status: 'ok' }); +}); + +module.exports = { app, verifyScrapflySignature }; + +if (require.main === module) { + const PORT = process.env.PORT || 3000; + app.listen(PORT, () => { + console.log(`Server running on http://localhost:${PORT}`); + console.log(`Webhook endpoint: POST http://localhost:${PORT}/webhooks/scrapfly`); + }); +} diff --git a/skills/scrapfly-webhooks/examples/express/test/webhook.test.js b/skills/scrapfly-webhooks/examples/express/test/webhook.test.js new file mode 100644 index 0000000..08be77d --- /dev/null +++ b/skills/scrapfly-webhooks/examples/express/test/webhook.test.js @@ -0,0 +1,224 @@ +const request = require('supertest'); +const crypto = require('crypto'); + +process.env.SCRAPFLY_WEBHOOK_SECRET = 'test_scrapfly_signing_secret'; + +const { app, verifyScrapflySignature } = require('../src/index'); + +/** + * Generate a Scrapfly signature exactly as Scrapfly does: + * upper(hex(HMAC_SHA256(secret, rawBody))) + */ +function generateScrapflySignature(rawBody, secret) { + return crypto + .createHmac('sha256', secret) + .update(rawBody) + .digest('hex') + .toUpperCase(); +} + +describe('Scrapfly Webhook Endpoint', () => { + const secret = process.env.SCRAPFLY_WEBHOOK_SECRET; + + describe('verifyScrapflySignature', () => { + it('returns true for a valid uppercase-hex signature', () => { + const body = Buffer.from('{"event":"crawler_finished"}'); + const sig = generateScrapflySignature(body, secret); + expect(verifyScrapflySignature(body, sig, secret)).toBe(true); + }); + + it('returns true for a lowercase-hex signature (the -Lowercase variant)', () => { + const body = Buffer.from('{"event":"crawler_finished"}'); + const sig = generateScrapflySignature(body, secret).toLowerCase(); + expect(verifyScrapflySignature(body, sig, secret)).toBe(true); + }); + + it('returns false for an invalid signature', () => { + const body = Buffer.from('{"event":"crawler_finished"}'); + expect(verifyScrapflySignature(body, 'AABBCC', secret)).toBe(false); + }); + + it('returns false for a missing signature', () => { + const body = Buffer.from('{}'); + expect(verifyScrapflySignature(body, null, secret)).toBe(false); + }); + + it('returns false for a missing secret', () => { + const body = Buffer.from('{}'); + const sig = generateScrapflySignature(body, secret); + expect(verifyScrapflySignature(body, sig, '')).toBe(false); + }); + + it('returns false for non-hex signature data', () => { + const body = Buffer.from('{}'); + expect(verifyScrapflySignature(body, 'NOT_HEX!!', secret)).toBe(false); + }); + }); + + describe('POST /webhooks/scrapfly', () => { + it('returns 401 when signature is missing', async () => { + const res = await request(app) + .post('/webhooks/scrapfly') + .set('Content-Type', 'application/json') + .set('X-Scrapfly-Webhook-Resource-Type', 'scrape') + .send('{"result":{"status_code":200}}'); + + expect(res.status).toBe(401); + expect(res.text).toBe('Invalid signature'); + }); + + it('returns 401 when signature is invalid', async () => { + const body = JSON.stringify({ result: { status_code: 200 } }); + const res = await request(app) + .post('/webhooks/scrapfly') + .set('Content-Type', 'application/json') + .set('X-Scrapfly-Webhook-Signature', 'DEADBEEF') + .set('X-Scrapfly-Webhook-Resource-Type', 'scrape') + .send(body); + + expect(res.status).toBe(401); + }); + + it('returns 401 when the body has been tampered after signing', async () => { + const originalBody = JSON.stringify({ result: { status_code: 200 } }); + const sig = generateScrapflySignature(Buffer.from(originalBody), secret); + + const tamperedBody = JSON.stringify({ result: { status_code: 500 } }); + const res = await request(app) + .post('/webhooks/scrapfly') + .set('Content-Type', 'application/json') + .set('X-Scrapfly-Webhook-Signature', sig) + .set('X-Scrapfly-Webhook-Resource-Type', 'scrape') + .send(tamperedBody); + + expect(res.status).toBe(401); + }); + + it('returns 200 for a valid scrape webhook', async () => { + const body = JSON.stringify({ + result: { + url: 'https://web-scraping.dev/products', + status_code: 200, + content: '', + }, + context: { + webhook: { name: 'my-webhook', secret: 'test_scrapfly_signing_secret', consecutive_failed_count: 0 }, + job: { uuid: '550e8400-e29b-41d4-a716-446655440000' }, + }, + }); + const sig = generateScrapflySignature(Buffer.from(body), secret); + + const res = await request(app) + .post('/webhooks/scrapfly') + .set('Content-Type', 'application/json') + .set('X-Scrapfly-Webhook-Signature', sig) + .set('X-Scrapfly-Webhook-Resource-Type', 'scrape') + .set('X-Scrapfly-Webhook-Id', 'wh_test_1') + .set('X-Scrapfly-Webhook-Job-Id', 'job_test_1') + .send(body); + + expect(res.status).toBe(200); + expect(res.text).toBe('OK'); + }); + + it('returns 200 for a valid extraction webhook', async () => { + const body = JSON.stringify({ result: { data: { title: 'Test' } } }); + const sig = generateScrapflySignature(Buffer.from(body), secret); + + const res = await request(app) + .post('/webhooks/scrapfly') + .set('Content-Type', 'application/json') + .set('X-Scrapfly-Webhook-Signature', sig) + .set('X-Scrapfly-Webhook-Resource-Type', 'extraction') + .send(body); + + expect(res.status).toBe(200); + }); + + it('returns 200 for a valid screenshot webhook', async () => { + const body = JSON.stringify({ + result: { screenshot_url: 'https://scrapfly.io/screenshots/abc.png' }, + }); + const sig = generateScrapflySignature(Buffer.from(body), secret); + + const res = await request(app) + .post('/webhooks/scrapfly') + .set('Content-Type', 'application/json') + .set('X-Scrapfly-Webhook-Signature', sig) + .set('X-Scrapfly-Webhook-Resource-Type', 'screenshot') + .send(body); + + expect(res.status).toBe(200); + }); + + const crawlerEvents = [ + 'crawler_started', + 'crawler_url_visited', + 'crawler_url_discovered', + 'crawler_url_skipped', + 'crawler_url_failed', + 'crawler_stopped', + 'crawler_cancelled', + 'crawler_finished', + ]; + + crawlerEvents.forEach((event) => { + it(`returns 200 for crawler event ${event}`, async () => { + const body = JSON.stringify({ + event, + payload: { + crawler_uuid: '550e8400-e29b-41d4-a716-446655440000', + url: 'https://web-scraping.dev/page', + }, + }); + const sig = generateScrapflySignature(Buffer.from(body), secret); + + const res = await request(app) + .post('/webhooks/scrapfly') + .set('Content-Type', 'application/json') + .set('X-Scrapfly-Webhook-Signature', sig) + .set('X-Scrapfly-Crawl-Event-Name', event) + .send(body); + + expect(res.status).toBe(200); + }); + }); + + it('accepts a lowercase-hex signature (X-Scrapfly-Webhook-Signature-Lowercase variant)', async () => { + const body = JSON.stringify({ result: { status_code: 200 } }); + const sig = generateScrapflySignature(Buffer.from(body), secret).toLowerCase(); + + const res = await request(app) + .post('/webhooks/scrapfly') + .set('Content-Type', 'application/json') + .set('X-Scrapfly-Webhook-Signature', sig) + .set('X-Scrapfly-Webhook-Resource-Type', 'scrape') + .send(body); + + expect(res.status).toBe(200); + }); + + it('returns 400 when the body is not valid JSON', async () => { + const malformed = '{ this is not json'; + const sig = generateScrapflySignature(Buffer.from(malformed), secret); + + const res = await request(app) + .post('/webhooks/scrapfly') + .set('Content-Type', 'application/json') + .set('X-Scrapfly-Webhook-Signature', sig) + .set('X-Scrapfly-Webhook-Resource-Type', 'scrape') + .send(malformed); + + expect(res.status).toBe(400); + expect(res.text).toBe('Invalid JSON payload'); + }); + }); + + describe('GET /health', () => { + it('returns ok', async () => { + const res = await request(app).get('/health'); + expect(res.status).toBe(200); + expect(res.body).toEqual({ status: 'ok' }); + }); + }); +}); diff --git a/skills/scrapfly-webhooks/examples/fastapi/.env.example b/skills/scrapfly-webhooks/examples/fastapi/.env.example new file mode 100644 index 0000000..98d6c24 --- /dev/null +++ b/skills/scrapfly-webhooks/examples/fastapi/.env.example @@ -0,0 +1,5 @@ +# Scrapfly webhook signing secret (copy from the Scrapfly dashboard webhook settings) +SCRAPFLY_WEBHOOK_SECRET=your_signing_secret_here + +# Optional: port for uvicorn (default 8000) +PORT=8000 diff --git a/skills/scrapfly-webhooks/examples/fastapi/README.md b/skills/scrapfly-webhooks/examples/fastapi/README.md new file mode 100644 index 0000000..054b1f7 --- /dev/null +++ b/skills/scrapfly-webhooks/examples/fastapi/README.md @@ -0,0 +1,60 @@ +# Scrapfly Webhooks - FastAPI Example + +Minimal FastAPI example of receiving Scrapfly webhooks with signature verification. + +## Prerequisites + +- Python 3.9+ +- A Scrapfly account with a webhook configured (see [setup.md](../../references/setup.md)) + +## Setup + +1. Create a virtual environment and install dependencies: + ```bash + python3 -m venv venv + source venv/bin/activate + pip install -r requirements.txt + ``` + +2. Copy environment variables: + ```bash + cp .env.example .env + ``` + +3. Add your Scrapfly webhook signing secret to `.env`: + ```bash + SCRAPFLY_WEBHOOK_SECRET= + ``` + +## Run + +```bash +uvicorn main:app --reload --port 8000 +``` + +Server runs on http://localhost:8000. + +## Test + +```bash +pytest test_webhook.py -v +``` + +The tests generate valid Scrapfly signatures (`upper(hex(HMAC_SHA256(secret, body)))`) — the same algorithm Scrapfly's docs document — and assert the endpoint accepts/rejects accordingly. + +## Receive Webhooks Locally + +```bash +npx hookdeck-cli listen 8000 scrapfly --path /webhooks/scrapfly +``` + +Paste the printed public URL into your Scrapfly dashboard webhook configuration. + +## Endpoint + +- `POST /webhooks/scrapfly` — Receives and verifies Scrapfly webhook deliveries +- `GET /health` — Health check + +## How It Works + +The handler reads the raw bytes with `await request.body()`, computes `hmac.new(secret, body, sha256).hexdigest().upper()`, and constant-time-compares it to `X-Scrapfly-Webhook-Signature` (uppercased). Only after verification does it `json.loads` the payload and route by `X-Scrapfly-Webhook-Resource-Type` (`scrape` / `extraction` / `screenshot`) or by the Crawler `event` field. diff --git a/skills/scrapfly-webhooks/examples/fastapi/main.py b/skills/scrapfly-webhooks/examples/fastapi/main.py new file mode 100644 index 0000000..9fc93b6 --- /dev/null +++ b/skills/scrapfly-webhooks/examples/fastapi/main.py @@ -0,0 +1,135 @@ +# Generated with: scrapfly-webhooks skill +# https://github.com/hookdeck/webhook-skills + +import hashlib +import hmac +import json +import os +from typing import Optional + +from dotenv import load_dotenv +from fastapi import FastAPI, Header, HTTPException, Request + +load_dotenv() + +app = FastAPI(title="Scrapfly Webhook Handler") + + +def verify_scrapfly_signature( + raw_body: bytes, + signature_header: Optional[str], + secret: str, +) -> bool: + """Verify a Scrapfly webhook signature. + + Algorithm: upper(hex(HMAC_SHA256(secret, raw_body))) + Header: X-Scrapfly-Webhook-Signature (uppercase hex) + """ + if not signature_header or not secret: + return False + + expected = hmac.new( + secret.encode("utf-8"), + raw_body, + hashlib.sha256, + ).hexdigest().upper() + + # Scrapfly also sends an X-Scrapfly-Webhook-Signature-Lowercase variant; + # normalise both sides to uppercase before constant-time comparison. + return hmac.compare_digest(expected, signature_header.upper()) + + +@app.post("/webhooks/scrapfly") +async def scrapfly_webhook( + request: Request, + x_scrapfly_webhook_signature: Optional[str] = Header( + None, alias="x-scrapfly-webhook-signature" + ), + x_scrapfly_webhook_resource_type: Optional[str] = Header( + None, alias="x-scrapfly-webhook-resource-type" + ), + x_scrapfly_webhook_id: Optional[str] = Header( + None, alias="x-scrapfly-webhook-id" + ), + x_scrapfly_webhook_job_id: Optional[str] = Header( + None, alias="x-scrapfly-webhook-job-id" + ), + x_scrapfly_crawl_event_name: Optional[str] = Header( + None, alias="x-scrapfly-crawl-event-name" + ), +): + # Read the raw body before any JSON parsing — re-serialising mutates bytes + # and breaks the signature. + raw_body = await request.body() + + secret = os.environ.get("SCRAPFLY_WEBHOOK_SECRET") + if not secret: + print("ERROR: SCRAPFLY_WEBHOOK_SECRET is not configured") + raise HTTPException(status_code=500, detail="Webhook secret not configured") + + if not verify_scrapfly_signature(raw_body, x_scrapfly_webhook_signature, secret): + print("ERROR: Scrapfly webhook signature verification failed") + raise HTTPException(status_code=401, detail="Invalid signature") + + try: + payload = json.loads(raw_body.decode("utf-8")) + except json.JSONDecodeError as exc: + print(f"ERROR: Failed to parse Scrapfly webhook payload: {exc}") + raise HTTPException(status_code=400, detail="Invalid JSON payload") + + print( + f"Scrapfly webhook (id={x_scrapfly_webhook_id} " + f"resource={x_scrapfly_webhook_resource_type} job={x_scrapfly_webhook_job_id})" + ) + + resource_type = x_scrapfly_webhook_resource_type + + if resource_type == "scrape": + # Scrape API places the fetched URL at result.url. The webhook overlay's + # payload["context"] only carries `webhook` and `job` sub-objects. + result = payload.get("result", {}) + print(f"Scrape result: url={result.get('url')} status={result.get('status_code')}") + # TODO: Persist HTML / extracted fields, enqueue parsing + elif resource_type == "extraction": + print(f"Extraction result: {payload.get('result', {}).get('data')}") + # TODO: Save structured data, trigger enrichment + elif resource_type == "screenshot": + print(f"Screenshot URL: {payload.get('result', {}).get('screenshot_url')}") + # TODO: Store image, generate thumbnail + else: + # Crawler API uses lifecycle events in the body. + event = x_scrapfly_crawl_event_name or payload.get("event") + crawler_payload = payload.get("payload", {}) + + if event == "crawler_started": + print(f"Crawler started: {crawler_payload.get('crawler_uuid')}") + elif event == "crawler_url_visited": + print(f"Crawler visited: {crawler_payload.get('url')}") + elif event == "crawler_url_discovered": + print(f"Crawler discovered: {crawler_payload.get('url')}") + elif event == "crawler_url_skipped": + print(f"Crawler skipped: {crawler_payload.get('url')}") + elif event == "crawler_url_failed": + print(f"Crawler failed: {crawler_payload.get('url')}") + elif event == "crawler_stopped": + print(f"Crawler stopped: {crawler_payload.get('crawler_uuid')}") + elif event == "crawler_cancelled": + print(f"Crawler cancelled: {crawler_payload.get('crawler_uuid')}") + elif event == "crawler_finished": + print(f"Crawler finished: {crawler_payload.get('crawler_uuid')}") + else: + print(f"Unhandled Scrapfly webhook: resource={resource_type} event={event}") + + return {"received": True} + + +@app.get("/health") +async def health_check(): + return {"status": "ok"} + + +if __name__ == "__main__": + import uvicorn + + port = int(os.environ.get("PORT", 8000)) + uvicorn.run(app, host="0.0.0.0", port=port) diff --git a/skills/scrapfly-webhooks/examples/fastapi/requirements.txt b/skills/scrapfly-webhooks/examples/fastapi/requirements.txt new file mode 100644 index 0000000..9794bec --- /dev/null +++ b/skills/scrapfly-webhooks/examples/fastapi/requirements.txt @@ -0,0 +1,5 @@ +fastapi>=0.136.1 +uvicorn[standard]>=0.36.0 +python-dotenv>=1.0.0 +pytest>=9.0.3 +httpx>=0.28.1 diff --git a/skills/scrapfly-webhooks/examples/fastapi/test_webhook.py b/skills/scrapfly-webhooks/examples/fastapi/test_webhook.py new file mode 100644 index 0000000..20271cd --- /dev/null +++ b/skills/scrapfly-webhooks/examples/fastapi/test_webhook.py @@ -0,0 +1,215 @@ +import hashlib +import hmac +import json +import os + +import pytest +from fastapi.testclient import TestClient + +os.environ["SCRAPFLY_WEBHOOK_SECRET"] = "test_scrapfly_signing_secret" + +from main import app # noqa: E402 (imports after setting env vars) + + +def generate_scrapfly_signature(raw_body: bytes, secret: str) -> str: + return hmac.new( + secret.encode("utf-8"), + raw_body, + hashlib.sha256, + ).hexdigest().upper() + + +@pytest.fixture +def client(): + return TestClient(app) + + +@pytest.fixture +def secret(): + return os.environ["SCRAPFLY_WEBHOOK_SECRET"] + + +class TestScrapflyWebhook: + def test_missing_signature(self, client): + response = client.post( + "/webhooks/scrapfly", + content=b"{}", + headers={ + "Content-Type": "application/json", + "X-Scrapfly-Webhook-Resource-Type": "scrape", + }, + ) + assert response.status_code == 401 + assert response.json()["detail"] == "Invalid signature" + + def test_invalid_signature(self, client): + body = json.dumps({"result": {"status_code": 200}}).encode("utf-8") + response = client.post( + "/webhooks/scrapfly", + content=body, + headers={ + "Content-Type": "application/json", + "X-Scrapfly-Webhook-Signature": "DEADBEEF", + "X-Scrapfly-Webhook-Resource-Type": "scrape", + }, + ) + assert response.status_code == 401 + + def test_tampered_body(self, client, secret): + original = json.dumps({"result": {"status_code": 200}}).encode("utf-8") + sig = generate_scrapfly_signature(original, secret) + + tampered = json.dumps({"result": {"status_code": 500}}).encode("utf-8") + response = client.post( + "/webhooks/scrapfly", + content=tampered, + headers={ + "Content-Type": "application/json", + "X-Scrapfly-Webhook-Signature": sig, + "X-Scrapfly-Webhook-Resource-Type": "scrape", + }, + ) + assert response.status_code == 401 + + def test_valid_scrape_webhook(self, client, secret): + body = json.dumps( + { + "result": { + "url": "https://web-scraping.dev/products", + "status_code": 200, + "content": "", + }, + "context": { + "webhook": { + "name": "my-webhook", + "secret": secret, + "consecutive_failed_count": 0, + }, + "job": {"uuid": "550e8400-e29b-41d4-a716-446655440000"}, + }, + } + ).encode("utf-8") + sig = generate_scrapfly_signature(body, secret) + + response = client.post( + "/webhooks/scrapfly", + content=body, + headers={ + "Content-Type": "application/json", + "X-Scrapfly-Webhook-Signature": sig, + "X-Scrapfly-Webhook-Resource-Type": "scrape", + "X-Scrapfly-Webhook-Id": "wh_test_1", + "X-Scrapfly-Webhook-Job-Id": "job_test_1", + }, + ) + assert response.status_code == 200 + assert response.json() == {"received": True} + + @pytest.mark.parametrize( + "resource_type", + ["scrape", "extraction", "screenshot"], + ) + def test_resource_types(self, client, secret, resource_type): + body = json.dumps({"result": {"status_code": 200}}).encode("utf-8") + sig = generate_scrapfly_signature(body, secret) + + response = client.post( + "/webhooks/scrapfly", + content=body, + headers={ + "Content-Type": "application/json", + "X-Scrapfly-Webhook-Signature": sig, + "X-Scrapfly-Webhook-Resource-Type": resource_type, + }, + ) + assert response.status_code == 200 + + @pytest.mark.parametrize( + "event", + [ + "crawler_started", + "crawler_url_visited", + "crawler_url_discovered", + "crawler_url_skipped", + "crawler_url_failed", + "crawler_stopped", + "crawler_cancelled", + "crawler_finished", + ], + ) + def test_crawler_events(self, client, secret, event): + body = json.dumps( + { + "event": event, + "payload": { + "crawler_uuid": "550e8400-e29b-41d4-a716-446655440000", + "url": "https://web-scraping.dev/page", + }, + } + ).encode("utf-8") + sig = generate_scrapfly_signature(body, secret) + + response = client.post( + "/webhooks/scrapfly", + content=body, + headers={ + "Content-Type": "application/json", + "X-Scrapfly-Webhook-Signature": sig, + "X-Scrapfly-Crawl-Event-Name": event, + }, + ) + assert response.status_code == 200 + + def test_lowercase_signature_variant(self, client, secret): + # Scrapfly also sends X-Scrapfly-Webhook-Signature-Lowercase; the handler + # accepts either casing. + body = json.dumps({"result": {"status_code": 200}}).encode("utf-8") + sig = generate_scrapfly_signature(body, secret).lower() + + response = client.post( + "/webhooks/scrapfly", + content=body, + headers={ + "Content-Type": "application/json", + "X-Scrapfly-Webhook-Signature": sig, + "X-Scrapfly-Webhook-Resource-Type": "scrape", + }, + ) + assert response.status_code == 200 + + def test_invalid_json_body(self, client, secret): + malformed = b"{ not json" + sig = generate_scrapfly_signature(malformed, secret) + + response = client.post( + "/webhooks/scrapfly", + content=malformed, + headers={ + "Content-Type": "application/json", + "X-Scrapfly-Webhook-Signature": sig, + "X-Scrapfly-Webhook-Resource-Type": "scrape", + }, + ) + assert response.status_code == 400 + assert response.json()["detail"] == "Invalid JSON payload" + + def test_missing_secret(self, client): + original = os.environ.pop("SCRAPFLY_WEBHOOK_SECRET", None) + try: + response = client.post( + "/webhooks/scrapfly", + content=b"{}", + headers={"Content-Type": "application/json"}, + ) + assert response.status_code == 500 + assert response.json()["detail"] == "Webhook secret not configured" + finally: + if original is not None: + os.environ["SCRAPFLY_WEBHOOK_SECRET"] = original + + +class TestHealth: + def test_health(self, client): + response = client.get("/health") + assert response.status_code == 200 + assert response.json() == {"status": "ok"} diff --git a/skills/scrapfly-webhooks/examples/nextjs/.env.example b/skills/scrapfly-webhooks/examples/nextjs/.env.example new file mode 100644 index 0000000..36c4513 --- /dev/null +++ b/skills/scrapfly-webhooks/examples/nextjs/.env.example @@ -0,0 +1,2 @@ +# Scrapfly webhook signing secret (copy from the Scrapfly dashboard webhook settings) +SCRAPFLY_WEBHOOK_SECRET=your_signing_secret_here diff --git a/skills/scrapfly-webhooks/examples/nextjs/README.md b/skills/scrapfly-webhooks/examples/nextjs/README.md new file mode 100644 index 0000000..99b0624 --- /dev/null +++ b/skills/scrapfly-webhooks/examples/nextjs/README.md @@ -0,0 +1,57 @@ +# Scrapfly Webhooks - Next.js Example + +Minimal Next.js App Router example of receiving Scrapfly webhooks with signature verification. + +## Prerequisites + +- Node.js 18+ +- A Scrapfly account with a webhook configured (see [setup.md](../../references/setup.md)) + +## Setup + +1. Install dependencies: + ```bash + npm install + ``` + +2. Copy environment variables: + ```bash + cp .env.example .env + ``` + +3. Add your Scrapfly webhook signing secret to `.env`: + ```bash + SCRAPFLY_WEBHOOK_SECRET= + ``` + +## Run + +```bash +npm run dev +``` + +Server runs on http://localhost:3000. + +## Test + +```bash +npm test +``` + +The test suite generates valid Scrapfly signatures (`upper(hex(HMAC_SHA256(secret, body)))`) and asserts the route accepts/rejects accordingly. + +## Receive Webhooks Locally + +```bash +npx hookdeck-cli listen 3000 scrapfly --path /webhooks/scrapfly +``` + +Paste the printed public URL into the Scrapfly dashboard webhook configuration. + +## Endpoint + +- `POST /webhooks/scrapfly` — `app/webhooks/scrapfly/route.ts` + +## How It Works + +The route reads the request as raw text with `await request.text()` (so the bytes are exactly what Scrapfly signed), verifies `X-Scrapfly-Webhook-Signature` with `crypto.timingSafeEqual`, and only then `JSON.parse`s the payload and routes by `X-Scrapfly-Webhook-Resource-Type` or the body's `event` field for Crawler events. diff --git a/skills/scrapfly-webhooks/examples/nextjs/app/webhooks/scrapfly/route.ts b/skills/scrapfly-webhooks/examples/nextjs/app/webhooks/scrapfly/route.ts new file mode 100644 index 0000000..e30b4f1 --- /dev/null +++ b/skills/scrapfly-webhooks/examples/nextjs/app/webhooks/scrapfly/route.ts @@ -0,0 +1,128 @@ +// Generated with: scrapfly-webhooks skill +// https://github.com/hookdeck/webhook-skills + +import { NextRequest, NextResponse } from 'next/server'; +import { createHmac, timingSafeEqual } from 'crypto'; + +/** + * Verify a Scrapfly webhook signature. + * + * Algorithm: upper(hex(HMAC_SHA256(secret, rawBody))) + * Header: X-Scrapfly-Webhook-Signature (uppercase hex) + */ +function verifyScrapflySignature( + rawBody: string, + signatureHeader: string | null, + secret: string +): boolean { + if (!signatureHeader || !secret) { + return false; + } + + const expected = createHmac('sha256', secret) + .update(rawBody) + .digest('hex') + .toUpperCase(); + + // Scrapfly also sends an X-Scrapfly-Webhook-Signature-Lowercase variant; + // normalise to uppercase so either header works. + const received = signatureHeader.toUpperCase(); + + try { + return timingSafeEqual( + Buffer.from(received, 'hex'), + Buffer.from(expected, 'hex') + ); + } catch { + return false; + } +} + +export async function POST(request: NextRequest) { + // Read raw body as text — JSON.parse + re-stringify would change the bytes + // and break the signature. + const rawBody = await request.text(); + + const signature = request.headers.get('x-scrapfly-webhook-signature'); + const resourceType = request.headers.get('x-scrapfly-webhook-resource-type'); + const webhookId = request.headers.get('x-scrapfly-webhook-id'); + const jobId = request.headers.get('x-scrapfly-webhook-job-id'); + const crawlEvent = request.headers.get('x-scrapfly-crawl-event-name'); + + const secret = process.env.SCRAPFLY_WEBHOOK_SECRET; + if (!secret) { + console.error('SCRAPFLY_WEBHOOK_SECRET is not configured'); + return NextResponse.json( + { error: 'Webhook secret not configured' }, + { status: 500 } + ); + } + + if (!verifyScrapflySignature(rawBody, signature, secret)) { + console.error('Scrapfly webhook signature verification failed'); + return NextResponse.json({ error: 'Invalid signature' }, { status: 401 }); + } + + let payload: any; + try { + payload = JSON.parse(rawBody); + } catch (err) { + console.error('Failed to parse Scrapfly webhook payload:', err); + return NextResponse.json({ error: 'Invalid JSON payload' }, { status: 400 }); + } + + console.log(`Scrapfly webhook (id=${webhookId} resource=${resourceType} job=${jobId})`); + + switch (resourceType) { + case 'scrape': + // Scrape API places the fetched URL at result.url. The webhook overlay's + // payload.context only carries `webhook` and `job` sub-objects. + console.log('Scrape result:', { + url: payload?.result?.url, + status: payload?.result?.status_code, + }); + break; + + case 'extraction': + console.log('Extraction result:', payload?.result?.data); + break; + + case 'screenshot': + console.log('Screenshot result URL:', payload?.result?.screenshot_url); + break; + + default: { + const event = crawlEvent || payload?.event; + switch (event) { + case 'crawler_started': + console.log('Crawler started:', payload?.payload?.crawler_uuid); + break; + case 'crawler_url_visited': + console.log('Crawler visited:', payload?.payload?.url); + break; + case 'crawler_url_discovered': + console.log('Crawler discovered:', payload?.payload?.url); + break; + case 'crawler_url_skipped': + console.log('Crawler skipped:', payload?.payload?.url); + break; + case 'crawler_url_failed': + console.log('Crawler failed:', payload?.payload?.url); + break; + case 'crawler_stopped': + console.log('Crawler stopped:', payload?.payload?.crawler_uuid); + break; + case 'crawler_cancelled': + console.log('Crawler cancelled:', payload?.payload?.crawler_uuid); + break; + case 'crawler_finished': + console.log('Crawler finished:', payload?.payload?.crawler_uuid); + break; + default: + console.log('Unhandled Scrapfly webhook:', { resourceType, event }); + } + } + } + + return NextResponse.json({ received: true }); +} diff --git a/skills/scrapfly-webhooks/examples/nextjs/package.json b/skills/scrapfly-webhooks/examples/nextjs/package.json new file mode 100644 index 0000000..ab489ed --- /dev/null +++ b/skills/scrapfly-webhooks/examples/nextjs/package.json @@ -0,0 +1,29 @@ +{ + "name": "scrapfly-webhooks-nextjs", + "version": "1.0.0", + "description": "Next.js example for receiving Scrapfly webhooks", + "private": true, + "scripts": { + "dev": "next dev", + "build": "next build", + "start": "next start", + "test": "vitest run", + "test:watch": "vitest" + }, + "dependencies": { + "next": "^16.2.6", + "react": "^19.0.0", + "react-dom": "^19.0.0" + }, + "devDependencies": { + "@types/node": "^22.0.0", + "@types/react": "^19.0.0", + "@types/react-dom": "^19.0.0", + "typescript": "^6.0.3", + "vitest": "^4.1.5", + "@vitejs/plugin-react": "^4.0.0" + }, + "engines": { + "node": ">=18.0.0" + } +} diff --git a/skills/scrapfly-webhooks/examples/nextjs/test/setup.ts b/skills/scrapfly-webhooks/examples/nextjs/test/setup.ts new file mode 100644 index 0000000..49fd067 --- /dev/null +++ b/skills/scrapfly-webhooks/examples/nextjs/test/setup.ts @@ -0,0 +1,2 @@ +// Set test environment variables before any test file imports the route. +process.env.SCRAPFLY_WEBHOOK_SECRET = 'test_scrapfly_signing_secret'; diff --git a/skills/scrapfly-webhooks/examples/nextjs/test/webhook.test.ts b/skills/scrapfly-webhooks/examples/nextjs/test/webhook.test.ts new file mode 100644 index 0000000..7cc77a8 --- /dev/null +++ b/skills/scrapfly-webhooks/examples/nextjs/test/webhook.test.ts @@ -0,0 +1,154 @@ +import { describe, it, expect } from 'vitest'; +import { NextRequest } from 'next/server'; +import { createHmac } from 'crypto'; +import { POST } from '../app/webhooks/scrapfly/route'; + +const secret = process.env.SCRAPFLY_WEBHOOK_SECRET!; + +function generateScrapflySignature(rawBody: string, secret: string): string { + return createHmac('sha256', secret).update(rawBody).digest('hex').toUpperCase(); +} + +function makeRequest(body: string, headers: Record = {}): NextRequest { + return new NextRequest('http://localhost:3000/webhooks/scrapfly', { + method: 'POST', + headers: { 'Content-Type': 'application/json', ...headers }, + body, + }); +} + +describe('Scrapfly Webhook Endpoint (Next.js)', () => { + it('returns 401 when signature header is missing', async () => { + const body = JSON.stringify({ result: { status_code: 200 } }); + const res = await POST(makeRequest(body, { + 'X-Scrapfly-Webhook-Resource-Type': 'scrape', + })); + + expect(res.status).toBe(401); + expect(await res.json()).toEqual({ error: 'Invalid signature' }); + }); + + it('returns 401 when signature is invalid', async () => { + const body = JSON.stringify({ result: { status_code: 200 } }); + const res = await POST(makeRequest(body, { + 'X-Scrapfly-Webhook-Signature': 'DEADBEEF', + 'X-Scrapfly-Webhook-Resource-Type': 'scrape', + })); + + expect(res.status).toBe(401); + }); + + it('returns 401 when body is tampered after signing', async () => { + const originalBody = JSON.stringify({ result: { status_code: 200 } }); + const sig = generateScrapflySignature(originalBody, secret); + + const tampered = JSON.stringify({ result: { status_code: 500 } }); + const res = await POST(makeRequest(tampered, { + 'X-Scrapfly-Webhook-Signature': sig, + 'X-Scrapfly-Webhook-Resource-Type': 'scrape', + })); + + expect(res.status).toBe(401); + }); + + it('returns 200 for a valid scrape webhook', async () => { + const body = JSON.stringify({ + result: { url: 'https://web-scraping.dev/products', status_code: 200 }, + context: { + webhook: { name: 'my-webhook', secret, consecutive_failed_count: 0 }, + job: { uuid: '550e8400-e29b-41d4-a716-446655440000' }, + }, + }); + const sig = generateScrapflySignature(body, secret); + + const res = await POST(makeRequest(body, { + 'X-Scrapfly-Webhook-Signature': sig, + 'X-Scrapfly-Webhook-Resource-Type': 'scrape', + 'X-Scrapfly-Webhook-Id': 'wh_test_1', + 'X-Scrapfly-Webhook-Job-Id': 'job_test_1', + })); + + expect(res.status).toBe(200); + expect(await res.json()).toEqual({ received: true }); + }); + + it.each([ + 'scrape', + 'extraction', + 'screenshot', + ])('returns 200 for resource type %s', async (resourceType) => { + const body = JSON.stringify({ result: { status_code: 200 } }); + const sig = generateScrapflySignature(body, secret); + + const res = await POST(makeRequest(body, { + 'X-Scrapfly-Webhook-Signature': sig, + 'X-Scrapfly-Webhook-Resource-Type': resourceType, + })); + + expect(res.status).toBe(200); + }); + + it.each([ + 'crawler_started', + 'crawler_url_visited', + 'crawler_url_discovered', + 'crawler_url_skipped', + 'crawler_url_failed', + 'crawler_stopped', + 'crawler_cancelled', + 'crawler_finished', + ])('returns 200 for crawler event %s', async (event) => { + const body = JSON.stringify({ + event, + payload: { + crawler_uuid: '550e8400-e29b-41d4-a716-446655440000', + url: 'https://web-scraping.dev/page', + }, + }); + const sig = generateScrapflySignature(body, secret); + + const res = await POST(makeRequest(body, { + 'X-Scrapfly-Webhook-Signature': sig, + 'X-Scrapfly-Crawl-Event-Name': event, + })); + + expect(res.status).toBe(200); + }); + + it('accepts a lowercase-hex signature (X-Scrapfly-Webhook-Signature-Lowercase variant)', async () => { + const body = JSON.stringify({ result: { status_code: 200 } }); + const sig = generateScrapflySignature(body, secret).toLowerCase(); + + const res = await POST(makeRequest(body, { + 'X-Scrapfly-Webhook-Signature': sig, + 'X-Scrapfly-Webhook-Resource-Type': 'scrape', + })); + + expect(res.status).toBe(200); + }); + + it('returns 400 when body is not valid JSON', async () => { + const malformed = '{ not json'; + const sig = generateScrapflySignature(malformed, secret); + + const res = await POST(makeRequest(malformed, { + 'X-Scrapfly-Webhook-Signature': sig, + 'X-Scrapfly-Webhook-Resource-Type': 'scrape', + })); + + expect(res.status).toBe(400); + expect(await res.json()).toEqual({ error: 'Invalid JSON payload' }); + }); + + it('returns 500 if SCRAPFLY_WEBHOOK_SECRET is not set', async () => { + const original = process.env.SCRAPFLY_WEBHOOK_SECRET; + delete process.env.SCRAPFLY_WEBHOOK_SECRET; + try { + const res = await POST(makeRequest('{}')); + expect(res.status).toBe(500); + expect(await res.json()).toEqual({ error: 'Webhook secret not configured' }); + } finally { + process.env.SCRAPFLY_WEBHOOK_SECRET = original; + } + }); +}); diff --git a/skills/scrapfly-webhooks/examples/nextjs/tsconfig.json b/skills/scrapfly-webhooks/examples/nextjs/tsconfig.json new file mode 100644 index 0000000..e7ff90f --- /dev/null +++ b/skills/scrapfly-webhooks/examples/nextjs/tsconfig.json @@ -0,0 +1,26 @@ +{ + "compilerOptions": { + "lib": ["dom", "dom.iterable", "esnext"], + "allowJs": true, + "skipLibCheck": true, + "strict": true, + "noEmit": true, + "esModuleInterop": true, + "module": "esnext", + "moduleResolution": "bundler", + "resolveJsonModule": true, + "isolatedModules": true, + "jsx": "preserve", + "incremental": true, + "plugins": [ + { + "name": "next" + } + ], + "paths": { + "@/*": ["./*"] + } + }, + "include": ["next-env.d.ts", "**/*.ts", "**/*.tsx", ".next/types/**/*.ts"], + "exclude": ["node_modules"] +} diff --git a/skills/scrapfly-webhooks/examples/nextjs/vitest.config.ts b/skills/scrapfly-webhooks/examples/nextjs/vitest.config.ts new file mode 100644 index 0000000..6ae7f37 --- /dev/null +++ b/skills/scrapfly-webhooks/examples/nextjs/vitest.config.ts @@ -0,0 +1,17 @@ +import { defineConfig } from 'vitest/config'; +import react from '@vitejs/plugin-react'; +import path from 'path'; + +export default defineConfig({ + plugins: [react()], + test: { + globals: true, + environment: 'node', + setupFiles: ['./test/setup.ts'], + }, + resolve: { + alias: { + '@': path.resolve(__dirname, './'), + }, + }, +}); diff --git a/skills/scrapfly-webhooks/references/overview.md b/skills/scrapfly-webhooks/references/overview.md new file mode 100644 index 0000000..3f6a23e --- /dev/null +++ b/skills/scrapfly-webhooks/references/overview.md @@ -0,0 +1,124 @@ +# Scrapfly Webhooks Overview + +## What Are Scrapfly Webhooks? + +Scrapfly is a web scraping API. When you submit a long-running job (async scrape, extraction, screenshot, or a Crawler run), Scrapfly delivers the result to a webhook endpoint you configure in the dashboard. + +A webhook is identified by a **name** in the dashboard. You attach it to a request by passing `webhook_name=` on the Scrape / Extraction / Screenshot API call, or by configuring it on the Crawler job. Scrapfly then POSTs the result (or, for the Crawler, lifecycle events) to your endpoint. + +## Resource Types + +The `X-Scrapfly-Webhook-Resource-Type` header tells you which product the delivery came from. Use it to dispatch when one endpoint handles multiple Scrapfly products: + +| Resource Type | Triggered When | Common Use Cases | +|---------------|----------------|------------------| +| `scrape` | An async Scrape API job finishes | Save HTML / extracted fields, kick off downstream parsing | +| `extraction` | An async Extraction API job finishes | Persist structured data, enqueue follow-up enrichment | +| `screenshot` | An async Screenshot API job finishes | Store image URL, notify users, generate thumbnails | + +The body of a `scrape` / `extraction` / `screenshot` webhook is the full JSON response of the corresponding synchronous API call with a `context` overlay added: + +```json +{ + "...api_response": "...", + "context": { + "...api_context": "...", + "webhook": { + "name": "my-webhook", + "secret": "", + "consecutive_failed_count": 0 + }, + "job": { + "uuid": "550e8400-e29b-41d4-a716-446655440000" + } + } +} +``` + +The webhook overlay always carries: + +- `context.webhook.name` — webhook name configured in the dashboard +- `context.webhook.secret` — the signing secret (**never log or echo this field**) +- `context.webhook.consecutive_failed_count` — current consecutive-failure count +- `context.job.uuid` — job UUID (same value as `X-Scrapfly-Webhook-Job-Id`) + +Product-specific fields (such as `result.content`, `result.data`, `result.screenshot_url`, or the API's own `context.url`) come from the underlying API response — see the [Scrape](https://scrapfly.io/docs/scrape-api/getting-started), [Extraction](https://scrapfly.io/docs/extraction-api/getting-started), and [Screenshot](https://scrapfly.io/docs/screenshot-api/getting-started) getting-started pages for shapes. + +## Crawler Events + +The Crawler API is a separate product that delivers **lifecycle events** rather than a single result. Each event has an `event` field in the body (and an `X-Scrapfly-Crawl-Event-Name` header): + +| Event | Default? | Triggered When | +|-------|----------|----------------| +| `crawler_started` | Yes | Crawl job started | +| `crawler_stopped` | Yes | The crawl stopped (budget/limit reached) | +| `crawler_cancelled` | Yes | The crawl was cancelled | +| `crawler_finished` | Yes | The crawl ran to completion | +| `crawler_url_visited` | Opt-in | A URL was fetched successfully | +| `crawler_url_discovered` | Opt-in | A new URL was added to the queue | +| `crawler_url_skipped` | Opt-in | A URL was skipped (deduped, filtered) | +| `crawler_url_failed` | Opt-in | A URL fetch failed | + +By default Scrapfly only delivers the four lifecycle events: `crawler_started`, `crawler_stopped`, `crawler_cancelled`, `crawler_finished`. The per-URL events (`crawler_url_visited`, `crawler_url_discovered`, `crawler_url_skipped`, `crawler_url_failed`) are high-volume and must be enabled explicitly via the `webhook_events` parameter when submitting the crawl job. + +Example Crawler payload: + +```json +{ + "event": "crawler_url_visited", + "payload": { + "crawler_uuid": "550e8400-e29b-41d4-a716-446655440000", + "url": "https://web-scraping.dev/page", + "status_code": 200, + "depth": 1, + "state": { + "urls_visited": 42, + "urls_to_crawl": 158, + "api_credit_used": 420 + } + } +} +``` + +## Common Headers + +| Header | Description | +|--------|-------------| +| `X-Scrapfly-Webhook-Signature` | HMAC-SHA256 of the raw body, **uppercase hex** | +| `X-Scrapfly-Webhook-Signature-Lowercase` | Same signature in lowercase hex | +| `X-Scrapfly-Webhook-Id` | Unique webhook delivery ID — use for idempotency | +| `X-Scrapfly-Webhook-Name` | Name of the webhook configured in the dashboard | +| `X-Scrapfly-Webhook-Resource-Type` | `scrape`, `extraction`, or `screenshot` | +| `X-Scrapfly-Webhook-Job-Id` | Job UUID returned at enqueue time — reconciliation key | +| `X-Scrapfly-Webhook-Env` | Environment label (`test` or `live`) | +| `X-Scrapfly-Webhook-Project` | Project name | +| `X-Scrapfly-Crawl-Event-Name` | Crawler API event name (e.g. `crawler_finished`) | +| `X-Scrapfly-Log-Uuid` / `X-Scrapfly-Log-Url` | Pointers to the Scrapfly log entry for the delivery | + +## Delivery & Retries + +Scrapfly delivery is **at-least-once**. Use `X-Scrapfly-Webhook-Job-Id` as your idempotency key — duplicates carry the same job UUID. + +Retry schedule on non-2xx responses (or timeout): + +| Attempt | Delay after previous | +|---------|----------------------| +| 1 | initial delivery | +| 2 | 30 s | +| 3 | 1 min | +| 4 | 5 min | +| 5 | 30 min | +| 6 | 1 h | +| 7 | 1 d | + +After **100 consecutive failures** Scrapfly automatically **disables** the webhook — no further deliveries are attempted until you re-enable it in the dashboard. Because of this, handlers should: + +- Return 2xx as soon as the signature is verified and the job is enqueued. +- Surface processing errors out-of-band (logs, alerts, dead-letter queue) rather than 5xx-ing back to Scrapfly. + +## Full Event Reference + +- [Scrape API webhook](https://scrapfly.io/docs/scrape-api/webhook) +- [Extraction API webhook](https://scrapfly.io/docs/extraction-api/webhook) +- [Screenshot API webhook](https://scrapfly.io/docs/screenshot-api/webhook) +- [Crawler API getting started](https://scrapfly.io/docs/crawler-api/getting-started) diff --git a/skills/scrapfly-webhooks/references/setup.md b/skills/scrapfly-webhooks/references/setup.md new file mode 100644 index 0000000..62e98b0 --- /dev/null +++ b/skills/scrapfly-webhooks/references/setup.md @@ -0,0 +1,71 @@ +# Setting Up Scrapfly Webhooks + +## Prerequisites + +- A Scrapfly account ([sign up](https://scrapfly.io)) +- A **paid Scrapfly plan**. Webhooks are not available on the FREE plan — its webhook queue size is 0, so no deliveries are ever dispatched even after configuration. Any paid tier enables delivery. +- A publicly reachable webhook endpoint URL (use [Hookdeck CLI](https://hookdeck.com/docs/cli) for local development) + +## Create a Webhook in the Scrapfly Dashboard + +1. Sign in to your Scrapfly dashboard at [scrapfly.io](https://scrapfly.io). +2. Go to **Webhooks** in the navigation. +3. Click **Create Webhook**. +4. Fill in: + - **Name** — A short identifier. You will pass this as `webhook_name=` on API calls. Names are scoped per project + environment. + - **URL** — Your endpoint, e.g. `https://your-app.example.com/webhooks/scrapfly`. + - (Optional) Environment / project scoping. +5. Save the webhook. Scrapfly will display a **signing secret** — copy it. The dashboard is the only place this secret is shown. + +## Configure the Signing Secret in Your App + +Add the secret to your `.env`: + +```bash +SCRAPFLY_WEBHOOK_SECRET= +``` + +Use it **exactly as shown** in the dashboard. Do not trim, base64-decode, or otherwise transform it — Scrapfly treats it as a raw UTF-8 string. + +## Trigger a Delivery + +### Scrape / Extraction / Screenshot APIs + +Pass `webhook_name` on the API call. Example for the Scrape API: + +```bash +curl "https://api.scrapfly.io/scrape?key=$SCRAPFLY_KEY&url=https://web-scraping.dev/products&webhook_name=my-webhook&async=true" +``` + +The call returns immediately with a `job_uuid`. When the job finishes, Scrapfly POSTs the result to your endpoint with: + +- `X-Scrapfly-Webhook-Resource-Type: scrape` +- `X-Scrapfly-Webhook-Job-Id: ` +- `X-Scrapfly-Webhook-Signature: ` + +The same pattern works for `https://api.scrapfly.io/extraction` (resource type `extraction`) and `https://api.scrapfly.io/screenshot` (resource type `screenshot`). + +### Crawler API + +Attach a webhook to a Crawler job when you submit it. Scrapfly will POST lifecycle events (`crawler_started`, `crawler_url_visited`, ..., `crawler_finished`) to your endpoint. The event name is also in the body's `event` field and in `X-Scrapfly-Crawl-Event-Name`. + +## Verify Locally with Hookdeck CLI + +No account or install needed: + +```bash +# Forward incoming webhooks to your local server +npx hookdeck-cli listen 3000 scrapfly --path /webhooks/scrapfly +``` + +The CLI prints a public URL — paste that into the **URL** field when creating the webhook in the Scrapfly dashboard. Trigger a job with `async=true&webhook_name=` and watch the request appear in the Hookdeck UI. + +## Environments + +Scrapfly webhooks are scoped per **project** and **environment**. The delivery includes `X-Scrapfly-Webhook-Env` and `X-Scrapfly-Webhook-Project` headers so you can keep one endpoint for multiple environments. + +## Reference + +- [Scrape API webhook docs](https://scrapfly.io/docs/scrape-api/webhook) +- [Extraction API webhook docs](https://scrapfly.io/docs/extraction-api/webhook) +- [Screenshot API webhook docs](https://scrapfly.io/docs/screenshot-api/webhook) diff --git a/skills/scrapfly-webhooks/references/verification.md b/skills/scrapfly-webhooks/references/verification.md new file mode 100644 index 0000000..e8994d7 --- /dev/null +++ b/skills/scrapfly-webhooks/references/verification.md @@ -0,0 +1,106 @@ +# Scrapfly Signature Verification + +## How It Works + +Scrapfly signs every webhook with **HMAC-SHA256** over the **raw request body bytes**. The digest is emitted as **uppercase hexadecimal** in the `X-Scrapfly-Webhook-Signature` header. A duplicate lowercase variant is sent as `X-Scrapfly-Webhook-Signature-Lowercase` for runtimes that normalise headers. + +There is **no timestamp** in the scheme and **no replay window** — treat the signature as authenticity-only. (If you need replay protection, gate processing on the `X-Scrapfly-Webhook-Id` header or the job UUID.) + +## Algorithm + +``` +signature = upper(hex(HMAC_SHA256(secret_utf8, raw_body_bytes))) +``` + +Compare with `received == signature` using a constant-time comparison. + +## Implementation + +Scrapfly does not publish an SDK for webhook verification — implementations follow the documented algorithm manually. + +### Node.js / Express / Next.js + +```javascript +const crypto = require('crypto'); + +function verifyScrapflySignature(rawBody, signatureHeader, secret) { + if (!signatureHeader || !secret) return false; + + const expected = crypto + .createHmac('sha256', secret) + .update(rawBody) + .digest('hex') + .toUpperCase(); + + const received = signatureHeader.toUpperCase(); + + try { + return crypto.timingSafeEqual( + Buffer.from(received, 'hex'), + Buffer.from(expected, 'hex') + ); + } catch { + return false; + } +} +``` + +Notes: +- `rawBody` must be a `Buffer` (Express) or the raw `string` from `await request.text()` (Next.js). **Never** `JSON.parse` and re-stringify — that mutates whitespace/key order and breaks the signature. +- `crypto.timingSafeEqual` requires equal-length buffers; the `try/catch` swallows length mismatches so the function returns `false` rather than throwing. + +### Python / FastAPI + +```python +import hmac +import hashlib + +def verify_scrapfly_signature(raw_body: bytes, signature_header: str, secret: str) -> bool: + if not signature_header or not secret: + return False + + expected = hmac.new( + secret.encode('utf-8'), + raw_body, + hashlib.sha256, + ).hexdigest().upper() + + return hmac.compare_digest(expected, signature_header.upper()) +``` + +Notes: +- Use `await request.body()` in FastAPI to get `bytes`. Do not call `await request.json()` before verifying. +- `hmac.compare_digest` is the documented constant-time comparator. + +## Security: Do Not Log the Raw Payload + +Scrapfly echoes the webhook signing secret in the body at `context.webhook.secret`. This is unusual compared to other providers and easy to miss. + +- **Never** log the raw payload, dump it to stdout in production, or forward it to third-party tools (Sentry, Datadog, Slack, etc.) without redacting `context.webhook.secret` first. +- If you persist webhooks for replay/debugging, strip or redact `context.webhook.secret` before storage. +- Anyone with the secret can forge valid signatures for your endpoint. + +```javascript +// Redact before logging / forwarding +const safe = { ...payload, context: { ...payload.context, webhook: { ...payload.context?.webhook, secret: '[REDACTED]' } } }; +``` + +## Common Gotchas + +- **Parsed JSON breaks signatures.** Verify against the exact bytes Scrapfly sent. In Express, mount `express.raw({ type: '*/*' })` on the webhook route (not `express.json`). In Next.js App Router, read with `await request.text()`. In FastAPI, use `await request.body()`. +- **Case of the hex digest.** Scrapfly's primary header is uppercase, but the `-Lowercase` variant exists for a reason. Always normalise both sides before comparing (the snippets above use `.toUpperCase()` / `.upper()`). +- **Header casing in HTTP frameworks.** HTTP header names are case-insensitive. Express lowercases everything; Next.js's `headers.get(...)` is also case-insensitive. Read `x-scrapfly-webhook-signature`. +- **No timestamp tolerance.** Don't reject for old timestamps — there isn't one. If you need replay protection, dedupe on `X-Scrapfly-Webhook-Id`. +- **Secret format.** Use the dashboard string verbatim. There is no `whsec_` prefix to strip and no base64 decode step. +- **Body encoding.** The HMAC is over bytes, not text. Avoid any middleware that transforms encoding (gzip middleware, BOM strippers, etc.) on the route. + +## Debugging Verification Failures + +1. **Log both signatures side-by-side** (the computed expected and the received header) — they should be identical, byte for byte, after normalising case. +2. **Log the body length** received vs. the `Content-Length` header. A mismatch means a middleware ate the body. +3. **Hash a known string with your secret** locally and compare with Scrapfly's documented Python sample: + ```python + hmac.new(b'YOUR-SECRET', b'{"data": "example"}', hashlib.sha256).hexdigest().upper() + ``` +4. **Check the right header.** `X-Scrapfly-Webhook-Signature` (uppercase hex) — not `Signature`, not `X-Signature`, not `webhook-signature`. +5. **Confirm you're using the right secret.** Webhooks are scoped per project + environment; the dashboard shows the secret for the specific webhook.