CrawlKit

Open-source web + video crawling toolkit for AI


Quickstart • Features • Why CrawlKit • API • Self-Host • Managed API →


🤔 Why CrawlKit?

Every AI app needs web data. But current tools force you to choose: web OR video, fast OR accurate, simple OR powerful.

CrawlKit does it all in one API call:

from crawlkit import CrawlKit

ck = CrawlKit()

# Webpage → structured data + RAG chunks
page = ck.scrape("https://vnexpress.net/some-article")
print(page.content_type)   # "news"
print(page.chunks)         # 15 RAG-ready chunks

# Video → transcript + metadata (same API!)
video = ck.scrape("https://youtube.com/watch?v=abc123")
print(video.transcript)    # Full text transcript
print(video.duration)      # 1344 seconds

⚡ Why not Crawl4AI / Firecrawl / Jina?

| Feature | CrawlKit | Crawl4AI | Firecrawl | Jina Reader |
| --- | --- | --- | --- | --- |
| Web crawling | ✅ | ✅ | ✅ | ✅ |
| YouTube transcripts | ✅ | ❌ | ❌ | ❌ |
| TikTok extraction | ✅ | ❌ | ❌ | ❌ |
| Facebook Video | ✅ | ❌ | ❌ | ❌ |
| PDF + OCR | ✅ | ❌ | ✅ | ❌ |
| NLP extraction | ✅ | ❌ | ❌ | ❌ |
| Anti-bot stealth | ✅ | ✅ | ✅ | ❌ |
| Screenshot capture | ✅ | ✅ | ✅ | ❌ |
| RAG-ready chunks | ✅ | ❌ | ✅ | ❌ |
| Domain-specific parsers | ✅ 10+ | ❌ | ❌ | ❌ |
| URL monitoring | ✅ | ❌ | ❌ | ❌ |
| Self-hostable | ✅ | ✅ | ❌ | ❌ |
| Open source | ✅ Apache 2.0 | ✅ | ❌ | ❌ |

TL;DR: CrawlKit = Crawl4AI + video support + OCR + NLP + domain parsers.

🚀 Quickstart

Option 1: pip install

pip install crawlkit
playwright install chromium
from crawlkit import CrawlKit

ck = CrawlKit()

# Any webpage
result = ck.scrape("https://example.com")
print(result.markdown)

# YouTube video → transcript
result = ck.scrape("https://youtube.com/watch?v=dQw4w9WgXcQ")
print(result.transcript)

# With RAG chunking
result = ck.scrape("https://example.com", chunk=True)
for chunk in result.chunks:
    print(f"[{chunk.token_estimate} tokens] {chunk.content[:80]}...")

Option 2: Docker (self-host API)

git clone https://github.com/Paparusi/crawlkit.git
cd crawlkit
cp .env.example .env

docker compose up -d
# API available at http://localhost:8000

Option 3: Managed API

No setup needed. Get a free API key at crawlkit.org

curl -X POST https://api.crawlkit.org/v1/scrape \
  -H "Authorization: Bearer ck_free_xxx" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://youtube.com/watch?v=abc123"}'

✨ Features

🌐 Web Crawling

  • Smart rendering — Auto-detects static vs JS-heavy pages (httpx → Playwright fallback)
  • Anti-bot stealth — Playwright-stealth, random fingerprints, human-like delays
  • Domain parsers — 10+ site-specific parsers for structured extraction
  • Batch crawling — Scrape hundreds of URLs in one request
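The static-vs-JS fallback can be pictured with a small heuristic: fetch with httpx first and escalate to a headless browser only when the HTML looks like an empty JS shell. This is an illustrative sketch — the `needs_browser` name and thresholds are assumptions, not CrawlKit's actual decision logic:

```python
import re

def needs_browser(html: str, min_text_chars: int = 200) -> bool:
    """Heuristic: escalate to a headless browser when the static HTML
    carries almost no visible text (typical of JS-rendered SPA shells)."""
    # Strip scripts, styles, and tags to approximate the visible text.
    stripped = re.sub(r"(?s)<(script|style)[^>]*>.*?</\1>", " ", html)
    text = re.sub(r"<[^>]+>", " ", stripped)
    text = re.sub(r"\s+", " ", text).strip()
    # SPA shells often ship a bare <div id="root"></div> plus JS bundles.
    spa_shell = bool(re.search(r'id=["\'](root|app)["\']', html))
    return len(text) < min_text_chars or (spa_shell and len(text) < 2 * min_text_chars)

static_html = "<html><body><article>" + "Real content. " * 50 + "</article></body></html>"
spa_html = '<html><body><div id="root"></div><script src="/bundle.js"></script></body></html>'
print(needs_browser(static_html))  # False → plain httpx fetch is enough
print(needs_browser(spa_html))     # True  → fall back to Playwright
```

With a heuristic like this, most sites stay on the cheap HTTP path and only SPA-style shells pay the browser-startup cost.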

🎬 Video Intelligence

  • YouTube — Full transcripts, chapters, metadata, tags (2-3s per video)
  • TikTok — Captions, hashtags, engagement metrics
  • Facebook Video — Metadata + captions
  • No video download — Text extraction only. Zero bandwidth waste.

🧠 AI-Ready Output

  • RAG chunks — Smart chunking by content type (legal → by article, news → by paragraph)
  • NLP extraction — Entities (people, orgs, locations) + keywords
  • Token estimation — Each chunk tagged with token count
  • Multiple formats — JSON, Markdown, plain text
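The chunk output shape (content plus token_estimate) can be made concrete with a minimal greedy paragraph packer using the common ~4 characters/token heuristic. This is a sketch of the shape only, not CrawlKit's content-type-aware chunker:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    content: str
    token_estimate: int  # rough heuristic: ~4 characters per token

def chunk_by_paragraph(text: str, max_tokens: int = 500) -> list[Chunk]:
    """Greedily pack paragraphs into chunks of at most ~max_tokens."""
    chunks, buf = [], []
    for para in (p.strip() for p in text.split("\n\n") if p.strip()):
        candidate = "\n\n".join(buf + [para])
        if buf and len(candidate) // 4 > max_tokens:
            content = "\n\n".join(buf)           # flush the full buffer
            chunks.append(Chunk(content, len(content) // 4))
            buf = [para]
        else:
            buf.append(para)
    if buf:                                      # flush the remainder
        content = "\n\n".join(buf)
        chunks.append(Chunk(content, len(content) // 4))
    return chunks

doc = "\n\n".join(f"Paragraph {i}. " + "word " * 120 for i in range(6))
for c in chunk_by_paragraph(doc):
    print(f"[{c.token_estimate} tokens] {c.content[:40]}...")
```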

📸 OCR + PDF

  • PDF parsing — Text, tables, metadata extraction
  • Scanned PDF → text — Auto-detect + OCR (EasyOCR)
  • 50MB max — Handles large documents

📷 Screenshot + Monitoring

  • Full-page screenshots — PNG/JPEG, viewport or full-page
  • URL monitoring — Watch pages for changes, webhook notifications
  • Change detection — SHA256 hash comparison
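Change detection by SHA256 comparison reduces to a few lines; the in-memory store and function name below are illustrative, not CrawlKit's implementation:

```python
import hashlib

_last_hash: dict[str, str] = {}  # url -> SHA256 of the last seen content

def content_changed(url: str, body: bytes) -> bool:
    """Return True when the page body differs from the previously seen one.
    The first sighting of a URL counts as a change."""
    digest = hashlib.sha256(body).hexdigest()
    changed = _last_hash.get(url) != digest
    _last_hash[url] = digest
    return changed

print(content_changed("https://example.com/page", b"<html>v1</html>"))  # True  (first sighting)
print(content_changed("https://example.com/page", b"<html>v1</html>"))  # False (no change)
print(content_changed("https://example.com/page", b"<html>v2</html>"))  # True  (changed)
```

A watcher would run this on the configured interval and fire the webhook whenever it returns True.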

🔌 Domain Parsers

CrawlKit auto-detects the site and applies the right parser:

| Parser | Site | What You Get |
| --- | --- | --- |
| youtube | YouTube | Transcript, chapters, duration, views, tags |
| tiktok | TikTok | Caption, hashtags, music, engagement |
| facebook_video | Facebook | Video metadata, captions |
| tvpl | thuvienphapluat.vn | Legal documents, articles, clauses |
| vbpl | vbpl.vn | Government legal database |
| vnexpress | vnexpress.net | News articles, clean text |
| batdongsan | batdongsan.com.vn | Property listings, prices |
| cafef | cafef.vn | Financial news, stock data |
| github | github.com | Repo/file content |
| pdf | Any .pdf URL | Text, tables, metadata |
| Generic | Any URL | Clean markdown + structured data |
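The auto-detection above can be sketched as host and path matching. The mapping and `pick_parser` function here are an illustration of the routing idea, not CrawlKit's actual dispatcher:

```python
from urllib.parse import urlparse

# Host suffix -> parser name, mirroring the table above.
_HOST_PARSERS = {
    "youtube.com": "youtube",
    "youtu.be": "youtube",
    "tiktok.com": "tiktok",
    "facebook.com": "facebook_video",
    "thuvienphapluat.vn": "tvpl",
    "vbpl.vn": "vbpl",
    "vnexpress.net": "vnexpress",
    "batdongsan.com.vn": "batdongsan",
    "cafef.vn": "cafef",
    "github.com": "github",
}

def pick_parser(url: str) -> str:
    parts = urlparse(url)
    host = (parts.hostname or "").removeprefix("www.")
    if parts.path.lower().endswith(".pdf"):   # any .pdf URL wins first
        return "pdf"
    for suffix, parser in _HOST_PARSERS.items():
        if host == suffix or host.endswith("." + suffix):
            return parser
    return "generic"

print(pick_parser("https://youtube.com/watch?v=abc123"))  # youtube
print(pick_parser("https://example.com/report.pdf"))      # pdf
print(pick_parser("https://example.com/"))                # generic
```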

Building a custom parser? See CONTRIBUTING.md β€” PRs welcome!

📡 API Reference

POST /v1/scrape — Scrape a URL

curl -X POST http://localhost:8000/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://youtube.com/watch?v=abc123",
    "chunk": true,
    "nlp": true,
    "stealth": false,
    "screenshot": false,
    "ocr": false
  }'

📦 Response
{
  "success": true,
  "data": {
    "url": "https://youtube.com/watch?v=abc123",
    "title": "Video Title",
    "content_type": "video",
    "parser_used": "youtube",
    "crawl_time_ms": 2150,
    "markdown": "# Video Title\n\n...",
    "structured": {
      "transcript": "Full transcript text...",
      "duration": 1344,
      "views": 125000,
      "chapters": [...]
    },
    "chunks": [
      {
        "content": "...",
        "metadata": {"timestamp": "00:00", "section": "intro"},
        "token_estimate": 487
      }
    ],
    "nlp": {
      "language": "vi",
      "entities": {"people": [], "organizations": [], "locations": []},
      "keywords": ["keyword1", "keyword2"]
    }
  }
}
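From Python, the response unpacks straightforwardly. The sample payload below mirrors the documented shape (the HTTP call itself is elided; no fields beyond those shown above are assumed):

```python
resp = {  # sample payload mirroring the documented /v1/scrape response
    "success": True,
    "data": {
        "title": "Video Title",
        "content_type": "video",
        "parser_used": "youtube",
        "structured": {"transcript": "Full transcript text...", "duration": 1344},
        "chunks": [
            {"content": "Intro...", "metadata": {"timestamp": "00:00"}, "token_estimate": 487}
        ],
        "nlp": {"language": "vi", "keywords": ["keyword1", "keyword2"]},
    },
}

if not resp["success"]:
    raise RuntimeError("scrape failed")

data = resp["data"]
print(data["parser_used"], data["content_type"])     # youtube video
transcript = data["structured"].get("transcript", "")
rag_texts = [c["content"] for c in data["chunks"]]   # ready for a vector DB
total_tokens = sum(c["token_estimate"] for c in data["chunks"])
print(total_tokens)                                  # 487
```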

POST /v1/batch — Batch scrape

{
  "urls": ["https://url1.com", "https://url2.com"],
  "chunk": true,
  "delay": 1.5
}

POST /v1/discover — Discover URLs from a site

{
  "source": "tvpl",
  "query": "Doanh-nghiep",
  "limit": 100
}

POST /v1/screenshot — Capture screenshot

{
  "url": "https://example.com",
  "full_page": true,
  "format": "png"
}

POST /v1/watch — Monitor URL for changes

{
  "url": "https://example.com/page",
  "webhook_url": "https://your-server.com/webhook",
  "check_interval_minutes": 60
}

GET /v1/health • GET /v1/parsers

🐳 Self-Hosting

git clone https://github.com/Paparusi/crawlkit.git
cd crawlkit
cp .env.example .env
# Edit .env with your config

docker compose up -d

The API runs at http://localhost:8000. Apart from the Supabase project configured in .env, no external services are required.

📋 Environment Variables

# Required
SUPABASE_URL=https://your-project.supabase.co
SUPABASE_SERVICE_KEY=your-service-key
CRAWLKIT_MASTER_KEY=your-master-key

# Optional
PLAYWRIGHT_BROWSERS_PATH=/ms-playwright
PORT=8000

☁️ Managed API

Don't want to self-host? Use the managed API at crawlkit.org

| Plan | Price | Requests/day | Features |
| --- | --- | --- | --- |
| Free | $0 | 100 | All parsers, all formats |
| Starter | $19/mo | 2,000 | + Video, OCR, NLP, stealth |
| Pro | $79/mo | 20,000 | + Batch, monitoring, priority |
| Enterprise | Custom | Unlimited | + Dedicated infra, SLA |

🤝 Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

Easy wins:

  • Add a new domain parser (see crawlkit/parsers/ for examples)
  • Improve extraction quality for existing parsers
  • Add tests
  • Fix bugs from Issues

📄 License

Apache 2.0 — Use it however you want. See LICENSE.

⭐ Star History

If CrawlKit helps your project, give it a star! It helps others discover the project.



Built with ❤️ for the AI community
Website • Issues • Discussions
