Open-source web + video crawling toolkit for AI
Quickstart • Features • Why CrawlKit • API • Self-Host • Managed API
Every AI app needs web data. But current tools force you to choose: web OR video, fast OR accurate, simple OR powerful.
CrawlKit does it all in one API call:
from crawlkit import CrawlKit
ck = CrawlKit()
# Webpage → structured data + RAG chunks
page = ck.scrape("https://vnexpress.net/some-article")
print(page.content_type)  # "news"
print(page.chunks)        # 15 RAG-ready chunks

# Video → transcript + metadata (same API!)
video = ck.scrape("https://youtube.com/watch?v=abc123")
print(video.transcript)   # Full text transcript
print(video.duration)     # 1344 seconds

| Feature | CrawlKit | Crawl4AI | Firecrawl | Jina Reader |
|---|---|---|---|---|
| Web crawling | ✅ | ✅ | ✅ | ✅ |
| YouTube transcripts | ✅ | ❌ | ❌ | ❌ |
| TikTok extraction | ✅ | ❌ | ❌ | ❌ |
| Facebook Video | ✅ | ❌ | ❌ | ❌ |
| PDF + OCR | ✅ | ❌ | ❌ | ❌ |
| NLP extraction | ✅ | ❌ | ❌ | ❌ |
| Anti-bot stealth | ✅ | ✅ | ✅ | ❌ |
| Screenshot capture | ✅ | ✅ | ✅ | ❌ |
| RAG-ready chunks | ✅ | ✅ | ❌ | ❌ |
| Domain-specific parsers | ✅ 10+ | ❌ | ❌ | ❌ |
| URL monitoring | ✅ | ❌ | ❌ | ❌ |
| Self-hostable | ✅ | ✅ | ✅ | ✅ |
| Open source | ✅ Apache 2.0 | ✅ | ✅ | ✅ |
TL;DR: CrawlKit = Crawl4AI + video support + OCR + NLP + domain parsers.
pip install crawlkit
playwright install chromium

from crawlkit import CrawlKit
ck = CrawlKit()
# Any webpage
result = ck.scrape("https://example.com")
print(result.markdown)
# YouTube video → transcript
result = ck.scrape("https://youtube.com/watch?v=dQw4w9WgXcQ")
print(result.transcript)
# With RAG chunking
result = ck.scrape("https://example.com", chunk=True)
for chunk in result.chunks:
    print(f"[{chunk.token_estimate} tokens] {chunk.content[:80]}...")

git clone https://github.com/Paparusi/crawlkit.git
cd crawlkit
cp .env.example .env
docker compose up -d
# API available at http://localhost:8000

No setup needed. Get a free API key at crawlkit.org
curl -X POST https://api.crawlkit.org/v1/scrape \
-H "Authorization: Bearer ck_free_xxx" \
-H "Content-Type: application/json" \
-d '{"url": "https://youtube.com/watch?v=abc123"}'

- Smart rendering – Auto-detects static vs JS-heavy pages (httpx → Playwright fallback)
- Anti-bot stealth – Playwright-stealth, random fingerprints, human-like delays
- Domain parsers – 10+ site-specific parsers for structured extraction
- Batch crawling – Scrape hundreds of URLs in one request
- YouTube – Full transcripts, chapters, metadata, tags (2-3s per video)
- TikTok – Captions, hashtags, engagement metrics
- Facebook Video – Metadata + captions
- No video download – Text extraction only. Zero bandwidth waste.
- RAG chunks – Smart chunking by content type (legal → by article, news → by paragraph)
- NLP extraction – Entities (people, orgs, locations) + keywords
- Token estimation – Each chunk tagged with token count
- Multiple formats – JSON, Markdown, plain text
- PDF parsing – Text, tables, metadata extraction
- Scanned PDF → text – Auto-detect + OCR (EasyOCR)
- 50MB max – Handles large documents
- Full-page screenshots – PNG/JPEG, viewport or full-page
- URL monitoring – Watch pages for changes, webhook notifications
- Change detection – SHA256 hash comparison
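Change detection boils down to hashing each fetch and comparing against the last stored hash. A minimal sketch of that idea, assuming you normalize whitespace before hashing (the helper names are illustrative, not CrawlKit internals):

```python
import hashlib

def content_hash(text: str) -> str:
    """Hash normalized page text so whitespace-only diffs don't trigger alerts."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def has_changed(new_text: str, stored_hash: str) -> bool:
    """True when the page content differs from the last stored snapshot."""
    return content_hash(new_text) != stored_hash

h = content_hash("Hello   world")
print(has_changed("Hello world", h))  # False: same content, different spacing
print(has_changed("Hello mars", h))   # True
```

Normalizing before hashing avoids spurious notifications when only markup whitespace shifts between fetches.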
CrawlKit auto-detects the site and applies the right parser:
| Parser | Site | What You Get |
|---|---|---|
| `youtube` | YouTube | Transcript, chapters, duration, views, tags |
| `tiktok` | TikTok | Caption, hashtags, music, engagement |
| `facebook_video` | Facebook | Video metadata, captions |
| `tvpl` | thuvienphapluat.vn | Legal documents, articles, clauses |
| `vbpl` | vbpl.vn | Government legal database |
| `vnexpress` | vnexpress.net | News articles, clean text |
| `batdongsan` | batdongsan.com.vn | Property listings, prices |
| `cafef` | cafef.vn | Financial news, stock data |
| `github` | github.com | Repo/file content |
| `pdf` | Any `.pdf` URL | Text, tables, metadata |
| Generic | Any URL | Clean markdown + structured data |
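Auto-detection can be pictured as a hostname-to-parser lookup. This sketch mirrors the table above; the dispatch code itself is illustrative, not CrawlKit's implementation:

```python
from urllib.parse import urlparse

# Hostname → parser name, mirroring the parser table (illustrative sketch)
PARSERS = {
    "youtube.com": "youtube",
    "tiktok.com": "tiktok",
    "facebook.com": "facebook_video",
    "thuvienphapluat.vn": "tvpl",
    "vbpl.vn": "vbpl",
    "vnexpress.net": "vnexpress",
    "batdongsan.com.vn": "batdongsan",
    "cafef.vn": "cafef",
    "github.com": "github",
}

def pick_parser(url: str) -> str:
    """Return a parser name for a URL, falling back to the generic parser."""
    if url.lower().split("?")[0].endswith(".pdf"):
        return "pdf"
    host = (urlparse(url).hostname or "").removeprefix("www.")
    return PARSERS.get(host, "generic")

print(pick_parser("https://www.youtube.com/watch?v=abc123"))  # youtube
print(pick_parser("https://example.com/report.pdf"))          # pdf
print(pick_parser("https://example.com/blog"))                # generic
```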
Building a custom parser? See CONTRIBUTING.md – PRs welcome!
curl -X POST http://localhost:8000/v1/scrape \
-H "Content-Type: application/json" \
-d '{
"url": "https://youtube.com/watch?v=abc123",
"chunk": true,
"nlp": true,
"stealth": false,
"screenshot": false,
"ocr": false
}'

📦 Response
{
"success": true,
"data": {
"url": "https://youtube.com/watch?v=abc123",
"title": "Video Title",
"content_type": "video",
"parser_used": "youtube",
"crawl_time_ms": 2150,
"markdown": "# Video Title\n\n...",
"structured": {
"transcript": "Full transcript text...",
"duration": 1344,
"views": 125000,
"chapters": [...]
},
"chunks": [
{
"content": "...",
"metadata": {"timestamp": "00:00", "section": "intro"},
"token_estimate": 487
}
],
"nlp": {
"language": "vi",
"entities": {"people": [], "organizations": [], "locations": []},
"keywords": ["keyword1", "keyword2"]
}
}
}

Batch scraping:

{
"urls": ["https://url1.com", "https://url2.com"],
"chunk": true,
"delay": 1.5
}

Search within a supported source:

{
"source": "tvpl",
"query": "Doanh-nghiep",
"limit": 100
}

Screenshot capture:

{
"url": "https://example.com",
"full_page": true,
"format": "png"
}

URL monitoring:

{
"url": "https://example.com/page",
"webhook_url": "https://your-server.com/webhook",
"check_interval_minutes": 60
}

git clone https://github.com/Paparusi/crawlkit.git
cd crawlkit
cp .env.example .env
# Edit .env with your config
docker compose up -d

The API runs at http://localhost:8000. No external dependencies required.
🔑 Environment Variables
# Required
SUPABASE_URL=https://your-project.supabase.co
SUPABASE_SERVICE_KEY=your-service-key
CRAWLKIT_MASTER_KEY=your-master-key
# Optional
PLAYWRIGHT_BROWSERS_PATH=/ms-playwright
PORT=8000

Don't want to self-host? Use the managed API at crawlkit.org
| Plan | Price | Requests/day | Features |
|---|---|---|---|
| Free | $0 | 100 | All parsers, all formats |
| Starter | $19/mo | 2,000 | + Video, OCR, NLP, stealth |
| Pro | $79/mo | 20,000 | + Batch, monitoring, priority |
| Enterprise | Custom | Unlimited | + Dedicated infra, SLA |
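Whether self-hosted or managed, every endpoint returns the response envelope shown in the API section. A small helper for unpacking it; the field names come from the example response above, but the helper itself is an illustrative sketch:

```python
def extract_chunks(response: dict) -> list[str]:
    """Pull RAG-ready chunk texts out of a /v1/scrape response envelope."""
    if not response.get("success"):
        raise RuntimeError("scrape failed")
    return [c["content"] for c in response["data"].get("chunks", [])]

# Canned response mirroring the example in the API section
sample = {
    "success": True,
    "data": {
        "parser_used": "youtube",
        "chunks": [
            {"content": "Intro...", "metadata": {"section": "intro"}, "token_estimate": 487},
            {"content": "Main...", "metadata": {"section": "body"}, "token_estimate": 512},
        ],
    },
}
print(extract_chunks(sample))  # ['Intro...', 'Main...']
```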
We welcome contributions! See CONTRIBUTING.md for guidelines.
Easy wins:
- Add a new domain parser (see `crawlkit/parsers/` for examples)
- Improve extraction quality for existing parsers
- Add tests
- Fix bugs from Issues
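A new domain parser might look roughly like this. The base shape, method names, and registration mechanism here are hypothetical; check `crawlkit/parsers/` for the real interface:

```python
# Hypothetical parser shape; the real interface lives in crawlkit/parsers/
from dataclasses import dataclass, field

@dataclass
class ParseResult:
    title: str
    markdown: str
    structured: dict = field(default_factory=dict)

class ExampleNewsParser:
    """Parser for a fictional news site, example-news.test."""
    domains = ["example-news.test"]

    def matches(self, url: str) -> bool:
        return any(d in url for d in self.domains)

    def parse(self, html: str, url: str) -> ParseResult:
        # Real parsers would use a proper HTML parser; this sketch just grabs <title>
        title = html.split("<title>")[1].split("</title>")[0] if "<title>" in html else url
        return ParseResult(title=title, markdown=f"# {title}", structured={"url": url})

p = ExampleNewsParser()
print(p.matches("https://example-news.test/article/1"))  # True
```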
Apache 2.0 – Use it however you want. See LICENSE.
If CrawlKit helps your project, give it a star! It helps others discover the project.
Built with ❤️ for the AI community
Website • Issues • Discussions
