<This part is not AI generated>
I approached this project as a pipeline that can be broken into stages, mainly: Scraping, Processing (Text + Image) and Output generation. A shared context object is passed through the stages. Runs locally, only ANTHROPIC_API_KEY is required.
Scraping. No brand specific knowledge in the crawler. Start with a large pool, extract images and let the next stage filter out. Ignored js-rendered pages for now, but would add playwright for the next iteration.
Processing (Text + Image). Text and Image processing as separate stages. Text with LLMs (Summarize chunks by Haiku, extract brand voice by Sonnet). For images, CLIP and phash for filtering, clustering with UMAP + k-selection, basically.
Output. JSON obj and turn it into PDF.
AI Usage
I used Claude Code to accelerate the process.
- Turning architectural notes into implementation
- Boilerplate, tests, fixtures
- Optimization and edge case hunting
What I manually verified:
- Full pipeline on 10+ brands
- Tweaked params and finetuned
- Error boundaries
If I had time
I would add
- FashionCLIP support
- Better color palette matching (skin/tan fabric issues, object bg detection etc)
- Cross-modal coherence check (visual cluster labels vs brand voice embeddings)
- Playwright auto-trigger when crawl yield is below threshold (on JS-heavy storefronts)
- GPU auto-detection (CPU is fine at ~150 images, breaks above ~2000)
- Stage-per-worker architecture with a job queue between them (each scaled independently)
</This part is not AI generated>
The brief: give it a brand name and URL, get back a structured Brand DNA (color palette, garment mix, aesthetic clusters, brand voice) as a PDF and a JSON sidecar.
The interesting constraint here was that the system had to work on any fashion brand without code changes. No per-site selectors, no scrapers tuned to a specific DOM. That ruled out the obvious approach and forced a more principled one.
The solution is a brand-agnostic pipeline: crawl generically, let CLIP zero-shot classify what's fashion, deduplicate in embedding space, cluster aesthetically, then use LLMs to synthesize brand voice from the text corpus. Each stage reads from and writes to a shared BrandContext. A failing stage doesn't abort the run; you get a partial dossier instead.
pip install -e ".[dev]"
export ANTHROPIC_API_KEY="your-key"
make run BRAND=cos # full pipeline
make render-pdf BRAND=cos # re-render PDF from existing run, no crawl
make test # unit tests + eval regression gate
make eval # offline quality report on existing dossiers

brands:
  - id: my_brand
    name: My Brand
    domain: https://my-brand.com
    social:
      instagram: my_brand_handle # optional

No code changes.
brands.yaml → CLI → Pipeline
├── CrawlStage sitemap + BFS → images + text corpus
├── SocialStage public Instagram OG metadata, graceful
├── VisionStage CLIP filter → dedup → color → embed → hero → garment+patterns → cluster
├── TextAnalysisStage Haiku map → Sonnet synthesis + cluster labels
└── PDFStage dossier.json + dossier.pdf
Every stage implements a two-method protocol: name and run(ctx) → ctx. The pipeline wraps each in try/except. Failures land in ctx.failures and manifest.json. State flows through a single BrandContext object.
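The protocol and the failure isolation described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the BrandContext fields shown and the two stub stages are assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class BrandContext:
    # Illustrative fields only; the real BrandContext carries much more state.
    brand_id: str
    images: list[str] = field(default_factory=list)
    failures: dict[str, str] = field(default_factory=dict)


class Stage(Protocol):
    name: str

    def run(self, ctx: BrandContext) -> BrandContext: ...


class Pipeline:
    def __init__(self, stages: list[Stage]) -> None:
        self.stages = stages

    def run(self, ctx: BrandContext) -> BrandContext:
        for stage in self.stages:
            try:
                ctx = stage.run(ctx)
            except Exception as exc:
                # Record the failure and continue with the next stage.
                ctx.failures[stage.name] = f"{type(exc).__name__}: {exc}"
        return ctx


class BrokenStub:  # hypothetical stage that always fails
    name = "social"

    def run(self, ctx: BrandContext) -> BrandContext:
        raise RuntimeError("login wall")


class CrawlStub:  # hypothetical stage that succeeds
    name = "crawl"

    def run(self, ctx: BrandContext) -> BrandContext:
        ctx.images.append("a.jpg")
        return ctx


result = Pipeline([BrokenStub(), CrawlStub()]).run(BrandContext("cos"))
```

Here the failing stage leaves an entry in result.failures while the later stage still contributes its image, which is the partial-dossier behavior the text describes.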
Dumb crawler, smart filter. No per-site selectors. Sitemap (indexes recursed, locale duplicates collapsed to one canonical path) → BFS fallback, collect everything, CLIP zero-shot decides what's fashion. You pull more than you need, but inference is local and fast.
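One way to collapse locale duplicates from a sitemap is to strip a leading locale segment and keep the first URL seen per canonical path. The sketch below assumes locales appear as a leading path segment such as /en/ or /en-gb/; the regex and function names are hypothetical, and a real two-letter section name (for example /tv/) would be stripped by mistake.

```python
import re
from urllib.parse import urlsplit

# Assumption: a locale is a leading segment like /en, /de, /en-gb or /pt_BR.
_LOCALE_PREFIX = re.compile(r"^/[a-z]{2}(?:[-_][a-zA-Z]{2})?(?=/|$)")


def canonical_path(url: str) -> str:
    """Return the URL path with any leading locale segment removed."""
    path = urlsplit(url).path
    stripped = _LOCALE_PREFIX.sub("", path, count=1)
    return stripped.rstrip("/") or "/"


def collapse_locales(urls):
    """Keep the first URL seen for each canonical path, preserving order."""
    seen = {}
    for url in urls:
        seen.setdefault(canonical_path(url), url)
    return list(seen.values())


urls = [
    "https://example.com/en-gb/women/coats",
    "https://example.com/de/women/coats",
    "https://example.com/women/coats/",
    "https://example.com/en/men",
]
unique = collapse_locales(urls)
```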
One embedding space. OpenCLIP ViT-B/32 runs filtering, dedup, garment scoring, and clustering. A cosine threshold belongs to the space it was calibrated in: bringing in FashionCLIP would silently shift the geometry, and every downstream constant (0.95 dedup, 0.90 cluster-merge, logit scale) would change meaning and need recalibrating. Not worth it at this scale. Dedup reuses the embeddings already captured during filtering, so nothing is re-encoded. pHash (Hamming ≤ 8) handles resized/recompressed duplicates first; CLIP cosine catches semantic ones.
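The two-tier dedup can be illustrated with plain Python. This is a sketch under assumptions: perceptual hashes are modeled as integers, embeddings as short lists, and the greedy first-wins loop is an illustrative choice rather than the project's confirmed strategy. Only the two thresholds come from the text above.

```python
import math

PHASH_MAX_HAMMING = 8    # from the text: Hamming <= 8
CLIP_MIN_COSINE = 0.95   # from the text: 0.95 dedup threshold


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def cosine(u, v) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def dedup(items):
    """items: (image_id, phash, embedding) tuples. Returns kept ids, first wins."""
    kept = []
    for image_id, phash, emb in items:
        # The cheap hash check runs first; cosine is only evaluated if it misses.
        is_dup = any(
            hamming(phash, k_hash) <= PHASH_MAX_HAMMING
            or cosine(emb, k_emb) >= CLIP_MIN_COSINE
            for _, k_hash, k_emb in kept
        )
        if not is_dup:
            kept.append((image_id, phash, emb))
    return [image_id for image_id, _, _ in kept]


items = [
    ("a", 0b0, [1.0, 0.0]),
    ("a_resized", 0b111, [0.0, 1.0]),        # 3 differing bits: pHash duplicate
    ("a_recrop", 0xFFFFFFFF, [0.99, 0.05]),  # far in hash space, close in CLIP space
    ("c", 0xFFFF00000000, [0.0, 1.0]),       # distinct on both measures
]
kept_ids = dedup(items)
```

The pairwise loop is quadratic, which is acceptable at the ~150 images per brand mentioned later but would need an index at larger scale.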
Support-capped clustering with post-merge. UMAP → 50D → k-means, silhouette-selected k in [3, 6], k also bounded by N // min_images_per_cluster so you can't get 6 clusters of 2 from 13 images. After k-selection, clusters with cosine-similar centroids in the original CLIP space (≥ 0.90) are merged — UMAP sometimes splits one aesthetic into two separable blobs and silhouette rewards both. Silhouette 0.3–0.5 at ~100 images is normal for a brand with a unified look, not a failure. Confidence gates on it.
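The support cap and the post-merge step are easy to show in isolation. The sketch below leaves out UMAP, k-means, and silhouette scoring; it only computes which k values are eligible and which clusters would merge. Merging transitively via union-find is an assumption, as is returning an empty candidate list when the cap falls below the minimum k.

```python
import math


def candidate_ks(n_images, min_images_per_cluster, k_min=3, k_max=6):
    """k values that silhouette selection may consider, capped by support."""
    support_cap = n_images // min_images_per_cluster
    return list(range(k_min, min(k_max, support_cap) + 1))


def cosine(u, v) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def merge_similar(centroids, threshold=0.90):
    """Return one group label per cluster; similar centroids share a label."""
    parent = list(range(len(centroids)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            if cosine(centroids[i], centroids[j]) >= threshold:
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(centroids))]


# Centroids are taken in the original CLIP space, not the UMAP projection.
labels = merge_similar([[1.0, 0.0], [0.98, 0.1], [0.0, 1.0]])
```

With 13 images and a hypothetical minimum of 4 images per cluster, candidate_ks(13, 4) allows only k = 3, which is the "no 6 clusters of 2" guarantee from the paragraph above.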
Adaptive LLM tiering. Corpus under ~12k chars goes straight to Sonnet. Larger runs a Haiku map step first (per-page summaries, 6 concurrent), then Sonnet reduces. Cluster labels are concurrent Sonnet vision calls (4 workers). Threads not asyncio — the sync Anthropic client is thread-safe, the pipeline is synchronous end-to-end, ThreadPoolExecutor is the right tool.
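The routing logic can be sketched without touching the Anthropic API. In this illustration, summarize and synthesize are hypothetical callables standing in for the Haiku and Sonnet calls; the 12k threshold and the 6 workers come from the text, while joining pages with blank lines is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor

DIRECT_LIMIT_CHARS = 12_000  # "~12k chars" from the text
MAP_WORKERS = 6              # concurrent per-page summaries


def synthesize_voice(pages, summarize, synthesize):
    """pages: list of page texts; summarize/synthesize wrap the LLM calls."""
    corpus = "\n\n".join(pages)
    if len(corpus) < DIRECT_LIMIT_CHARS:
        # Small corpus: skip the map step and synthesize directly.
        return synthesize(corpus)
    with ThreadPoolExecutor(max_workers=MAP_WORKERS) as pool:
        # pool.map preserves input order, so summaries line up with pages.
        summaries = list(pool.map(summarize, pages))
    return synthesize("\n\n".join(summaries))


# Stub callables so the routing can be exercised offline.
def fake_summarize(page: str) -> str:
    return page[:5]


def fake_synthesize(text: str) -> str:
    return f"voice({len(text)})"


small = synthesize_voice(["abc", "def"], fake_summarize, fake_synthesize)
large = synthesize_voice(["x" * 7000, "y" * 7000], fake_summarize, fake_synthesize)
```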
Soft garment voting. Hard argmax discards confidence — a barely-won vote and a certain one look the same. Each image contributes temperature-scaled probability mass across all categories instead; the pipeline sums expected counts. Logit scale is 100.0, matching CLIP's training temperature. Without it, softmax over raw cosines is near-uniform and every styling axis collapses to ~0.5.
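The effect of the logit scale is easiest to see numerically. The sketch below assumes each image has already been scored against one text prompt per category; the category names and similarity values are made up for illustration.

```python
import math

LOGIT_SCALE = 100.0  # from the text


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_counts(similarities, categories, scale=LOGIT_SCALE):
    """similarities: one row per image of cosine scores per category."""
    totals = dict.fromkeys(categories, 0.0)
    for row in similarities:
        probs = softmax([scale * s for s in row])
        for category, p in zip(categories, probs):
            totals[category] += p
    return totals


categories = ["dress", "coat"]
rows = [
    [0.30, 0.20],  # clear win for "dress"
    [0.25, 0.25],  # exact tie: mass is split evenly
]
totals = expected_counts(rows, categories)

# Without the scale, the same clear win is almost a coin flip.
unscaled = softmax([0.30, 0.20])
```

The expected counts always sum to the number of images, so a tie contributes half an image to each category instead of a full vote to an arbitrary winner.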
Pantone matching. Nearest Pantone by ΔE in LAB space, matched against the full library. RGB distance isn't perceptually uniform. 120+ fashion-specific color names reduce Pantone fallback for common colors. Each palette entry also links to the catalog image that most prominently features that color.
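A nearest-color lookup in LAB space might look like the following. The text does not say which ΔE formula is used, so this sketch assumes the simplest one (CIE76, Euclidean distance in LAB) with a D65 white point. The swatch names and RGB values are hypothetical placeholders, not real Pantone data.

```python
import math


def srgb_to_lab(rgb):
    """Convert an 8-bit sRGB triple to CIE LAB (D65 white point)."""

    def linear(c):
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linear(c) for c in rgb)
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b

    def f(t):
        return t ** (1 / 3) if t > 216 / 24389 else (24389 / 27 * t + 16) / 116

    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))


def delta_e76(lab1, lab2) -> float:
    """CIE76 color difference: Euclidean distance in LAB."""
    return math.dist(lab1, lab2)


def nearest_swatch(rgb, library):
    """library: name -> sRGB triple. Returns the perceptually closest name."""
    lab = srgb_to_lab(rgb)
    return min(library, key=lambda name: delta_e76(lab, srgb_to_lab(library[name])))


# Hypothetical reference swatches for illustration only.
library = {
    "swatch-red": (200, 30, 40),
    "swatch-navy": (20, 30, 80),
    "swatch-cream": (240, 230, 210),
}
match = nearest_swatch((210, 40, 50), library)
white = srgb_to_lab((255, 255, 255))
```

For a full library the reference LAB values would be precomputed once rather than converted on every lookup.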
Honest confidence. HIGH / MEDIUM / LOW derive from the signals the run actually produced: silhouette, image count, evidence-quote count. Not proxies like cluster count or corpus length.
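Deriving the level from run signals, and re-deriving it later to check a stored dossier, could be sketched as below. The three signals come from the text; every threshold, the two-of-three rule, and the dossier dictionary layout are placeholders invented for the example, not the project's calibrated values.

```python
def confidence(silhouette: float, image_count: int, quote_count: int) -> str:
    """Map run signals to HIGH / MEDIUM / LOW."""
    # Placeholder thresholds for illustration only.
    checks = [silhouette >= 0.3, image_count >= 50, quote_count >= 5]
    passed = sum(checks)
    if passed == 3:
        return "HIGH"
    if passed == 2:
        return "MEDIUM"
    return "LOW"


def is_consistent(dossier: dict) -> bool:
    """True if the stored level matches what its own signals support."""
    return dossier["confidence"] == confidence(**dossier["signals"])


honest = {
    "confidence": "HIGH",
    "signals": {"silhouette": 0.42, "image_count": 140, "quote_count": 9},
}
overclaiming = {
    "confidence": "HIGH",
    "signals": {"silhouette": 0.12, "image_count": 30, "quote_count": 9},
}
```

Because the level is a pure function of stored signals, a dossier that claims more than its signals support can be flagged offline without re-running the pipeline.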
The pipeline isn't open-loop. "Accuracy" has no gold label in brand strategy, so quality breaks into separately measurable properties:
**Regression gate** (make test, free, no API):
- Cluster determinism: seeded pipeline → byte-identical output
- Cross-seed stability (ARI) and permutation invariance: reported, not hard-gated, because low ARI at small N is an honest finding
- Synthetic palette recovery: known-color images in, known color recovered; black garments survive background removal
- Dedup correctness: the same image at three sizes merges; three distinct images don't
- Calibration consistency: re-derives each stored dossier's confidence from its signals and flags mismatches
**refabric eval** (offline, free):
- Confidence calibration: flags dossiers claiming more than their signals support
- Discriminability: voice-token Jaccard across brands; catches near-boilerplate output
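The discriminability check amounts to pairwise Jaccard similarity over each brand's voice tokens. The sketch below assumes tokens are already extracted per brand; the 0.6 cutoff, the function names, and the sample tokens are hypothetical.

```python
from itertools import combinations


def jaccard(a, b) -> float:
    """Size of the intersection over size of the union of two token sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def flag_boilerplate(voice_tokens, max_overlap=0.6):
    """voice_tokens: brand_id -> tokens. Returns brand pairs that overlap too much."""
    flagged = []
    for x, y in combinations(sorted(voice_tokens), 2):
        score = jaccard(voice_tokens[x], voice_tokens[y])
        if score > max_overlap:
            flagged.append((x, y, round(score, 3)))
    return flagged


tokens = {
    "brand_a": ["minimal", "timeless", "quiet"],
    "brand_b": ["minimal", "timeless", "quiet", "modern"],
    "brand_c": ["loud", "playful"],
}
flagged = flag_boilerplate(tokens)
```

Two brands sharing three of four distinct tokens score 0.75 and get flagged, which is the near-boilerplate case the eval is meant to catch.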
At current scale (~150 images/brand):
- CLIP inference runs locally, fast on CPU
- LLM cost scales with site size, not brand count
- Content-addressed storage deduplicates for free
- Brand runs share no state; wrapping Pipeline.run() in a process pool is full horizontal scale-out
What breaks first: at 2000+ images, CPU CLIP becomes slow (fix: GPU device check + larger batches, the batch loop is already in clip_filter.py). Parallel brand runs can race on the brand-level content store (fix: per-run storage paths). Single-process orchestration doesn't scale to 100+ brands (fix: job queue with workers running the same Pipeline code unchanged).
Production sketch:
brands queue → worker pool → object store (S3/GCS, same SHA256 keys)
                           → metadata store (Postgres)
                           → eval gate (CI, frozen fixtures)
The pipeline code, Stage protocol, and BrandContext transfer unchanged.
- Crawler identifies itself: User-Agent: RefabricBrandAgent/1.0, rate-limited to 2 req/s per host
- Sitemap-first discovery, BFS fallback, same-origin links only
- No IP rotation, no browser spoofing, no CAPTCHA solving. Bot-protection pages are detected, logged to ctx.crawl_blocked_by, and surfaced on the methodology page
- Instagram: public Open Graph metadata only (profile image, bio, follower count). No login, no gallery scraping. A login wall is documented as blocked, not retried
runs/{brand_id}/{timestamp}/
├── manifest.json # stage timings, image counts, failures
├── images/ # content-addressed image files
├── metadata/ # per-image metadata JSON
├── analysis/
│ ├── clusters.json # cluster assignments and metadata
│ ├── embeddings.npy # OpenCLIP embeddings (persisted for PDF re-render)
│ └── embedding_index.json # path list matching embeddings.npy rows
├── dossier.json # structured Brand DNA
└── dossier.pdf # human-readable report
Embeddings are persisted after VisionStage so that render-pdf can re-select the hero image and layout images without re-running the pipeline.
docker build -t refabric .
docker run -e ANTHROPIC_API_KEY=your-key refabric run cos

pytest tests/ -v

~85 tests: config validation, URL filtering, dedup logic (pHash + CLIP), color extraction and Pantone matching, clustering (including cluster merging), CLIP filter, garment/pattern/fabric analysis, PDF generation, pipeline failure isolation, and the eval regression gate.