Build an LLM agent the way real teams ship them: as seven composable layers, each with a clear job, a clear contract, and a clear failure mode. Every request flows through the same pipeline and emits a per-layer trace so you can debug production in minutes, not hours.
Start learning at learnwithparam.com. Regional pricing available with discounts of up to 60%.
- Design an agent as a graph of small, testable layers instead of one giant function
- Add transport validation and rate limiting before anything reaches your LLM
- Route intents with a tiny state machine and branch to tools, retrieval, or direct reply
- Run tools safely with whitelists and argument parsing
- Keep per-thread memory with sane eviction limits
- Ground answers with ChromaDB semantic search when retrieval beats generation
- Enforce guardrails on both input and output (PII scrub, length caps)
- Emit structured traces and logs so every request is observable out of the box
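
To make the "whitelists and argument parsing" point concrete, here is a minimal sketch of a safe tool registry. It is not the course's implementation: `TOOL_REGISTRY`, `run_tool`, the 100-character cap, and the choice of supported operators are all assumptions made for illustration. The calculator parses the expression with Python's `ast` module and evaluates only numeric literals and a fixed set of arithmetic operators, so it never calls `eval()`.

```python
import ast
import operator
from datetime import datetime, timezone

# Arithmetic operators the calculator accepts; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _eval_node(node: ast.AST) -> float:
    """Walk the parsed expression, allowing only numbers and whitelisted operators."""
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        return _BINARY_OPS[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError("unsupported expression")


def calculator(expression: str) -> float:
    """Evaluate basic arithmetic without calling eval()."""
    if len(expression) > 100:  # assumed cap, purely illustrative
        raise ValueError("expression too long")
    try:
        tree = ast.parse(expression, mode="eval")
    except SyntaxError as exc:
        raise ValueError("unsupported expression") from exc
    return _eval_node(tree.body)


def get_time() -> str:
    """Return the current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


# The whitelist: only names listed here can ever be dispatched.
TOOL_REGISTRY = {"get_time": get_time, "calculator": calculator}


def run_tool(name: str, **kwargs):
    """Dispatch a tool call by name, refusing anything outside the registry."""
    if name not in TOOL_REGISTRY:
        raise KeyError(f"unknown tool: {name}")
    return TOOL_REGISTRY[name](**kwargs)
```

With this shape, `run_tool("calculator", expression="2 + 3 * 4")` evaluates the arithmetic, while an expression such as `__import__('os')` is rejected with a `ValueError` because function calls are not in the whitelist.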
- Transport - validates the request, enforces size and basic rate limits
- Orchestrator - a small state machine that picks `tool_use`, `retrieve`, or `reply`
- Tools - `get_time` and `calculator`, dispatched through a safe registry
- Memory - per-thread bounded conversation history
- Retrieval - ChromaDB semantic search over a seed knowledge base (degrades gracefully when disabled)
- Guardrails - PII scrubbing and length checks on every input and output
- Observability - per-request trace with per-layer timings plus a structlog event
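
The layering above can be sketched as a list of small functions that share one state dictionary, with a runner that times each layer. This is only a rough illustration under stated assumptions: the `Layer` signature, the `run_pipeline` helper, the 2000-character cap, and the keyword-based routing are hypothetical stand-ins, not the project's actual code. Only two of the seven layers are shown.

```python
import time
from typing import Callable

# Hypothetical layer signature: each layer takes the shared state dict and returns it.
Layer = Callable[[dict], dict]


def transport(state: dict) -> dict:
    """Reject empty or oversized messages before any other layer runs."""
    message = state["message"]
    if not message.strip():
        raise ValueError("message must not be empty")
    if len(message) > 2000:  # assumed size cap
        raise ValueError("message too long")
    return state


def orchestrator(state: dict) -> dict:
    """Pick one of the three routes; the keyword rules are placeholders."""
    text = state["message"].lower()
    if "time" in text or any(ch.isdigit() for ch in text):
        state["route"] = "tool_use"
    elif text.rstrip().endswith("?"):
        state["route"] = "retrieve"
    else:
        state["route"] = "reply"
    return state


def run_pipeline(state: dict, layers: list[tuple[str, Layer]]) -> dict:
    """Run each layer in order and record how long it took."""
    trace = []
    for name, layer in layers:
        started = time.perf_counter()
        state = layer(state)
        elapsed_ms = (time.perf_counter() - started) * 1000
        trace.append({"layer": name, "ms": round(elapsed_ms, 3)})
    state["trace"] = trace
    return state


result = run_pipeline(
    {"message": "What is the refund policy?", "thread_id": "t1"},
    [("transport", transport), ("orchestrator", orchestrator)],
)
print(result["route"], [entry["layer"] for entry in result["trace"]])
```

Because every layer has the same signature, each one can be unit-tested in isolation, and the trace shows which layer was slow or raised without extra instrumentation.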
- FastAPI - async Python web framework
- Pydantic - request and response validation
- ChromaDB + Sentence Transformers - embedded vector store and local embeddings
- structlog - structured logs for every request
- LLM Provider Pattern - supports OpenRouter, Fireworks, Gemini, OpenAI
- Docker - containerized development
- Python 3.11+
- uv (installed automatically by `make setup`)
- An API key from any supported LLM provider
make dev
# Or step by step:
make setup
# edit .env and add your API key
make run

# Or run in Docker:
make build
make up
make logs
make down

Once running, open http://localhost:8000/docs for the interactive Swagger UI.
Primary endpoints:
- `GET /production-agent/health` - liveness check plus the list of layers
- `POST /production-agent/chat` - body `{ "message": "...", "thread_id": "t1" }`
- `GET /production-agent/trace/{thread_id}` - the most recent per-layer trace for a thread
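
As a rough sketch of calling the chat endpoint from Python, the snippet below builds the request with the standard library only. The base URL and request body come from this README; the helper names `build_chat_request` and `send` are made up for illustration, and the response is assumed to be JSON since its schema is not documented here.

```python
import json
import urllib.request

# Base URL taken from the endpoints listed above; adjust if your port differs.
BASE_URL = "http://localhost:8000/production-agent"


def build_chat_request(message: str, thread_id: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat endpoint."""
    payload = json.dumps({"message": message, "thread_id": thread_id}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/chat",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the body, assuming the server replies with JSON."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage, with the server running locally:
#   reply = send(build_chat_request("What time is it?", "t1"))
```

Reusing the same `thread_id` across calls is what lets the memory layer and the trace endpoint tie requests together.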
Work through these incrementally to build the full system:
- The Transport Layer - Validate inputs and add a tiny rate limiter
- The Orchestrator - Route between tool use, retrieval, and direct reply
- Tools with Guardrails - Register `get_time` and a safe `calculator`
- Thread Memory - Store bounded per-thread conversation history
- Retrieval Layer - Seed a ChromaDB collection and wire up semantic search
- Input and Output Guardrails - Scrub emails, phone numbers, and SSN-like tokens
- Observability - Build the per-request trace and expose `GET /trace/{thread_id}`
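
The guardrails module can be pictured as a pair of small text filters. The sketch below is one possible shape, not the course's code: the regular expressions, the placeholder tokens, `MAX_CHARS`, and the function names are all assumptions, and the patterns are deliberately simple (US-style numbers only), so they will miss many real-world formats.

```python
import re

# Order matters: the SSN pattern runs before the looser phone pattern,
# which would otherwise swallow SSN-like tokens.
_PII_PATTERNS = [
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d\b"), "[PHONE]"),
]

MAX_CHARS = 2000  # assumed length cap


def scrub(text: str) -> str:
    """Replace emails, SSN-like tokens, and phone-like numbers with placeholders."""
    for pattern, placeholder in _PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


def apply_guardrails(text: str) -> str:
    """Scrub PII first, then enforce the length cap."""
    return scrub(text)[:MAX_CHARS]


print(scrub("Mail jane@example.com or call +1 (555) 123-4567, SSN 123-45-6789."))
# prints: Mail [EMAIL] or call [PHONE], SSN [SSN].
```

Running the same function on both the user's input and the model's output keeps the two guardrail passes consistent.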
make help     Show all available commands
make setup    Initial setup (create .env, install deps)
make dev      Setup and run (one command!)
make run      Start FastAPI server
make build    Build Docker image
make up       Start container
make down     Stop container
make clean    Remove venv and cache
- Start the course: learnwithparam.com/courses/layered-production-ai-architecture
- AI Bootcamp for Software Engineers: learnwithparam.com/ai-bootcamp
- All courses: learnwithparam.com/courses