A fully local AI companion that runs entirely on-device via Ollama.
Accepts text, images, or both. Semantically routes each request to the right local model using cosine similarity.
- Ollama — https://ollama.ai. Install it and make sure `ollama serve` is running.
- Python 3.10+
Pull the set that matches your RAM. The app auto-detects RAM at startup and selects the correct variants automatically — you just need to have them pulled.
High tier (16 GB+):
```
ollama pull llama3.2:3b
ollama pull llama3.1:8b-instruct
ollama pull qwen2.5-coder:7b-instruct
ollama pull llama3.2-vision:11b
```

Medium tier (12 GB):
```
ollama pull llama3.2:3b
ollama pull llama3.1:8b-instruct-q4_K_S
ollama pull qwen2.5-coder:7b-instruct-q4_K_S
ollama pull llama3.2-vision:11b-instruct-q4_K_S
```

Low tier (8 GB):
```
ollama pull llama3.2:3b
ollama pull llama3.1:8b-instruct-q3_K_S
ollama pull qwen2.5-coder:7b-instruct-q3_K_S
ollama pull llava:7b
```

Then install and run:
```
cd local-multi-model
pip install -r requirements.txt
OLLAMA_KEEP_ALIVE=2m uvicorn backend.main:app --reload
```

Windows (PowerShell):
```
$env:OLLAMA_KEEP_ALIVE="2m"; uvicorn backend.main:app --reload
```

These four queries demonstrate that routing is driven by semantic similarity, not keyword matching. Upload any image for the vision test.
| Input | Expected route | Expected model |
|---|---|---|
| `is this urgent? meeting at 3pm tomorrow` | simple_task | llama3.2:3b |
| `analyze the pros and cons of microservices` | complex_reasoning | llama3.1:8b |
| `write a python function to reverse a linked list` | code | qwen2.5-coder:7b-instruct |
| upload screenshot, type `what does this UI show?` | vision+complex_reasoning | llava → llama3.1:8b |
The routing badge on every response shows: `model — route (confidence score)`.
```
local-multi-model/
├── backend/
│   ├── main.py               FastAPI app, /chat endpoint, SQLite persistence
│   ├── router.py             Semantic routing via cosine similarity
│   ├── models.py             Raw async httpx calls to Ollama API
│   ├── image_understand.py   llava image-to-description pipeline
│   ├── hardware.py           RAM detection, MODEL_CONFIGS, tier selection
│   └── config.py             Route definitions, thresholds, constants
├── frontend/
│   ├── index.html            Single-page chat UI (Tailwind CDN, vanilla JS)
│   └── styles.css            Tailwind overrides
├── data/
│   ├── conversations.db      SQLite — persists across restarts
│   └── uploads/              Saved uploaded images
└── requirements.txt
```
```
Text only   → semantic router → simple / complex / code model
Image only  → llava (description) → complex model [no routing]
Text+image  → llava (description) + user text merged
              → semantic router
              → simple / complex / code model
```
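The three input shapes above reduce to one decision: what string, if any, goes to the semantic router. A minimal sketch of that decision (the real logic lives in `backend/main.py`; this helper and its name are hypothetical):

```python
from typing import Optional

def build_router_input(text: Optional[str],
                       image_description: Optional[str]) -> Optional[str]:
    """Return the string to route on, or None when routing is skipped.

    Image-only requests skip the router entirely: llava's description
    goes straight to the complex model.
    """
    if text and image_description:
        # Text + image: merge llava's description with the user's text,
        # then route the combined string.
        return f"{image_description}\n\n{text}"
    if text:
        # Text only: route the raw user text.
        return text
    # Image only: no routing.
    return None
```

The merge order (description first, user text second) is an illustrative choice, not something the source specifies.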
Semantic routing: at startup, each route's example utterances are embedded
with FastEmbed (BAAI/bge-small-en-v1.5, ~130 MB, downloaded once and cached).
At query time, the input is embedded and its cosine similarity is computed against
all stored utterance vectors. The route containing the most similar utterance wins
if that similarity is ≥ 0.5; otherwise the router falls back to complex_reasoning.
FastEmbed runs entirely locally with no Ollama dependency for routing — generation models still run via Ollama.
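The max-similarity-with-threshold rule can be sketched in plain Python. This uses toy 2-D vectors in place of FastEmbed's 384-dimensional embeddings; the function name and data layout are illustrative, not the actual `backend/router.py` API:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def route(query_vec, route_vectors, threshold=0.5):
    """Pick the route whose best example utterance is most similar to the query.

    route_vectors maps route name -> list of utterance embeddings
    (computed once at startup in the real app).
    """
    best_route, best_score = None, -1.0
    for name, vectors in route_vectors.items():
        score = max(cosine(query_vec, v) for v in vectors)
        if score > best_score:
            best_route, best_score = name, score
    if best_score < threshold:
        return "complex_reasoning", best_score  # below-threshold fallback
    return best_route, best_score
```

Taking the max over each route's utterances (rather than the mean) means a single close example is enough to win the route, which suits short, varied example sets.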
| Tier | RAM | Models |
|---|---|---|
| high | 16 GB+ | full-precision variants |
| medium | 12 GB | q4_K_S quantised variants |
| low | 8 GB | q3_K_S quantised variants |
nomic-embed-text (274 MB) and llama3.2:3b (simple tasks, 2 GB) are the same across all tiers.
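The tier selection reduces to a RAM-threshold lookup. A sketch with the boundaries taken from the table above (the actual detection and function names live in `backend/hardware.py` and may differ):

```python
def select_tier(total_ram_gb: float) -> str:
    """Map detected system RAM to a model tier (thresholds from the tier table)."""
    if total_ram_gb >= 16:
        return "high"    # full-precision variants
    if total_ram_gb >= 12:
        return "medium"  # q4_K_S quantised variants
    return "low"         # q3_K_S quantised variants
```

In the real app the RAM figure is detected once at startup; only the pulled model variants need to match the resulting tier.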
- `OLLAMA_KEEP_ALIVE=2m` — Ollama unloads idle models after 2 minutes.
- Models are loaded on demand, never pre-loaded at startup.
- Requests are processed sequentially, so at most one generation model is in memory at a time alongside nomic-embed-text.
- Peak usage on 8 GB: ~274 MB (embed) + ~3.8 GB (q3_K_S text model) ≈ 4.1 GB, leaving headroom for the OS and browser.
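The "at most one generation model in memory" property follows from serialising generation calls. One way to sketch that in an async FastAPI backend is a single global lock around the Ollama call (the helper below is illustrative, not the actual `backend/models.py` code):

```python
import asyncio

# One global lock: concurrent requests queue here, so only one
# generation call (and thus one loaded generation model) is
# active at a time.
_generation_lock = asyncio.Lock()

async def generate(model: str, prompt: str) -> str:
    async with _generation_lock:
        # Placeholder for the real async httpx POST to Ollama's
        # /api/generate endpoint.
        await asyncio.sleep(0)
        return f"[{model}] response to: {prompt}"
```

Combined with the 2-minute keep-alive, this keeps worst-case memory bounded by the embed model plus the single largest generation model.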
Every message is persisted in `data/conversations.db` with:
- message text, image path, response text
- model used, route detected, confidence score
- latency in seconds, timestamp, hardware tier at time of request
The database survives restarts. Query it via `GET /history?limit=50`.
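A minimal sketch of that persistence layer with the stdlib `sqlite3` module — the table and column names here are assumptions matching the fields listed above, not necessarily the schema `backend/main.py` uses:

```python
import sqlite3
import time

def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the conversations table if needed."""
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS conversations (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            message TEXT, image_path TEXT, response TEXT,
            model TEXT, route TEXT, confidence REAL,
            latency_s REAL, timestamp REAL, hardware_tier TEXT
        )""")
    return conn

def log_message(conn, message, response, model, route, confidence,
                latency_s, tier, image_path=None):
    """Persist one request/response pair with its routing metadata."""
    conn.execute(
        "INSERT INTO conversations (message, image_path, response, model, "
        "route, confidence, latency_s, timestamp, hardware_tier) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (message, image_path, response, model, route, confidence,
         latency_s, time.time(), tier))
    conn.commit()

def history(conn, limit=50):
    """Most recent messages first, as served by GET /history."""
    return conn.execute(
        "SELECT message, model, route, confidence FROM conversations "
        "ORDER BY id DESC LIMIT ?", (limit,)).fetchall()
```

Using a real file path instead of `:memory:` is what makes the history survive restarts.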