A fully local AI companion that runs entirely on-device via Ollama.
Accepts text, images, or both. Semantically routes each request to the right local model using cosine similarity.
- Ollama — https://ollama.ai. Install it and make sure `ollama serve` is running.
- Python 3.10+
Pull the set that matches your RAM. The app auto-detects RAM at startup and selects the correct variants automatically — you just need to have them pulled.
High tier (16 GB+):
```
ollama pull llama3.2:3b
ollama pull llama3.1:8b-instruct
ollama pull qwen2.5-coder:7b-instruct
ollama pull llama3.2-vision:11b
```

Medium tier (12 GB):
```
ollama pull llama3.2:3b
ollama pull llama3.1:8b-instruct-q4_K_S
ollama pull qwen2.5-coder:7b-instruct-q4_K_S
ollama pull llama3.2-vision:11b-instruct-q4_K_S
```

Low tier (8 GB):
```
ollama pull llama3.2:3b
ollama pull llama3.1:8b-instruct-q3_K_S
ollama pull qwen2.5-coder:7b-instruct-q3_K_S
ollama pull llava:7b
```

Then install and run:
```
cd local-multi-model
pip install -r requirements.txt
OLLAMA_KEEP_ALIVE=2m uvicorn backend.main:app --reload
```

Windows (PowerShell):
```
$env:OLLAMA_KEEP_ALIVE="2m"; uvicorn backend.main:app --reload
```

These four queries demonstrate that routing is driven by semantic similarity, not keyword matching. Upload any image for the vision test.
| Input | Expected route | Expected model |
|---|---|---|
| `is this urgent? meeting at 3pm tomorrow` | simple_task | llama3.2:3b |
| `analyze the pros and cons of microservices` | complex_reasoning | llama3.1:8b |
| `write a python function to reverse a linked list` | code | qwen2.5-coder:7b-instruct |
| upload screenshot, type `what does this UI show?` | vision+complex_reasoning | llava → llama3.1:8b |
The routing badge on every response shows: `model — route (confidence score)`.
```
local-multi-model/
├── backend/
│   ├── main.py               FastAPI app, /chat endpoint, SQLite persistence
│   ├── router.py             Semantic routing via cosine similarity
│   ├── models.py             Raw async httpx calls to Ollama API
│   ├── image_understand.py   llava image-to-description pipeline
│   ├── hardware.py           RAM detection, MODEL_CONFIGS, tier selection
│   └── config.py             Route definitions, thresholds, constants
├── frontend/
│   ├── index.html            Single-page chat UI (Tailwind CDN, vanilla JS)
│   └── styles.css            Tailwind overrides
├── data/
│   ├── conversations.db      SQLite — persists across restarts
│   └── uploads/              Saved uploaded images
└── requirements.txt
```
```
Text only   → semantic router → simple / complex / code model
Image only  → llava (description) → complex model [no routing]
Text+image  → llava (description) + user text merged
              → semantic router
              → simple / complex / code model
```
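The three input shapes above reduce to one decision: what string, if any, goes to the semantic router. A minimal sketch of that decision (the real logic lives in `backend/main.py`; this helper and its name are hypothetical):

```python
from typing import Optional

def build_router_input(text: Optional[str],
                       image_description: Optional[str]) -> Optional[str]:
    """Return the string to route on, or None when routing is skipped.

    Image-only requests skip the router entirely: llava's description
    goes straight to the complex model.
    """
    if text and image_description:
        # Text + image: merge llava's description with the user's text,
        # then route the combined string.
        return f"{image_description}\n\n{text}"
    if text:
        # Text only: route the raw user text.
        return text
    # Image only: no routing.
    return None
```

The merge order (description first, user text second) is an illustrative choice, not something the source specifies.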
Semantic routing: at startup, each route's example utterances are embedded
with FastEmbed (BAAI/bge-small-en-v1.5, ~130 MB, downloaded once and cached).
At query time, the input is embedded and its cosine similarity is computed against
all stored utterance vectors. The route containing the most similar utterance wins
if that similarity is ≥ 0.5; otherwise the router falls back to complex_reasoning.
FastEmbed runs entirely locally with no Ollama dependency for routing — generation models still run via Ollama.
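The max-similarity-with-threshold rule can be sketched in plain Python. This uses toy 2-D vectors in place of FastEmbed's 384-dimensional embeddings; the function name and data layout are illustrative, not the actual `backend/router.py` API:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def route(query_vec, route_vectors, threshold=0.5):
    """Pick the route whose best example utterance is most similar to the query.

    route_vectors maps route name -> list of utterance embeddings
    (computed once at startup in the real app).
    """
    best_route, best_score = None, -1.0
    for name, vectors in route_vectors.items():
        score = max(cosine(query_vec, v) for v in vectors)
        if score > best_score:
            best_route, best_score = name, score
    if best_score < threshold:
        return "complex_reasoning", best_score  # below-threshold fallback
    return best_route, best_score
```

Taking the max over each route's utterances (rather than the mean) means a single close example is enough to win the route, which suits short, varied example sets.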
| Tier | RAM | Models |
|---|---|---|
| high | 16 GB+ | full-precision variants |
| medium | 12 GB | q4_K_S quantised variants |
| low | 8 GB | q3_K_S quantised variants |
nomic-embed-text (274 MB) and llama3.2:3b (simple tasks, 2 GB) are the same across all tiers.
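The tier selection reduces to a RAM-threshold lookup. A sketch with the boundaries taken from the table above (the actual detection and function names live in `backend/hardware.py` and may differ):

```python
def select_tier(total_ram_gb: float) -> str:
    """Map detected system RAM to a model tier (thresholds from the tier table)."""
    if total_ram_gb >= 16:
        return "high"    # full-precision variants
    if total_ram_gb >= 12:
        return "medium"  # q4_K_S quantised variants
    return "low"         # q3_K_S quantised variants
```

In the real app the RAM figure is detected once at startup; only the pulled model variants need to match the resulting tier.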
- `OLLAMA_KEEP_ALIVE=2m` — Ollama unloads idle models after 2 minutes.
- Models are loaded on demand, never pre-loaded at startup.
- Requests are processed sequentially, so at most one generation model is in memory at a time alongside nomic-embed-text.
- Peak usage on 8 GB: ~274 MB (embed) + ~3.8 GB (q3_K_S text model) ≈ 4.1 GB, leaving headroom for the OS and browser.
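The "at most one generation model in memory" property follows from serialising generation calls. One way to sketch that in an async FastAPI backend is a single global lock around the Ollama call (the helper below is illustrative, not the actual `backend/models.py` code):

```python
import asyncio

# One global lock: concurrent requests queue here, so only one
# generation call (and thus one loaded generation model) is
# active at a time.
_generation_lock = asyncio.Lock()

async def generate(model: str, prompt: str) -> str:
    async with _generation_lock:
        # Placeholder for the real async httpx POST to Ollama's
        # /api/generate endpoint.
        await asyncio.sleep(0)
        return f"[{model}] response to: {prompt}"
```

Combined with the 2-minute keep-alive, this keeps worst-case memory bounded by the embed model plus the single largest generation model.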
Every message is persisted in `data/conversations.db` with:
- message text, image path, response text
- model used, route detected, confidence score
- latency in seconds, timestamp, hardware tier at time of request
The database survives restarts. Query it via `GET /history?limit=50`.
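A minimal sketch of that persistence layer with the stdlib `sqlite3` module — the table and column names here are assumptions matching the fields listed above, not necessarily the schema `backend/main.py` uses:

```python
import sqlite3
import time

def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the conversations table if needed."""
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS conversations (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            message TEXT, image_path TEXT, response TEXT,
            model TEXT, route TEXT, confidence REAL,
            latency_s REAL, timestamp REAL, hardware_tier TEXT
        )""")
    return conn

def log_message(conn, message, response, model, route, confidence,
                latency_s, tier, image_path=None):
    """Persist one request/response pair with its routing metadata."""
    conn.execute(
        "INSERT INTO conversations (message, image_path, response, model, "
        "route, confidence, latency_s, timestamp, hardware_tier) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (message, image_path, response, model, route, confidence,
         latency_s, time.time(), tier))
    conn.commit()

def history(conn, limit=50):
    """Most recent messages first, as served by GET /history."""
    return conn.execute(
        "SELECT message, model, route, confidence FROM conversations "
        "ORDER BY id DESC LIMIT ?", (limit,)).fetchall()
```

Using a real file path instead of `:memory:` is what makes the history survive restarts.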