
Oshyn

A fully local AI companion that runs entirely on-device via Ollama.

Accepts text, images, or both. Semantically routes each request to the right local model using cosine similarity.


Prerequisites

  1. Ollama (https://ollama.ai)
     Install it and make sure ollama serve is running.

  2. Python 3.10+


Pull models

Pull the set that matches your RAM. The app detects available RAM at startup and selects the correct variants automatically; you just need to have them pulled.

16 GB+ (high tier)

ollama pull llama3.2:3b
ollama pull llama3.1:8b-instruct
ollama pull qwen2.5-coder:7b-instruct
ollama pull llama3.2-vision:11b

12 GB (medium tier)

ollama pull llama3.2:3b
ollama pull llama3.1:8b-instruct-q4_K_S
ollama pull qwen2.5-coder:7b-instruct-q4_K_S
ollama pull llama3.2-vision:11b-instruct-q4_K_S

8 GB (low tier)

ollama pull llama3.2:3b
ollama pull llama3.1:8b-instruct-q3_K_S
ollama pull qwen2.5-coder:7b-instruct-q3_K_S
ollama pull llava:7b

Install Python dependencies

cd local-multi-model
pip install -r requirements.txt

Run

OLLAMA_KEEP_ALIVE=2m uvicorn backend.main:app --reload

Windows (PowerShell):

$env:OLLAMA_KEEP_ALIVE="2m"; uvicorn backend.main:app --reload

Open http://localhost:8000


Test routing

These four queries demonstrate that routing is driven by semantic similarity, not keyword matching. Upload any image for the vision test.

Input                                               | Expected route           | Expected model
is this urgent? meeting at 3pm tomorrow             | simple_task              | llama3.2:3b
analyze the pros and cons of microservices          | complex_reasoning        | llama3.1:8b
write a python function to reverse a linked list    | code                     | qwen2.5-coder:7b-instruct
upload a screenshot, type "what does this UI show?" | vision+complex_reasoning | llava → llama3.1:8b

The routing badge on every response shows: model — route (confidence score).
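The test queries above can also be sent programmatically. This is a minimal sketch using the standard library; the request payload field ("message") and the response fields ("route", "model") are assumptions about the /chat API, not taken from the actual backend.

```python
# Sketch: exercise the text routing test cases against the local /chat
# endpoint. Payload and response field names are assumptions.
import json
import urllib.request

TEST_CASES = [
    ("is this urgent? meeting at 3pm tomorrow", "simple_task"),
    ("analyze the pros and cons of microservices", "complex_reasoning"),
    ("write a python function to reverse a linked list", "code"),
]

def build_request(message: str) -> urllib.request.Request:
    """Build a POST to the local /chat endpoint (no network happens here)."""
    body = json.dumps({"message": message}).encode()
    return urllib.request.Request(
        "http://localhost:8000/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )

def run_tests():
    """Send each case and print the route the server chose.
    Requires the app to be running on localhost:8000."""
    for message, expected_route in TEST_CASES:
        with urllib.request.urlopen(build_request(message)) as resp:
            result = json.load(resp)
        print(expected_route, "->", result.get("route"), result.get("model"))
```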


Architecture

local-multi-model/
├── backend/
│   ├── main.py              FastAPI app, /chat endpoint, SQLite persistence
│   ├── router.py            Semantic routing via cosine similarity
│   ├── models.py            Raw async httpx calls to Ollama API
│   ├── image_understand.py  llava image-to-description pipeline
│   ├── hardware.py          RAM detection, MODEL_CONFIGS, tier selection
│   └── config.py            Route definitions, thresholds, constants
├── frontend/
│   ├── index.html           Single-page chat UI (Tailwind CDN, vanilla JS)
│   └── styles.css           Tailwind overrides
├── data/
│   ├── conversations.db     SQLite — persists across restarts
│   └── uploads/             Saved uploaded images
└── requirements.txt

Routing flow

Text only    →  semantic router  →  simple / complex / code model

Image only   →  llava (description)  →  complex model  [no routing]

Text+image   →  llava (description) + user text merged
             →  semantic router
             →  simple / complex / code model

Semantic routing: at startup, each route's example utterances are embedded with FastEmbed (BAAI/bge-small-en-v1.5, ~130 MB, downloaded once and cached). At query time, the input is embedded and cosine similarity is computed against all stored utterance vectors. The route with the highest maximum similarity wins if that similarity is ≥ 0.5; otherwise the router falls back to complex_reasoning.

FastEmbed runs entirely locally with no Ollama dependency for routing — generation models still run via Ollama.
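The max-similarity decision above can be sketched in a few lines. In the real router the vectors come from FastEmbed; here they are plain lists so the logic is runnable without the model download, and the route names and threshold follow the description above.

```python
# Sketch of the max-similarity routing decision. Vectors stand in for
# FastEmbed (BAAI/bge-small-en-v1.5) embeddings.
import math

THRESHOLD = 0.5  # below this, fall back to complex_reasoning

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def route(query_vec, route_utterance_vecs):
    """route_utterance_vecs maps route name -> list of utterance vectors.
    The route whose best-matching example utterance is most similar to
    the query wins; low-confidence queries fall back to complex_reasoning."""
    best_route, best_sim = "complex_reasoning", 0.0
    for name, vecs in route_utterance_vecs.items():
        sim = max(cosine(query_vec, v) for v in vecs)
        if sim > best_sim:
            best_route, best_sim = name, sim
    if best_sim < THRESHOLD:
        return "complex_reasoning", best_sim
    return best_route, best_sim
```

Taking the maximum similarity per route (rather than the mean) lets a single close example utterance win the route, which matches the "highest max-similarity" rule described above.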

Hardware tiers

Tier   | RAM    | Models
high   | 16 GB+ | full-precision variants
medium | 12 GB  | q4_K_S quantised variants
low    | 8 GB   | q3_K_S quantised variants

nomic-embed-text (274 MB) and llama3.2:3b (simple tasks, 2 GB) are the same across all tiers.
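The tier mapping above reduces to a small pure function. This is a sketch only: the real logic lives in backend/hardware.py, and the exact thresholds and detection mechanism (psutil, /proc/meminfo, etc.) are assumptions here.

```python
# Sketch of tier selection from detected RAM, mirroring the table above.
# Thresholds are assumed from the tier labels, not read from hardware.py.

def select_tier(total_ram_gb: float) -> str:
    """Map detected system RAM to a model tier."""
    if total_ram_gb >= 16:
        return "high"    # full-precision variants
    if total_ram_gb >= 12:
        return "medium"  # q4_K_S quantised variants
    return "low"         # q3_K_S quantised variants
```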

Memory management

  • OLLAMA_KEEP_ALIVE=2m — Ollama unloads idle models after 2 minutes.
  • Models are loaded on demand; never pre-loaded at startup.
  • Requests are processed sequentially, so at most one generation model is in memory at a time alongside nomic-embed-text.
  • Peak usage on 8 GB: ~274 MB (embed) + ~3.8 GB (q3_K_S text model) ≈ 4.1 GB, leaving headroom for OS and browser.
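keep_alive can also be set per request in the body sent to Ollama's /api/generate, which is the endpoint a raw HTTP client like the one in backend/models.py would call. keep_alive is a real Ollama API field; the helper name and defaults in this sketch are illustrative, not taken from the actual code.

```python
# Sketch: JSON body for POST http://localhost:11434/api/generate with a
# per-request keep_alive. Helper name and defaults are illustrative.

def generate_body(model: str, prompt: str, keep_alive: str = "2m") -> dict:
    """keep_alive controls how long Ollama keeps the model loaded after
    this request ("2m" matches OLLAMA_KEEP_ALIVE above; "0" unloads
    immediately)."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,          # return one JSON object, not a stream
        "keep_alive": keep_alive,
    }
```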

Conversation history

Every message is persisted in data/conversations.db with:

  • message text, image path, response text
  • model used, route detected, confidence score
  • latency in seconds, timestamp, hardware tier at time of request

Survives restarts. Query via GET /history?limit=50.
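The record described above maps naturally onto a single SQLite table. This is a sketch with sqlite3 from the standard library; the table and column names are assumptions, since the real schema lives in backend/main.py.

```python
# Sketch of the per-message record persisted to data/conversations.db.
# Table and column names are assumptions, not the actual schema.
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS conversations (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    message     TEXT NOT NULL,
    image_path  TEXT,
    response    TEXT,
    model       TEXT,
    route       TEXT,
    confidence  REAL,
    latency_s   REAL,
    tier        TEXT,
    created_at  TEXT DEFAULT CURRENT_TIMESTAMP
);
"""

def save_message(conn: sqlite3.Connection, **fields) -> int:
    """Insert one chat turn and return its row id."""
    cols = ", ".join(fields)
    marks = ", ".join("?" for _ in fields)
    cur = conn.execute(
        f"INSERT INTO conversations ({cols}) VALUES ({marks})",
        tuple(fields.values()),
    )
    conn.commit()
    return cur.lastrowid

def history(conn: sqlite3.Connection, limit: int = 50):
    """Roughly what GET /history?limit=50 would read back."""
    return conn.execute(
        "SELECT message, model, route, confidence FROM conversations "
        "ORDER BY id DESC LIMIT ?",
        (limit,),
    ).fetchall()
```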
