A RAG (Retrieval-Augmented Generation) API for Stanford Research Computing Center (SRCC) documentation. Answers questions about the Sherlock, Farmshare, Oak, Elm, Carina, and Nero clusters using documentation scraped from their GitHub repos and websites.
Runs on ada-lovelace.stanford.edu — 96 Ampere-1a cores, 378 GB RAM, 2x NVIDIA L4 GPUs (22.5 GiB each).
Client -> FastAPI (app/main.py)
+-- RAGService
|-- vLLM (Qwen3-32B-AWQ, tensor_parallel_size=2)
|-- BM25 retriever (per cluster)
|-- FAISS vector index (per cluster, hybrid mode)
| +-- gte-large-en-v1.5 embeddings (CPU)
+-- Semantic response cache (SQLite + sentence-transformers)
+-- Content-aware invalidation via scraper manifests
The model runs with tensor_parallel_size=2, occupying both L4 GPUs in a single vLLM worker. vLLM's async continuous-batching engine handles all concurrent requests natively — no separate load balancer is needed.
Retrieval pipeline: Each query is run through both BM25 (keyword) and FAISS (semantic) retrievers. Results are merged via Reciprocal Rank Fusion (RRF), with FAISS weighted higher by default. Documents are split on ## header boundaries to preserve section semantics (e.g., "Upcoming Classes" vs "Recent Classes").
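The merge step can be sketched in a few lines. This is an illustration using the documented defaults (`RRF_K=60`, `FAISS_RRF_WEIGHT=2.0`), not the actual implementation in `app/rag_service.py`; the function name `rrf_merge` is made up here:

```python
from collections import defaultdict

def rrf_merge(bm25_ranked, faiss_ranked, k=60, faiss_weight=2.0):
    """Merge two ranked lists of doc ids with weighted Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for rank, doc_id in enumerate(bm25_ranked):
        scores[doc_id] += 1.0 / (k + rank + 1)           # BM25 contribution, weight 1.0
    for rank, doc_id in enumerate(faiss_ranked):
        scores[doc_id] += faiss_weight / (k + rank + 1)  # FAISS weighted higher by default
    return sorted(scores, key=scores.get, reverse=True)

# A doc ranked well by FAISS wins over one ranked well by BM25 only.
merged = rrf_merge(["a", "b", "c"], ["c", "b", "d"])
```

Because each retriever contributes `weight / (k + rank)`, a document only has to rank *near* the top of either list to surface; the constant `k` keeps a single #1 ranking from dominating.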
Cache invalidation: When the scrapers run, they produce content manifests (SHA-256 hashes per file). On the next service restart, the RAG service compares the current manifest against the previous one and evicts only the cache entries whose source documents changed. Stable content stays cached.
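A minimal sketch of the manifest comparison (file names and shortened hashes are illustrative; the real logic lives in the RAG service):

```python
def stale_sources(old_manifest, new_manifest):
    """Files whose SHA-256 changed, plus files that appeared or disappeared.
    Cache entries citing these sources get evicted; everything else stays."""
    changed = {f for f in new_manifest if old_manifest.get(f) != new_manifest[f]}
    removed = set(old_manifest) - set(new_manifest)
    return changed | removed

old = {"gpu.md": "ab12", "slurm.md": "cd34"}                        # previous scrape
new = {"gpu.md": "ab12", "slurm.md": "ee56", "storage.md": "ff78"}  # current scrape
stale = stale_sources(old, new)  # slurm.md changed, storage.md is new; gpu.md stays cached
```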
- Apptainer/Singularity
- Python 3.x with PyYAML on the host (for reading `config.yaml` in shell scripts)
- 2x NVIDIA L4 GPUs (or equivalent with 22+ GiB VRAM each)
- Model files downloaded to the paths set in `config.yaml` (see `setup_upgrade.sh`)
# 1. Download the model and rebuild the container (first time only)
./setup_upgrade.sh
# 2. Fetch and process all documentation (run once, or when docs change)
./filemagic.sh
# 3. Start the API
./main.sh
# Access the API
curl http://ada-lovelace.stanford.edu:8000/health
curl http://ada-lovelace.stanford.edu:8000/docs

./main.sh        # Production mode (default)
./main.sh dev # Dev mode — enables uvicorn --reload
./main.sh multi  # Exits with error (deprecated, see below)

Production mode (./main.sh):
- Stops any existing `chatapi` Apptainer instance and kills stray vLLM processes
- Cleans up orphaned POSIX shared memory blocks in `/dev/shm` left by SIGKILL'd processes
- Reads model path, port, host, and log dir from `config.yaml`
- Validates the model directory exists and contains `config.json`
- Builds `chatbot.sif` from `chatbot.def` if not already built
- Starts the Apptainer instance with GPU support; binds `$PWD -> /workspace` and the parent models directory so both the LLM and the embedding model are accessible
- Launches uvicorn serving `app.main:app` on `0.0.0.0:<port>`
- Model loading takes several minutes; watch for `Application startup complete`
Dev mode (./main.sh dev):

- Same as production but adds `--reload --reload-dir app` to uvicorn
- Each code change triggers a full model reload; use only when needed
Multi mode: Exits immediately with an error. tensor_parallel_size=2 occupies both GPUs in a single worker — a second worker would OOM immediately.
./filemagic.sh

Builds the `file_processing.sif` Apptainer container (if not already present), then runs four steps inside it in sequence:

1. `file_magic.py` — clones GitHub repos and converts their MkDocs documentation to flat `.md` files under `docs/`
2. `scrape_srcc.py` — crawls https://srcc.stanford.edu -> `docs/srcc/`
3. `scrape_static_docs.py` — crawls Carina and Nero documentation sites -> `docs/carina/`, `docs/nero/`
4. `generate_manifests.py` — writes `.content_manifest.json` in each docs subdirectory (SHA-256 hashes per file, used for cache invalidation on next restart)
./setup_upgrade.sh # Run all steps
./setup_upgrade.sh rebuild # Only rebuild container
./setup_upgrade.sh download # Only download model from HuggingFace
./setup_upgrade.sh tune      # Only run interactive BM25 threshold tuning

Steps:

- Rebuild — deletes and rebuilds `chatbot.sif` from `chatbot.def`. Run after changing `requirements.txt` or the container definition.
- Download — downloads `Qwen/Qwen3-32B-AWQ` (~35 GB) via `huggingface-cli` and updates `config.yaml` with the new model path and type.
- Tune — sends a set of test queries to the running API and displays source counts, then offers an interactive prompt to update `MIN_BM25_SCORE` in `config.yaml`. Watch `logs/myapp.log` for the per-document BM25 scores while it runs.
./stop_multi_gpu.sh

Kills load-balancer and worker processes by PID file, then by process name as a fallback. Cleans up `.worker1.pid`, `.worker2.pid`, `.loadbalancer.pid`.
./comprehensive_test.sh
./comprehensive_test.sh --port 8001 --host localhost
./comprehensive_test.sh --no-cache-clear

Runs seven test groups (0-6) against the live API. Requires jq.
| Test | What it checks |
|---|---|
| 0 — Connectivity | /health returns 200 and lists clusters |
| 1 — Basic RAG | /query/ returns a substantive answer with sources |
| 2 — Citation quality | Inline [Title](URL) links present; no raw .md filenames cited |
| 3 — Grounding guard | Off-topic questions refused; no false-positive disclaimers on grounded answers |
| 4 — Cross-cluster isolation | Sherlock and Farmshare queries return distinct answers tagged with correct cluster |
| 5 — Semantic cache | Exact and semantically similar repeats return cached responses in <3 s |
| 6 — Concurrent throughput | 3 parallel requests across clusters all succeed |
Clears the semantic cache before running by default (POST /cache/clear). Results are saved to /tmp/chatbot_test_<timestamp>/ and can be inspected with jq.
./test_cache.sh

Sends four queries (exact match, near-exact, and semantically similar) against localhost:8000 and compares responses. Faster than the full test suite when you just want to verify caching is working.
./gpu.sh

Runs inside the chatapi Apptainer instance. Reports PyTorch version, CUDA version, GPU names and compute capabilities, and benchmarks FP16 matrix multiply on the L4 GPUs. Also checks bitsandbytes availability. Use to verify GPU compute is working before loading the full model.
./diagnose_gen.sh

Loads TinyLlama (1.1B, fast) inside the container and runs five focused tests: single forward pass latency, generation with GPU utilization tracking, device-location verification, model-to-CUDA timing, and PyTorch CUDA settings. Use to distinguish CPU-fallback issues from model/config problems.
| Script | Status |
|---|---|
| start_multi_gpu.sh | Deprecated — exits immediately with an error. Incompatible with tensor_parallel_size=2; use ./main.sh instead. |
| bench_gemma.sh | Legacy — benchmarks Gemma 2 9B via HuggingFace Transformers (not vLLM). No longer relevant. |
| benchmark_llama.sh | Legacy — benchmarks TinyLlama 1.1B. No longer relevant. |
Processes documentation from four Stanford RC GitHub repositories:
| Repo | Output dir |
|---|---|
| stanford-rc/farmshare-docs | docs/farmshare/ |
| stanford-rc/docs.elm.stanford.edu | docs/elm/ |
| stanford-rc/docs.oak.stanford.edu | docs/oak/ |
| stanford-rc/www.sherlock.stanford.edu | docs/sherlock/ |
For each repo:
- Shallow-clones the repo (`git clone --depth 1`)
- Parses `mkdocs.yml` to extract the navigation tree and `site_url`
- Copies each doc to a flat output directory with YAML front matter (`title`, `url`)
- Writes a URL map CSV (`docs/<repo>_url_map.csv`)
- For Sherlock: additionally scrapes specific live pages (facts, tech specs, software list) via HTTP
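The front-matter step above can be sketched as follows; `with_front_matter` is an illustrative helper, not the actual function in `file_magic.py`:

```python
def with_front_matter(title, url, body):
    """Prepend YAML front matter (title, url) to a flattened doc,
    mirroring what file_magic.py writes for each copied page."""
    return f'---\ntitle: "{title}"\nurl: "{url}"\n---\n\n{body}'

doc = with_front_matter(
    "Running Jobs",
    "https://www.sherlock.stanford.edu/docs/running-jobs/",  # example URL
    "# Running Jobs\n...",
)
```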
Environment variables:
| Variable | Default | Description |
|---|---|---|
| REPO_CLONE_DIR | docs | Base directory for cloned repos and output |
| GITHUB_TOKEN | — | Token for authenticated GitHub cloning |
| LOG_FILE | magicFile.log | Log file path |
Two-pass scraper for https://srcc.stanford.edu. Output: docs/srcc/.
Pass 1 — JSON:API fetches structured Drupal content types:
- `stanford_news`, `stanford_person`, `stanford_policy`, `stanford_publication`, `stanford_course`
- Uses cursor-based pagination; requests only needed fields
- Nodes with no body content are deferred to the HTML pass
Pass 2 — HTML crawl picks up stanford_page (Layout Builder) and any pages missed by JSON:API:
- Extracts content from Layout Builder regions; deduplicates by content fingerprint (first 300 chars)
- Strips nav, header, footer, scripts, Drupal placeholders
- Filters non-HTTP links (`mailto:`, `tel:`, `javascript:`), external domains, binary assets, and noise paths (`/user`, `/admin`, `/search`, `/events`, etc.)
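The content-fingerprint dedup from Pass 2 can be sketched like this (the 300-character window comes from the text above; the function name is illustrative):

```python
def dedup_by_fingerprint(pages, n=300):
    """Drop pages whose first n characters of extracted text match an
    already-seen page, so aliased URLs don't produce duplicate docs."""
    seen, unique = set(), []
    for url, text in pages:
        fp = text[:n].strip()
        if fp not in seen:
            seen.add(fp)
            unique.append((url, text))
    return unique

pages = [
    ("/services/hpc", "High performance computing at Stanford..."),
    ("/services/hpc?print=1", "High performance computing at Stanford..."),  # alias
    ("/support", "Getting help with SRCC systems..."),
]
unique = dedup_by_fingerprint(pages)
```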
Environment variables:
| Variable | Default | Description |
|---|---|---|
| SRCC_OUTPUT_DIR | docs/srcc | Output directory |
| SRCC_MAX_PAGES | 500 | Maximum pages for the HTML crawl pass |
| LOG_FILE | logs/scrapers.log | Log file path |
Generic scraper for Stanford static documentation sites (Jekyll/similar, sharing the common <main id="page-content"> template).
Currently configured for:
| Site | Output dir |
|---|---|
| https://docs.carina.stanford.edu | docs/carina/ |
| https://nero-docs.stanford.edu | docs/nero/ |
Seeds the crawl queue from the site's nav, then follows internal links. Strips sidebar, nav, header, footer, and scripts. To add a new site, add an entry to the SITES dict at the top of the file.
Can be run standalone to scrape a single site:
python scrape_static_docs.py carina
python scrape_static_docs.py nero

Environment variables:
| Variable | Default | Description |
|---|---|---|
| STATIC_DOCS_OUTPUT_DIR | docs | Base output directory |
| STATIC_DOCS_MAX_PAGES | 200 | Maximum pages per site |
| LOG_FILE | logs/scrapers.log | Log file path |
python generate_manifests.py # default: docs/
python generate_manifests.py /path/to/docs

Writes a `.content_manifest.json` file in each docs subdirectory. Each manifest maps filenames to SHA-256 hashes. The RAG service compares these at startup to detect which docs changed since the last scrape and selectively evicts stale cache entries.
Called automatically at the end of filemagic.sh.
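The core of the manifest step is small enough to sketch; this version handles a single flat directory (the real script walks every docs subdirectory):

```python
import hashlib
import json
from pathlib import Path

def write_manifest(docs_dir):
    """Hash every .md file in one docs subdirectory and write
    .content_manifest.json beside them."""
    root = Path(docs_dir)
    manifest = {p.name: hashlib.sha256(p.read_bytes()).hexdigest()
                for p in sorted(root.glob("*.md"))}
    (root / ".content_manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

Using SHA-256 of the file bytes means a rescrape that produces byte-identical output leaves the manifest entry, and therefore the cache, untouched.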
| Method | Path | Description |
|---|---|---|
| GET | / | Service status |
| GET | /health | Health check — 503 if model not loaded or no retrievers |
| POST | /query/ | Submit a question; returns answer, cluster, and sources |
| POST | /cache/clear | Flush the semantic response cache |
| GET | /stats | JSON metrics: latency percentiles, cache hit rate, per-cluster counts, top queries |
| GET | /dashboard | Live monitoring dashboard (Chart.js UI, auto-refreshes every 30 s) |
| GET | /docs | Auto-generated OpenAPI UI |
Query request:
{ "query": "How do I submit a GPU job on Sherlock?", "cluster": "sherlock" }

cluster is optional — the service will attempt to detect it from the query text.
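A stdlib-only Python client for this endpoint might look like the sketch below (the `build_query` helper is made up for illustration; only the URL, method, and payload shape come from this README):

```python
import json
import urllib.request

API = "http://ada-lovelace.stanford.edu:8000"

def build_query(query, cluster=None):
    """Build the POST /query/ request; omit cluster to let the service detect it."""
    payload = {"query": query}
    if cluster is not None:
        payload["cluster"] = cluster
    return urllib.request.Request(
        f"{API}/query/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_query("How do I submit a GPU job on Sherlock?", cluster="sherlock")
# answer = json.load(urllib.request.urlopen(req))["answer"]  # requires the live API
```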
Query response:
{
"answer": "...",
"cluster": "sherlock",
"sources": [{ "title": "...", "url": "..." }]
}

Three unit files live in systemd/. Install them for automatic startup and scheduled doc scraping.
# Copy units to systemd
sudo cp systemd/ada-chatbot.service /etc/systemd/system/
sudo cp systemd/ada-chatbot-scrape.service /etc/systemd/system/
sudo cp systemd/ada-chatbot-scrape.timer /etc/systemd/system/
# Reload and enable
sudo systemctl daemon-reload
sudo systemctl enable --now ada-chatbot.service
sudo systemctl enable --now ada-chatbot-scrape.timer

| Unit | Type | Purpose |
|---|---|---|
| ada-chatbot.service | Service | Starts the API at boot; restarts on crash |
| ada-chatbot-scrape.service | Service (oneshot) | Runs the full scrape + manifest pipeline |
| ada-chatbot-scrape.timer | Timer | Triggers the scrape daily at 02:00 |
# Service status
systemctl status ada-chatbot.service
# View API logs
journalctl -u ada-chatbot.service -f
# View last scrape run
journalctl -u ada-chatbot-scrape.service --no-pager
# Run scrape manually (same as ./filemagic.sh but via systemd)
sudo systemctl start ada-chatbot-scrape.service
# Check when scrape next fires
systemctl list-timers ada-chatbot-scrape.timer

Private repo access uses a service token stored outside the codebase:
# Create the secrets directory
sudo mkdir -p /etc/ada-chatbot/secrets
sudo chmod 700 /etc/ada-chatbot/secrets
# Write the token (fine-grained PAT for ada-chatbot machine user)
echo -n "github_pat_..." | sudo tee /etc/ada-chatbot/secrets/github_token
sudo chmod 600 /etc/ada-chatbot/secrets/github_token
sudo chown bcritt:bcritt /etc/ada-chatbot/secrets/github_token

config.yaml references the file path; the token value is never committed to git.
The built-in dashboard is at /dashboard. Since ada-lovelace is not directly accessible from a browser, use an SSH tunnel:
ssh -L 8000:localhost:8000 bcritt@ada-lovelace.stanford.edu
# Then open: http://localhost:8000/dashboard

The dashboard displays:
- KPI cards — total queries, cache hit rate, average latency, p95/p99 latency, error count
- Charts — queries by hour (last 24 h), cache hit/miss ratio, queries by cluster, latency distribution
- Top queries — 10 most frequent queries (grouped by semantic similarity)
Auto-refreshes every 30 seconds. All metrics are in-memory and reset on service restart.
| File | Contents |
|---|---|
| logs/myapp.log | Application log — startup, retrieval scores, cache hits/misses, errors |
| logs/stats.jsonl | Append-only per-query stats: timestamp, cluster, query, latency, cache hit, error |
| logs/scrapers.log | Output from scrape_srcc.py and scrape_static_docs.py |
stats.jsonl persists across restarts (append-only). The /stats endpoint reads from in-memory counters for fast access; stats.jsonl is the durable record.
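Because stats.jsonl is plain JSONL, offline analysis is a few lines of stdlib Python. A nearest-rank percentile sketch (the exact field names in the file may differ; `latency` is assumed from the table above):

```python
import json

def latency_percentile(jsonl_lines, pct=95):
    """Nearest-rank percentile over the latency field of stats.jsonl lines."""
    lats = sorted(json.loads(line)["latency"] for line in jsonl_lines if line.strip())
    rank = -(-pct * len(lats) // 100)   # ceil(pct/100 * n) without floats
    return lats[rank - 1]

# Synthetic sample: four fast queries and one cold-cache outlier (ms)
lines = [json.dumps({"latency": ms}) for ms in (120, 80, 95, 2400, 110)]
p95 = latency_percentile(lines)
```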
All settings live in config.yaml. Shell scripts and the Python app both read from it directly — it is the single source of truth.
| Key | Default | Description |
|---|---|---|
| path | — | Absolute path to the model directory on disk. Must contain config.json. |
| type | "qwen" | Model architecture. Controls prompt format: "qwen" uses system/human chat roles; "llama" uses [INST] tags. |
| device | "cuda" | "cuda" or "cpu". |
| use_quantization | false | Whether to apply runtime quantization. false for pre-quantized models (AWQ). |
| local_files_only | true | Prevent HuggingFace hub downloads at runtime. |
| dtype | "half" | vLLM dtype. "half" (float16) required for AWQ models; "bfloat16" for non-quantized. |
| Key | Default | Description |
|---|---|---|
| max_new_tokens | 1024 | Maximum tokens in the generated response. Increase for longer answers (e.g., workshop lists). |
| do_sample | false | false for greedy decoding (deterministic). true enables sampling. |
| num_beams | 1 | Beam search width. 1 disables beam search. |
| temperature | null | Sampling temperature. null disables (greedy). |
| Key | Default | Description |
|---|---|---|
| title | "SRC Cluster Knowledge Base API" | Shown in the OpenAPI docs page. |
| description | — | API description for OpenAPI. |
| version | "1.0.0" | API version string. |
| Key | Default | Description |
|---|---|---|
| cors_origins | [] | List of allowed CORS origins for the frontend. |
| Key | Default | Description |
|---|---|---|
| SEMANTIC_CACHE_ENABLED | true | Enable the semantic response cache. Requires sentence-transformers. |
| SEMANTIC_CACHE_CLEAR_ON_STARTUP | false | Flush the entire cache on every restart. Use true during development. In production, leave false — content-aware invalidation handles staleness automatically. |
| SEMANTIC_CACHE_THRESHOLD | 0.70 | Cosine similarity threshold (0.0–1.0). Lower = more permissive matching. 0.70–0.75 recommended. |
| SEMANTIC_CACHE_DB | "/workspace/.response_cache.db" | Absolute path to the SQLite cache file. Use the /workspace/ prefix so it lands in the bind-mounted host directory. |
| LANGCHAIN_CACHE_DB | "/workspace/.langchain.db" | LangChain's internal LLM cache (exact prompt dedup). |
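Conceptually, the threshold works like the sketch below: embed the incoming query, compare it against stored query embeddings, and return a hit only when similarity clears `SEMANTIC_CACHE_THRESHOLD`. The real implementation (`app/semantic_cache.py`) uses sentence-transformers and SQLite; the helpers here are illustrative:

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def cache_lookup(query_vec, cache, threshold=0.70):
    """Return the response stored for the most similar previous query,
    or None when nothing reaches the threshold (a cache miss)."""
    best_resp, best_sim = None, threshold
    for stored_vec, response in cache:
        sim = cosine(query_vec, stored_vec)
        if sim >= best_sim:
            best_resp, best_sim = response, sim
    return best_resp

cache = [([1.0, 0.0], "GPU jobs answer"), ([0.0, 1.0], "Storage answer")]
hit = cache_lookup([0.9, 0.1], cache)    # close to the first stored query
miss = cache_lookup([-1.0, 0.0], cache)  # dissimilar to both entries
```

Lowering the threshold trades precision for hit rate: paraphrases match more often, but so do genuinely different questions.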
| Key | Default | Description |
|---|---|---|
| MAX_RETRIEVED_DOCS | 5 | Number of documents passed to the LLM as context. |
| CHUNK_SIZE | 2500 | Maximum characters per chunk. Docs are split on ## headers first; oversized sections fall back to character splitting. |
| CHUNK_OVERLAP | 200 | Character overlap between chunks (character splitter only). |
| MIN_BM25_SCORE | 1.0 | BM25 score floor. Documents below this are discarded. 0.0 disables filtering. |
| HYBRID_ENABLED | true | Combine BM25 (keyword) and FAISS (semantic) retrieval via reciprocal rank fusion. |
| VECTOR_MODEL | — | Path to the sentence-transformer embedding model. Currently gte-large-en-v1.5 (434M params, MTEB 65.4). |
| RRF_K | 60 | Reciprocal rank fusion constant. Higher values compress rank differences. |
| FAISS_RRF_WEIGHT | 2.0 | FAISS weight in RRF scoring. >1.0 prefers semantic matches over keyword matches. |
| Key | Default | Description |
|---|---|---|
| GROUNDING_CHECK_ENABLED | true | Append a disclaimer when the answer discusses cluster-specific topics (Slurm commands, partitions, storage) but cites no retrieved documents. |
| REFUSAL_DISCLAIMER | — | The disclaimer text appended to ungrounded answers. |
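The guard reduces to a simple predicate: cluster-specific language plus an empty source list triggers the disclaimer. A sketch with an invented keyword list (the real trigger terms are in the app, not documented here):

```python
CLUSTER_TERMS = ("sbatch", "srun", "partition", "scratch")  # illustrative keyword list

def apply_grounding_guard(answer, sources, disclaimer):
    """Append the disclaimer when the answer touches cluster-specific topics
    but cites no retrieved documents; otherwise return it unchanged."""
    topical = any(term in answer.lower() for term in CLUSTER_TERMS)
    if topical and not sources:
        return answer + "\n\n" + disclaimer
    return answer

guarded = apply_grounding_guard("Use sbatch to submit the job.", [],
                                "Note: not grounded in SRCC docs.")
clean = apply_grounding_guard("Use sbatch to submit the job.",
                              [{"title": "Running jobs"}],
                              "Note: not grounded in SRCC docs.")
```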
Maps cluster names to their documentation directories (relative to $PWD). Each cluster gets its own BM25 + FAISS retriever pair.
clusters:
sherlock: "docs/sherlock/"
farmshare: "docs/farmshare/"
oak: "docs/oak/"
elm: "docs/elm/"
carina: "docs/carina/"
nero: "docs/nero/"

shared_docs: "docs/srcc/"

Content in this directory is merged into every cluster's retriever at startup. Use for org-wide content (workshops, policies, people) that applies regardless of cluster.
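The merge amounts to concatenating each cluster's documents with the shared set before building its retrievers. A sketch with file names standing in for loaded documents:

```python
def build_doc_sets(cluster_files, shared_files):
    """Each cluster's retriever indexes its own docs plus everything
    under shared_docs."""
    return {name: list(files) + list(shared_files)
            for name, files in cluster_files.items()}

doc_sets = build_doc_sets(
    {"sherlock": ["gpu.md"], "farmshare": ["login.md"]},
    ["workshops.md", "policies.md"],  # from docs/srcc/
)
```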
| Key | Default | Description |
|---|---|---|
| api_port | 8000 | Port for the FastAPI server. |
| host | "ada-lovelace.stanford.edu" | Hostname shown in startup messages and CORS. |
| Key | Default | Description |
|---|---|---|
| log_dir | "logs" | Directory for application logs. |
| stats_log | "/workspace/logs/stats.jsonl" | Per-query stats (latency, cache hit/miss, errors) in JSONL format. |
Legacy multi-worker configuration. Ignored by main.sh. Retained for reference only.
| Variable | Overrides | Description |
|---|---|---|
| MODEL_PATH | model.path | Path to the LLM model directory |
| API_PORT | server.api_port | Port for the API |
| API_HOST | server.host | Hostname shown in startup messages |
| LOG_DIR | logging.log_dir | Directory for app logs |
Pass variables into the Apptainer container with the APPTAINERENV_ prefix:
APPTAINERENV_PYTHONPATH=/workspace  # required — activates sitecustomize.py shims

apichatbot/
|-- main.sh # Start the API (production or dev mode)
|-- filemagic.sh # Fetch and process all documentation
|-- setup_upgrade.sh # Download model, rebuild container, tune BM25
|-- stop_multi_gpu.sh # Stop worker/load-balancer processes
|-- comprehensive_test.sh # Full integration test suite
|-- test_cache.sh # Quick semantic cache smoke test
|-- gpu.sh # GPU compute diagnostic
|-- diagnose_gen.sh # Generation pipeline diagnostic
|-- generate_manifests.py # Content hashes for cache invalidation
|-- start_multi_gpu.sh # DEPRECATED -- exits with error
|-- bench_gemma.sh # Legacy benchmark (Gemma 2, not vLLM)
|-- benchmark_llama.sh # Legacy benchmark (TinyLlama, not vLLM)
|-- chatbot.def # Apptainer container definition (API)
|-- chatbot.sif # Built container (generated)
|-- file_processing.def # Apptainer container definition (doc processing)
|-- file_processing.sif # Built container (generated)
|-- config.yaml # Central configuration -- source of truth
|-- requirements.txt # Python dependencies for chatbot container
|-- sitecustomize.py # Apptainer/vLLM compatibility shims (auto-loaded)
|-- pynvml.py # Standalone pynvml mock (fallback shim)
|-- file_magic.py # Clone GitHub repos, flatten MkDocs docs to .md
|-- scrape_srcc.py # Scrape srcc.stanford.edu -> docs/srcc/
|-- scrape_static_docs.py # Scrape Carina/Nero docs -> docs/carina/, docs/nero/
|-- var_clean_up.py # Markdown variable substitution utility
|-- app/
| |-- main.py # FastAPI app and route definitions
| |-- rag_service.py # Model loading, retrieval, generation, caching
| |-- config.py # Settings -- reads config.yaml + env vars
| |-- prompts.py # System prompt and chat template logic
| |-- semantic_cache.py # SQLite semantic response cache
| |-- load_balancer.py # DEPRECATED -- old multi-worker load balancer
| +-- models.py # Pydantic request/response models
|-- docs/ # Generated documentation (output of filemagic.sh)
| |-- sherlock/
| |-- farmshare/
| |-- oak/
| |-- elm/
| |-- carina/
| |-- nero/
| +-- srcc/
+-- logs/
+-- myapp.log
# Verify the path in config.yaml exists and has the right files
ls /home/users/bcritt/apichatbot/models/Qwen3-32B-AWQ/config.json
python3 -c "import json; print(json.load(open('/home/users/bcritt/apichatbot/models/Qwen3-32B-AWQ/config.json'))['model_type'])"

vLLM detects the model architecture from config.json. If the path is wrong or missing, vLLM may fall back to a cached model (with a different architecture and dtype requirements). AWQ models require dtype: "half" in config.yaml.
The gte-large-en-v1.5 embedding model uses custom Alibaba-NLP code (modeling.py, configuration.py). These files must be present in the model directory, and config.json must reference them as local paths (not Alibaba-NLP/new-impl--).
# Verify custom code files exist
ls /home/users/bcritt/apichatbot/models/gte-large-en-v1.5/*.py
# Verify config.json uses local paths
grep auto_map /home/users/bcritt/apichatbot/models/gte-large-en-v1.5/config.json
# Should show "configuration.NewConfig", not "Alibaba-NLP/new-impl--configuration.NewConfig"

If the .py files are missing, download them from the Alibaba-NLP/new-impl repo:
huggingface-cli download Alibaba-NLP/new-impl --include "*.py" \
--local-dir /home/users/bcritt/apichatbot/models/gte-large-en-v1.5

If config.json still references the remote prefix:
sed -i 's|Alibaba-NLP/new-impl--||g' /home/users/bcritt/apichatbot/models/gte-large-en-v1.5/config.json

The pynvml shims in sitecustomize.py are not being loaded. Ensure APPTAINERENV_PYTHONPATH=/workspace is set (this is done by main.sh). If running uvicorn directly, set PYTHONPATH=$PWD before the command.
Orphaned shared memory from a previous SIGKILL'd vLLM process. main.sh cleans this up automatically, but if running manually:
find /dev/shm -maxdepth 1 -user "$USER" -name "psm_*" -delete

tail -100 logs/myapp.log | grep -E "startup|ERROR|FATAL|CRITICAL"
nvidia-smi

Content-aware invalidation runs automatically on restart after a scrape. To force a full flush:

curl -X POST http://ada-lovelace.stanford.edu:8000/cache/clear

apptainer build --sandbox test_sandbox chatbot.def

vLLM runs with enforce_eager=True, which disables CUDA graph capture and torch.compile.
Root cause: Even with sitecustomize.py working around the NVML crash at startup, PyTorch's CUDACachingAllocator contains a separate hard C++ assertion on nvmlInit_v2_() at CUDACachingAllocator.cpp:1124 that fires during CUDA graph warmup. This assertion cannot be patched from Python.
Impact: ~5--15% higher per-token latency. Acceptable for a documentation chatbot.
To restore CUDA graphs (requires NVML to work natively inside the container):
- Verify: `apptainer exec --nv instance://chatapi python -c "import pynvml; pynvml.nvmlInit(); print('ok')"`
- If that fails, check that `--nv` is binding `libnvidia-ml.so` from the host and that its version matches the driver: `nvidia-smi` on the host vs. `ldconfig -p | grep nvidia-ml` inside the container
- Once NVML initializes cleanly (no warnings in logs), remove `enforce_eager=True` from `_load_model()` in `app/rag_service.py`
NCCL all-reduce between the two L4 GPUs uses TCP socket over loopback (NCCL_P2P_DISABLE=1, NCCL_SHM_DISABLE=1). P2P is blocked by IOMMU inside Apptainer; SHM has namespace collisions. This adds ~32 ms per forward pass and makes the initial warmup slow. See _load_model() in app/rag_service.py for details.
For issues with the clusters or API: srcc-support@stanford.edu
For debugging the chatbot:
- Run `./comprehensive_test.sh` to pinpoint which component is failing
- Check `logs/myapp.log`
- Run `./gpu.sh` to verify GPU compute is working
- Verify model path and `dtype` in `config.yaml`