diff --git a/demos/crisis-triage/README.md b/demos/crisis-triage/README.md new file mode 100644 index 0000000..d7c5509 --- /dev/null +++ b/demos/crisis-triage/README.md @@ -0,0 +1,100 @@ +# Demo 2: Crisis Triage with Ambiguous Incidents + +A municipal emergency operations center where LLM-powered dispatchers assess ambiguous crisis reports — demonstrating that keyword matching fails when incidents are deliberately misleading, but LLMs reading full impact descriptions can succeed. + +Target runtime: NetLogo 7.0.3 (`.nlogox` model format). + +## The Story + +Three dispatchers — Veteran, Rookie, and Analyst — receive a stream of crisis incidents. Each must assess severity and route to the right response tier. The incident bank includes **misleading cases** where surface keywords don't match reality: + +- "Toxic chemical spill at school" → actually spilled vinegar (LOW severity) +- "Minor water leak in basement" → threatening a neonatal ICU (CRITICAL severity) +- "Dog loose on highway" → causing a multi-vehicle pileup (HIGH severity) + +A naive keyword heuristic over-triggers on "toxic", "fire", "collapse" and fails on these cases. The LLM reads the full impact description and can assess correctly. + +## Quick Start + +1. Edit `config.txt` with your provider credentials (default: local Ollama). +2. Open `crisis-triage.nlogox` in NetLogo 7.0.3. +3. Click **setup** → dispatchers appear with persona labels, responders by tier. +4. Click **go** → incidents spawn, flow through the pipeline, monitors update. +5. Watch the output log for `[TRIAGE]`, `[ROUTE]`, and `[REFLECT]` messages. 
+ +## How to Use + +### Controls + +| Control | Type | Purpose | +|---------|------|---------| +| `use-llm?` | Switch | Toggle between LLM dispatchers and naive heuristic | +| `memory-mode` | Chooser | persistent / per-episode / none | +| `reflection-interval` | Slider | Ticks between dispatcher self-reflection (0 = off) | +| `incident-rate` | Slider | Probability (%) of new incident per tick | +| `episode-length` | Slider | Ticks per episode boundary (0 = no episodes) | +| `add incident` | Button | Manually inject a random incident | +| `force reflect` | Button | Trigger immediate reflection for all dispatchers | + +### What to Observe + +- **Misleading%** — The key metric. Accuracy on misleading incidents where keywords don't match reality. +- **Triage Acc%** / **Route Acc%** — Overall accuracy vs ground truth. +- **Accuracy Over Time** plot — Watch how accuracy evolves, especially with memory. +- **Per-persona differences** — Veteran, Rookie, and Analyst may perform differently. +- **Reflection log** — Dispatchers reason about their own performance. + +## The A/B Experiment + +1. Run with `use-llm?` ON for 50+ ticks. Note the Misleading% metric. +2. Click setup again. Toggle `use-llm?` OFF. Run for 50+ ticks. +3. Compare: + - **Heuristic**: ~30% on misleading cases (keywords mislead it). + - **LLM**: Expected ~70%+ on misleading cases (reads actual impact). +4. Compare memory modes: Run with "persistent" vs "none" over multiple episodes. 
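The heuristic side of this comparison can be sanity-checked without running the model. The sketch below is a standalone Python re-creation (not part of the NetLogo model) of the keyword rules in `heuristic-triage`, applied to the ten misleading incidents from the incident bank. It assumes case-insensitive substring matching; the names `KEYWORD_RULES`, `MISLEADING`, `heuristic_triage`, and `misleading_accuracy` are illustrative only.

```python
# Standalone sketch of the model's keyword heuristic (assumption: case-insensitive matching).
# Rules are listed in the same order as the heuristic-triage reporter; first match wins.
KEYWORD_RULES = [
    ("fire", "CRITICAL"), ("explosion", "CRITICAL"), ("collapse", "CRITICAL"),
    ("toxic", "CRITICAL"), ("casualty", "CRITICAL"), ("chemical", "HIGH"),
    ("trapped", "CRITICAL"), ("spill", "HIGH"), ("suspicious", "HIGH"),
    ("earthquake", "CRITICAL"), ("flood", "HIGH"), ("outage", "HIGH"),
    ("injuries", "HIGH"), ("accident", "HIGH"), ("alarm", "HIGH"),
    ("evacuat", "CRITICAL"),
]

# (summary, impact, ground-truth severity) for the ten misleading incidents.
MISLEADING = [
    ("Server room fire alarm triggered",
     "Sensor malfunction confirmed; no smoke, no heat, systems normal", "LOW"),
    ("Toxic chemical spill reported at school",
     "Custodian knocked over a bottle of vinegar in the cafeteria; no injuries", "LOW"),
    ("Building collapse alert downtown",
     "Scaffolding panel fell on empty sidewalk at 3 AM; no structural damage to building", "LOW"),
    ("Mass casualty incident at concert venue",
     "Three people fainted from heat; all conscious, first aid on scene", "MODERATE"),
    ("Explosion heard near hospital",
     "Transformer blew on adjacent street; hospital on backup power, no injuries", "MODERATE"),
    ("Data center cooling failure",
     "Regional hospital patient records, 911 dispatch system, and pharmacy networks all "
     "depend on this center; 30 minutes to critical thermal threshold", "CRITICAL"),
    ("Minor water leak in basement",
     "Leak is in the electrical vault supplying the neonatal ICU; backup generators have "
     "45 minutes of fuel", "CRITICAL"),
    ("Small kitchen fire at restaurant",
     "Fire spreading to adjacent apartment building; 40 residents trapped above; fire "
     "department 20 minutes away", "CRITICAL"),
    ("Routine power fluctuation reported",
     "Affecting traffic signals across 12 intersections during school dismissal; two "
     "near-miss accidents already", "HIGH"),
    ("Dog loose on highway",
     "Causing multi-vehicle chain reaction on I-95; 6 cars involved, injuries reported, "
     "highway blocked both directions", "HIGH"),
]


def heuristic_triage(summary, impact):
    """Return the severity of the first keyword rule found in summary + impact."""
    text = f"{summary} {impact}".lower()
    for fragment, severity in KEYWORD_RULES:
        if fragment in text:
            return severity
    return "MODERATE"  # default when no keyword matches


def misleading_accuracy(cases=MISLEADING):
    """Percentage of cases where the heuristic matches the ground-truth severity."""
    correct = sum(heuristic_triage(s, i) == truth for s, i, truth in cases)
    return 100.0 * correct / len(cases)


if __name__ == "__main__":
    print(f"Heuristic accuracy on misleading cases: {misleading_accuracy():.0f}%")
```

Under these assumptions the rules get 3 of the 10 misleading incidents right, which is consistent with the ~30% figure above. The LLM figure cannot be reproduced this way and depends on the provider and model.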
+ +## LLM Primitives Exercised (8) + +| Primitive | Where | Paper Concept | +|-----------|-------|---------------| +| `llm:load-config` | `setup-llm` | Config management | +| `llm:set-history` | `setup-dispatchers` — persona injection | Personalization (Ch.2) | +| `llm:chat-with-template` | `triage-my-incidents` — severity assessment | Environment/Interface (Ch.1) | +| `llm:choose` | `route-my-incidents` — bounded tier selection | Bounded Rationality | +| `llm:history` | `dispatcher-reflect` — check history length | Memory (Ch.3) | +| `llm:chat` | `dispatcher-reflect` — freeform reflection | Reflection (Ch.3) | +| `llm:clear-history` | `handle-episode-boundary` — configurable reset | Memory ablation | +| `llm:active` | Monitor widget — show provider/model | Provider awareness | + +## Design Rationale + +**Why dispatchers use LLM, not responders**: Triage and routing are judgment calls where reading context matters. Case processing is mechanical — it doesn't benefit from language understanding. + +**Why no thinking/reasoning models**: With 3 dispatchers making 2+ LLM calls per tick, thinking models would add minutes of latency per tick. The triage task is classification, not multi-step reasoning. Standard `llm:chat-with-template` and `llm:choose` are the right tools. + +**Why `llm:choose` for routing**: Guarantees the output is one of the valid tier names, avoiding parsing failures from freeform text. + +**Why misleading incidents**: They make the LLM genuinely necessary. Without them, keyword matching achieves similar accuracy and the LLM adds cost without value. + +## Paper Connection + +This demo implements concepts from the Gao et al. (2312.11970) LLM-ABM survey: + +- **Personalization** (Ch.2): Dispatcher personas via `llm:set-history` produce different decisions from the same model. +- **Bounded Rationality**: `llm:choose` constrains decisions to valid options. +- **Memory** (Ch.3): Configurable memory modes show how history retention affects performance. 
+- **Reflection** (Ch.3): Dispatchers reason about their own accuracy and identify patterns. +- **Environment/Interface** (Ch.1): Templates structure how agents perceive incidents. + +## Files + +| File | Purpose | +|------|---------| +| `crisis-triage.nlogox` | NetLogo 7 simulation model | +| `triage-template.yaml` | Severity assessment prompt with anti-keyword-bias guidance | +| `dispatcher-template.yaml` | Documentation stub (routing uses `llm:choose`) | +| `config.txt` | LLM provider configuration | + +## Provider Configuration + +Default is local Ollama (no API key needed). See commented examples in `config.txt` for OpenAI, Claude, and Gemini. Never commit real API keys. diff --git a/demos/crisis-triage/config.txt b/demos/crisis-triage/config.txt new file mode 100644 index 0000000..e927166 --- /dev/null +++ b/demos/crisis-triage/config.txt @@ -0,0 +1,34 @@ +# Crisis Triage Demo LLM configuration +# Path is loaded by crisis-triage.nlogox via llm:load-config + +# Recommended local/default option (no cloud key required) +provider=ollama +model=qwen2.5:7b +base_url=http://localhost:11434 + +# Runtime behavior +temperature=0.2 +max_tokens=200 +timeout_seconds=45 + +# Optional cloud fallback examples (commented) +# provider=openai +# api_key=YOUR_OPENAI_API_KEY_HERE +# model=gpt-4o-mini +# temperature=0.2 +# max_tokens=200 +# timeout_seconds=45 + +# provider=claude +# api_key=YOUR_ANTHROPIC_API_KEY_HERE +# model=claude-3-5-haiku-latest +# temperature=0.2 +# max_tokens=200 +# timeout_seconds=45 + +# provider=gemini +# api_key=YOUR_GEMINI_API_KEY_HERE +# model=gemini-2.0-flash +# temperature=0.2 +# max_tokens=200 +# timeout_seconds=45 diff --git a/demos/crisis-triage/crisis-triage.nlogox b/demos/crisis-triage/crisis-triage.nlogox new file mode 100644 index 0000000..3b84ec6 --- /dev/null +++ b/demos/crisis-triage/crisis-triage.nlogox @@ -0,0 +1,1117 @@ + + + + create-dispatchers 1 [ + set persona-name item 0 p + set persona-prompt item 1 p + set my-triaged 0 + set 
my-correct-triage 0 + set my-routed 0 + set my-correct-route 0 + set shape "person" + set size 2.5 + set color blue + 2 + setxy px 14 + set label persona-name + set px px + 7 + + ;; Inject persona via llm:set-history if LLM is active + if llm-ready? and use-llm? [ + carefully [ + llm:set-history (list + (list "system" persona-prompt) + ) + ] [ + output-print (word "[SETUP] Failed to set history for " persona-name ": " error-message) + ] + ] + ] + ] +end + +;; --------------------------------------------------------------------------- +;; Setup Responders (3 BASIC cap=3, 3 EXPERT cap=2, 3 COORDINATOR cap=1) +;; --------------------------------------------------------------------------- + +to setup-responders + let base-x -12 + ;; BASIC responders + create-responders 3 [ + set tier "BASIC" + set capacity 3 + set current-load 0 + set resolved-count 0 + set shape "circle" + set size 1.5 + set color green + 1 + set label "B" + ] + let idx 0 + ask responders with [ tier = "BASIC" ] [ + setxy (base-x + idx * 3) -12 + set idx idx + 1 + ] + + ;; EXPERT responders + create-responders 3 [ + set tier "EXPERT" + set capacity 2 + set current-load 0 + set resolved-count 0 + set shape "circle" + set size 1.8 + set color orange + 1 + set label "E" + ] + set idx 0 + ask responders with [ tier = "EXPERT" ] [ + setxy (base-x + 10 + idx * 3) -12 + set idx idx + 1 + ] + + ;; COORDINATOR responders + create-responders 3 [ + set tier "COORDINATOR" + set capacity 1 + set current-load 0 + set resolved-count 0 + set shape "circle" + set size 2.1 + set color violet + 1 + set label "C" + ] + set idx 0 + ask responders with [ tier = "COORDINATOR" ] [ + setxy (base-x + 20 + idx * 3) -12 + set idx idx + 1 + ] +end + +;; --------------------------------------------------------------------------- +;; Incident Bank (30 incidents: 10 misleading + 10 clear + 10 borderline) +;; --------------------------------------------------------------------------- + +to build-incident-bank + ;; Each entry: 
[summary impact ground-truth-severity ground-truth-tier category] + ;; MISLEADING: keywords suggest one severity but actual impact warrants another + set incident-bank (list + ;; --- MISLEADING (10): keywords mislead naive classifiers --- + (list "Server room fire alarm triggered" + "Sensor malfunction confirmed; no smoke, no heat, systems normal" + "LOW" "BASIC" "misleading") + (list "Toxic chemical spill reported at school" + "Custodian knocked over a bottle of vinegar in the cafeteria; no injuries" + "LOW" "BASIC" "misleading") + (list "Building collapse alert downtown" + "Scaffolding panel fell on empty sidewalk at 3 AM; no structural damage to building" + "LOW" "BASIC" "misleading") + (list "Mass casualty incident at concert venue" + "Three people fainted from heat; all conscious, first aid on scene" + "MODERATE" "BASIC" "misleading") + (list "Explosion heard near hospital" + "Transformer blew on adjacent street; hospital on backup power, no injuries" + "MODERATE" "EXPERT" "misleading") + (list "Data center cooling failure" + "Regional hospital patient records, 911 dispatch system, and pharmacy networks all depend on this center; 30 minutes to critical thermal threshold" + "CRITICAL" "COORDINATOR" "misleading") + (list "Minor water leak in basement" + "Leak is in the electrical vault supplying the neonatal ICU; backup generators have 45 minutes of fuel" + "CRITICAL" "COORDINATOR" "misleading") + (list "Small kitchen fire at restaurant" + "Fire spreading to adjacent apartment building; 40 residents trapped above; fire department 20 minutes away" + "CRITICAL" "COORDINATOR" "misleading") + (list "Routine power fluctuation reported" + "Affecting traffic signals across 12 intersections during school dismissal; two near-miss accidents already" + "HIGH" "EXPERT" "misleading") + (list "Dog loose on highway" + "Causing multi-vehicle chain reaction on I-95; 6 cars involved, injuries reported, highway blocked both directions" + "HIGH" "EXPERT" "misleading") + + ;; --- 
CLEAR (10): keywords and impact align --- + (list "Multi-vehicle pileup on interstate" + "12 vehicles, multiple injuries confirmed, highway fully blocked, EMS requesting additional units" + "CRITICAL" "COORDINATOR" "clear") + (list "Warehouse fire with toxic plume" + "Residential area downwind being evacuated; 500+ people displaced; air quality hazardous" + "CRITICAL" "COORDINATOR" "clear") + (list "Earthquake damage to bridge" + "Visible structural cracks; bridge closed; 50,000 daily commuters affected; engineers en route" + "CRITICAL" "COORDINATOR" "clear") + (list "School bus accident with injuries" + "Bus overturned; 8 children with minor-moderate injuries; parents arriving at scene" + "HIGH" "EXPERT" "clear") + (list "Chemical plant pressure valve failure" + "Controlled venting in progress; shelter-in-place advisory for 2-mile radius; monitoring air quality" + "HIGH" "EXPERT" "clear") + (list "Hospital generator test failure" + "Backup generator failed routine test; primary power stable; repair crew dispatched for same-day fix" + "MODERATE" "BASIC" "clear") + (list "Broken water main on residential street" + "Low-pressure water to 30 homes; repair crew en route; estimated 4-hour fix" + "MODERATE" "BASIC" "clear") + (list "Traffic signal malfunction at intersection" + "Single intersection flashing red; police directing traffic; no accidents" + "LOW" "BASIC" "clear") + (list "Park trail flooding after rain" + "Trails closed; no hikers in area; water receding naturally" + "LOW" "BASIC" "clear") + (list "Streetlight outage on residential block" + "Six streetlights out; residents notified; maintenance scheduled for morning" + "LOW" "BASIC" "clear") + + ;; --- BORDERLINE (10): genuinely ambiguous, reasonable people could disagree --- + (list "Subway train stalled between stations" + "200 passengers stuck for 25 minutes; ventilation working; rescue train dispatched; some passengers anxious" + "MODERATE" "EXPERT" "borderline") + (list "Power outage at nursing home" + 
"Backup generator active; 60 residents comfortable; generator fuel for 8 hours; utility ETA unknown" + "HIGH" "EXPERT" "borderline") + (list "Gas smell reported near elementary school" + "School in session; gas company en route; no readings yet; precautionary evacuation being considered" + "HIGH" "EXPERT" "borderline") + (list "Protest blocking major intersection" + "500 people; peaceful but not dispersing; ambulance rerouting adds 8 minutes to hospital route" + "MODERATE" "EXPERT" "borderline") + (list "Crane malfunction at construction site" + "Crane arm stuck over occupied building; no immediate danger but wind advisory in effect for afternoon" + "HIGH" "EXPERT" "borderline") + (list "River level rising near flood stage" + "2 feet below flood level; rain expected to continue 6 hours; 200 homes in potential flood zone" + "HIGH" "COORDINATOR" "borderline") + (list "Suspicious package at government building" + "Building evacuated; bomb squad 15 minutes away; 300 workers displaced; likely false alarm based on description" + "MODERATE" "EXPERT" "borderline") + (list "Internet outage affecting emergency services" + "911 calls routing to backup center; 12-second additional delay per call; estimated 2-hour repair" + "HIGH" "EXPERT" "borderline") + (list "Heat wave shelter capacity reached" + "Main cooling center full at 150 people; overflow into library planned; 3 elderly residents showing heat stress" + "MODERATE" "EXPERT" "borderline") + (list "Airport runway incursion reported" + "Ground vehicle crossed active runway; no aircraft in immediate path; runway closed for inspection" + "MODERATE" "EXPERT" "borderline") + ) +end + +;; =========================================================================== +;; GO LOOP +;; =========================================================================== + +to go + ;; Episode boundary check + handle-episode-boundary + + ;; Spawn new incidents + if random 100 < incident-rate [ + spawn-incident + ] + + ;; Dispatchers triage and 
route + ask dispatchers [ + triage-my-incidents + route-my-incidents + ] + + ;; Responders process active cases + process-active-cases + + ;; Check deadlines + check-deadlines + + ;; Reflection at intervals + if reflection-interval > 0 and ticks > 0 and ticks mod reflection-interval = 0 [ + ask dispatchers [ + dispatcher-reflect + ] + ] + + set episode-tick-counter episode-tick-counter + 1 + tick +end + +;; =========================================================================== +;; INCIDENT SPAWNING +;; =========================================================================== + +to spawn-incident + let picked one-of incident-bank + create-incidents 1 [ + set summary item 0 picked + set impact item 1 picked + set ground-truth-severity item 2 picked + set ground-truth-tier item 3 picked + set incident-category item 4 picked + set assessed-severity "" + set assessed-tier "" + set queue-state "new" + set triage-correct? false + set route-correct? false + set created-at ticks + set assigned-responder nobody + + ;; Deadline: severity-dependent time window + let window severity-deadline ground-truth-severity + set deadline ticks + window + + set shape "circle" + set size 1.0 + set color yellow + setxy (random-xcor * 0.5) (9 + random 3) + set label "" + ] +end + +;; Manual incident injection button +to add-incident + spawn-incident + output-print "[MANUAL] Incident added" +end + +to-report severity-deadline [ sev ] + if sev = "LOW" [ report 30 ] + if sev = "MODERATE" [ report 20 ] + if sev = "HIGH" [ report 12 ] + report 8 ;; CRITICAL +end + +;; =========================================================================== +;; TRIAGE (dispatchers assess severity via llm:chat-with-template) +;; =========================================================================== + +to triage-my-incidents + ;; Each dispatcher picks one untriaged incident per tick + let target one-of incidents with [ queue-state = "new" ] + if target = nobody [ stop ] + + let sev "" + + ifelse 
llm-ready? and use-llm? [ + ;; LLM triage via template + carefully [ + let response llm:chat-with-template triage-template-path (list + (list "persona" persona-prompt) + (list "episode" (word current-episode)) + (list "tick" (word ticks)) + (list "incident" [summary] of target) + (list "impact" [impact] of target) + ) + set sev extract-severity response + output-print (word "[TRIAGE:" persona-name "] " [summary] of target " -> " sev) + ] [ + output-print (word "[TRIAGE:" persona-name "] LLM failed: " error-message) + set sev "" + ] + ] [ + ;; Heuristic triage (naive keyword matching — deliberately bad on misleading cases) + set sev heuristic-triage [summary] of target [impact] of target + output-print (word "[TRIAGE:heuristic] " [summary] of target " -> " sev) + ] + + ;; Fallback if empty + if sev = "" [ set sev "MODERATE" ] + + ;; Score + let truth [ground-truth-severity] of target + let is-correct? (sev = truth) + + set total-triaged total-triaged + 1 + set my-triaged my-triaged + 1 + if is-correct? [ + set correct-triage correct-triage + 1 + set my-correct-triage my-correct-triage + 1 + ] + if [incident-category] of target = "misleading" [ + set misleading-triaged misleading-triaged + 1 + if is-correct? [ set misleading-correct misleading-correct + 1 ] + ] + + ask target [ + set assessed-severity sev + set triage-correct? is-correct? + set queue-state "triaged" + set color severity-color sev + setxy xcor (3 + random 3) + ] +end + +;; Heuristic triage: deliberately naive keyword matching +to-report heuristic-triage [ s i ] + let text (word s " " i) + ;; Keywords that trigger high severity regardless of actual impact + if has-word? text "fire" [ report "CRITICAL" ] + if has-word? text "explosion" [ report "CRITICAL" ] + if has-word? text "collapse" [ report "CRITICAL" ] + if has-word? text "toxic" [ report "CRITICAL" ] + if has-word? text "casualty" [ report "CRITICAL" ] + if has-word? text "chemical" [ report "HIGH" ] + if has-word? 
text "trapped" [ report "CRITICAL" ] + if has-word? text "spill" [ report "HIGH" ] + if has-word? text "suspicious" [ report "HIGH" ] + if has-word? text "earthquake" [ report "CRITICAL" ] + if has-word? text "flood" [ report "HIGH" ] + if has-word? text "outage" [ report "HIGH" ] + if has-word? text "injuries" [ report "HIGH" ] + if has-word? text "accident" [ report "HIGH" ] + if has-word? text "alarm" [ report "HIGH" ] + if has-word? text "evacuat" [ report "CRITICAL" ] + ;; Default for anything without scary keywords + report "MODERATE" +end + +to-report has-word? [ text word-fragment ] + ;; position is case-sensitive, so check the lowercase keyword and its capitalized form + report position word-fragment text != false or position (capitalize-first word-fragment) text != false +end + +to-report capitalize-first [ s ] + ;; Upper-case the first letter so keywords also match at the start of a summary + ;; (e.g. "toxic" must match "Toxic chemical spill") + if empty? s [ report s ] + let idx position (first s) "abcdefghijklmnopqrstuvwxyz" + if idx = false [ report s ] + report (word (item idx "ABCDEFGHIJKLMNOPQRSTUVWXYZ") (but-first s)) +end + +to-report extract-severity [ response ] + if position "CRITICAL" response != false [ report "CRITICAL" ] + if position "HIGH" response != false [ report "HIGH" ] + if position "MODERATE" response != false [ report "MODERATE" ] + if position "LOW" response != false [ report "LOW" ] + report "" +end + +to-report severity-color [ sev ] + if sev = "LOW" [ report 55 ] ;; green + if sev = "MODERATE" [ report 45 ] ;; yellow + if sev = "HIGH" [ report 25 ] ;; orange + if sev = "CRITICAL" [ report 15 ] ;; red + report 5 ;; grey +end + +;; =========================================================================== +;; ROUTING (dispatchers route via llm:choose) +;; =========================================================================== + +to route-my-incidents + let target one-of incidents with [ queue-state = "triaged" ] + if target = nobody [ stop ] + + let chosen-tier "" + let choices (list "BASIC" "EXPERT" "COORDINATOR" "HOLD") + + ifelse llm-ready? and use-llm? 
[ + ;; LLM routing via llm:choose + carefully [ + let prompt (word + "Incident: " [summary] of target "\n" + "Severity: " [assessed-severity] of target "\n" + "Impact: " [impact] of target "\n" + "Current load — BASIC: " count-active-tier "BASIC" "/9" + ", EXPERT: " count-active-tier "EXPERT" "/6" + ", COORDINATOR: " count-active-tier "COORDINATOR" "/3" "\n" + "Routing rules based on severity:\n" + " - LOW severity -> BASIC\n" + " - MODERATE severity -> BASIC (or EXPERT if BASIC is full)\n" + " - HIGH severity -> EXPERT\n" + " - CRITICAL severity -> COORDINATOR\n" + " - HOLD only if the appropriate tier AND all higher tiers are at capacity.\n" + "The assessed severity for this incident is " [assessed-severity] of target ". Apply the rules above." + ) + set chosen-tier llm:choose prompt choices + output-print (word "[ROUTE:" persona-name "] " [summary] of target " -> " chosen-tier) + ] [ + output-print (word "[ROUTE:" persona-name "] LLM choose failed: " error-message) + set chosen-tier "" + ] + ] [ + ;; Heuristic routing + set chosen-tier heuristic-route [assessed-severity] of target + output-print (word "[ROUTE:heuristic] " [summary] of target " -> " chosen-tier) + ] + + if chosen-tier = "" [ set chosen-tier heuristic-route [assessed-severity] of target ] + if chosen-tier = "HOLD" [ + output-print (word "[HOLD] " [summary] of target " — waiting for capacity") + stop + ] + + ;; Find available responder in chosen tier + let worker find-responder chosen-tier + if worker = nobody [ + ;; Try escalation + set worker find-responder escalation-tier chosen-tier + if worker != nobody [ + set total-escalated total-escalated + 1 + set chosen-tier [tier] of worker + ] + ] + if worker = nobody [ stop ] ;; No capacity anywhere + + ;; Score routing + let truth [ground-truth-tier] of target + let is-correct? (chosen-tier = truth) + set total-routed total-routed + 1 + set my-routed my-routed + 1 + if is-correct? 
[ + set correct-route correct-route + 1 + set my-correct-route my-correct-route + 1 + ] + + ask worker [ + set current-load current-load + 1 + ] + + ask target [ + set assessed-tier chosen-tier + set route-correct? is-correct? + set queue-state "active" + set assigned-responder worker + ;; Move toward responder zone + setxy ([xcor] of worker + random-float 2 - 1) ([ycor] of worker + 3) + set label "" + ] +end + +to-report heuristic-route [ sev ] + if sev = "LOW" [ report "BASIC" ] + if sev = "MODERATE" [ report "BASIC" ] + if sev = "HIGH" [ report "EXPERT" ] + report "COORDINATOR" +end + +to-report escalation-tier [ current-tier ] + if current-tier = "BASIC" [ report "EXPERT" ] + if current-tier = "EXPERT" [ report "COORDINATOR" ] + report "COORDINATOR" +end + +to-report find-responder [ tier-name ] + let candidates responders with [ tier = tier-name and current-load < capacity ] + ifelse any? candidates [ + report min-one-of candidates [ current-load ] + ] [ + report nobody + ] +end + +to-report count-active-tier [ tier-name ] + report count incidents with [ queue-state = "active" and assessed-tier = tier-name ] +end + +;; =========================================================================== +;; PROCESSING + DEADLINES +;; =========================================================================== + +to process-active-cases + ask incidents with [ queue-state = "active" ] [ + let chance completion-probability assessed-tier + if random-float 1 < chance [ + resolve-incident self + ] + ] +end + +to-report completion-probability [ tier-name ] + if tier-name = "BASIC" [ report 0.15 ] + if tier-name = "EXPERT" [ report 0.20 ] + if tier-name = "COORDINATOR" [ report 0.25 ] + report 0.10 +end + +to resolve-incident [ inc ] + let worker [assigned-responder] of inc + if worker != nobody [ + ask worker [ + set current-load max (list 0 (current-load - 1)) + set resolved-count resolved-count + 1 + ] + ] + + set total-resolved total-resolved + 1 + set total-response-ticks 
total-response-ticks + (ticks - [created-at] of inc) + + ask inc [ + set queue-state "resolved" + set color grey + 2 + set size 0.6 + setxy xcor (-15 + random-float 1) + set label "" + ] +end + +to check-deadlines + ask incidents with [ queue-state = "active" and ticks > deadline ] [ + set queue-state "late" + set total-late total-late + 1 + set color magenta + output-print (word "[LATE] " summary " — exceeded deadline at tick " ticks) + + ;; Try to escalate late cases + let current-tier assessed-tier + let higher-tier escalation-tier current-tier + if higher-tier != current-tier [ + let new-worker find-responder higher-tier + if new-worker != nobody [ + ;; Release old responder + if assigned-responder != nobody [ + ask assigned-responder [ + set current-load max (list 0 (current-load - 1)) + ] + ] + ask new-worker [ set current-load current-load + 1 ] + set assigned-responder new-worker + set assessed-tier higher-tier + set queue-state "active" + set total-escalated total-escalated + 1 + output-print (word "[ESCALATE] " summary " -> " higher-tier) + ] + ] + ] + + ;; Also let late-but-still-processing cases resolve + ask incidents with [ queue-state = "late" ] [ + let chance completion-probability assessed-tier + if random-float 1 < chance [ + resolve-incident self + ] + ] +end + +;; =========================================================================== +;; REFLECTION (dispatchers reflect on performance via llm:chat) +;; =========================================================================== + +to dispatcher-reflect + if not llm-ready? or not use-llm? 
[ stop ] + if my-triaged = 0 [ stop ] + + ;; Only reflect if enough history accumulated + let hist-len 0 + carefully [ + set hist-len length llm:history + ] [ + set hist-len 0 + ] + if hist-len < 4 [ stop ] + + let my-triage-acc ifelse-value (my-triaged > 0) [ precision (my-correct-triage / my-triaged * 100) 1 ] [ 0 ] + let my-route-acc ifelse-value (my-routed > 0) [ precision (my-correct-route / my-routed * 100) 1 ] [ 0 ] + + carefully [ + let reflection llm:chat (word + "REFLECTION — You are " persona-name " dispatcher. Review your performance:\n" + "Triage accuracy: " my-triage-acc "% (" my-correct-triage "/" my-triaged ")\n" + "Routing accuracy: " my-route-acc "% (" my-correct-route "/" my-routed ")\n" + "Episode: " current-episode ", Tick: " ticks "\n" + "What patterns are you noticing? What would you do differently? " + "Keep your reflection to 2-3 sentences." + ) + output-print (word "[REFLECT:" persona-name "] " reflection) + ] [ + output-print (word "[REFLECT:" persona-name "] Failed: " error-message) + ] +end + +;; Manual reflection trigger +to force-reflect + ask dispatchers [ dispatcher-reflect ] +end + +;; =========================================================================== +;; EPISODE BOUNDARY + MEMORY MANAGEMENT +;; =========================================================================== + +to handle-episode-boundary + if episode-length = 0 [ stop ] ;; No episode boundaries + if episode-tick-counter < episode-length [ stop ] + + ;; Episode ended + set current-episode current-episode + 1 + set episode-tick-counter 0 + output-print (word "[EPISODE] Starting episode " current-episode " | Memory mode: " memory-mode) + + ask dispatchers [ + if memory-mode = "per-episode" [ + ;; Clear and re-inject persona + carefully [ + llm:clear-history + llm:set-history (list + (list "system" persona-prompt) + ) + output-print (word "[MEMORY:" persona-name "] History cleared, persona re-injected") + ] [ + output-print (word "[MEMORY:" persona-name "] Reset 
failed: " error-message) + ] + ] + if memory-mode = "none" [ + ;; Clear everything every episode + carefully [ + llm:clear-history + output-print (word "[MEMORY:" persona-name "] History fully cleared") + ] [ + output-print (word "[MEMORY:" persona-name "] Clear failed: " error-message) + ] + ] + ;; "persistent" mode: do nothing, history accumulates + ] +end + +;; =========================================================================== +;; METRIC REPORTERS +;; =========================================================================== + +to-report triage-accuracy + ifelse total-triaged > 0 + [ report precision (correct-triage / total-triaged * 100) 1 ] + [ report 0 ] +end + +to-report route-accuracy + ifelse total-routed > 0 + [ report precision (correct-route / total-routed * 100) 1 ] + [ report 0 ] +end + +to-report late-rate + let total-dispatched total-routed + ifelse total-dispatched > 0 + [ report precision (total-late / total-dispatched * 100) 1 ] + [ report 0 ] +end + +to-report escalation-rate + ifelse total-routed > 0 + [ report precision (total-escalated / total-routed * 100) 1 ] + [ report 0 ] +end + +to-report avg-response-time + ifelse total-resolved > 0 + [ report precision (total-response-ticks / total-resolved) 1 ] + [ report 0 ] +end + +to-report misleading-accuracy + ifelse misleading-triaged > 0 + [ report precision (misleading-correct / misleading-triaged * 100) 1 ] + [ report 0 ] +end + +to-report persona-accuracy-report + report (word + map [ d -> + (word [persona-name] of d ": " + ifelse-value ([my-triaged] of d > 0) + [ (word precision ([my-correct-triage] of d / [my-triaged] of d * 100) 0 "%") ] + [ "N/A" ] + ) + ] sort dispatchers + ) +end + +to-report veteran-accuracy + let d one-of dispatchers with [persona-name = "Veteran"] + if d = nobody [ report "N/A" ] + ifelse [my-triaged] of d > 0 + [ report (word precision ([my-correct-triage] of d / [my-triaged] of d * 100) 0 "%") ] + [ report "N/A" ] +end + +to-report rookie-accuracy + let 
d one-of dispatchers with [persona-name = "Rookie"] + if d = nobody [ report "N/A" ] + ifelse [my-triaged] of d > 0 + [ report (word precision ([my-correct-triage] of d / [my-triaged] of d * 100) 0 "%") ] + [ report "N/A" ] +end + +to-report analyst-accuracy + let d one-of dispatchers with [persona-name = "Analyst"] + if d = nobody [ report "N/A" ] + ifelse [my-triaged] of d > 0 + [ report (word precision ([my-correct-triage] of d / [my-triaged] of d * 100) 0 "%") ] + [ report "N/A" ] +end + +to-report llm-status + let result "N/A" + carefully [ + set result (word llm:active) + ] [ + ;; keep default + ] + report result +end + +to-report queue-new-count + report count incidents with [ queue-state = "new" ] +end + +to-report queue-triaged-count + report count incidents with [ queue-state = "triaged" ] +end + +to-report queue-active-count + report count incidents with [ queue-state = "active" or queue-state = "late" ] +end + +to-report queue-resolved-count + report count incidents with [ queue-state = "resolved" ] +end +]]> + + + + + + + + + + + + + + + + llm-status + current-episode + memory-mode + queue-new-count + queue-triaged-count + queue-active-count + triage-accuracy + route-accuracy + misleading-accuracy + avg-response-time + veteran-accuracy + rookie-accuracy + analyst-accuracy + late-rate + escalation-rate + total-resolved + + + + + + plot triage-accuracy + + + + plot route-accuracy + + + + plot misleading-accuracy + + + + + + + + plot queue-new-count + + + + plot queue-active-count + + + + plot total-resolved + + + + plot total-late + + + + + ## Crisis Triage with Ambiguous Incidents + +### The Story + +A municipal emergency operations center receives a stream of crisis reports. Three dispatchers — a Veteran, a Rookie, and an Analyst — must assess each incident's severity and route it to the appropriate response tier (Basic, Expert, or Coordinator). + +The twist: many incidents are **deliberately misleading**. 
A "toxic chemical spill at a school" turns out to be spilled vinegar. A "minor water leak" threatens a neonatal ICU. Naive keyword matching fails on these cases — but an LLM reading the full impact description can get them right. + +### What This Demonstrates + +This demo exercises 8 LLM extension primitives, grounded in the Gao et al. (2312.11970) LLM-ABM survey: + +| Primitive | Where Used | Paper Concept | +|-----------|-----------|---------------| +| `llm:load-config` | Setup | Config management | +| `llm:set-history` | Dispatcher personas | Personalization (Ch.2) | +| `llm:chat-with-template` | Severity triage | Environment/Interface (Ch.1) | +| `llm:choose` | Tier routing | Bounded Rationality | +| `llm:history` | Reflection trigger | Memory (Ch.3) | +| `llm:chat` | Dispatcher reflection | Reflection (Ch.3) | +| `llm:clear-history` | Episode boundaries | Memory ablation | +| `llm:active` | Status monitor | Provider awareness | + +### Quick Start + +1. Edit `config.txt` with your provider credentials (default: local Ollama). +2. Click **setup**. +3. Click **go**. +4. Watch the output log for `[TRIAGE]`, `[ROUTE]`, and `[REFLECT]` messages. +5. Compare the **Misleading%** monitor — this is where the LLM shines vs heuristics. + +### The A/B Experiment + +Toggle **use-llm?** OFF to switch to pure heuristic mode: + +- **Heuristic mode**: Keyword matching triggers on "fire", "toxic", "collapse" etc. Works fine on clear cases (~70%) but scores ~30% on misleading cases where keywords don't match reality. +- **LLM mode**: Reads the full impact description. Expected ~70%+ on misleading cases. + +Run both modes for 50+ ticks and compare the Accuracy Over Time plot. 
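
The failure mode of the heuristic described above can be illustrated with a minimal sketch. This is not the model's actual NetLogo heuristic: the function name `keyword_severity`, the two-level HIGH/LOW mapping, and the keyword tuple are assumptions made for illustration, using only the keywords and example incidents named in the text.

```python
# Hypothetical keyword heuristic; the NetLogo model's real heuristic may differ.
ALARM_KEYWORDS = ("toxic", "fire", "collapse")  # keywords named in the text


def keyword_severity(headline: str) -> str:
    """Return HIGH if any alarm keyword appears in the headline, else LOW."""
    text = headline.lower()
    return "HIGH" if any(word in text for word in ALARM_KEYWORDS) else "LOW"


# (headline, ground-truth severity) pairs from the misleading examples
cases = [
    ("Toxic chemical spill at school", "LOW"),     # actually spilled vinegar
    ("Minor water leak in basement", "CRITICAL"),  # threatens a neonatal ICU
]

# Collect (headline, predicted, truth) and keep only the mismatches
results = [(headline, keyword_severity(headline), truth) for headline, truth in cases]
misses = [row for row in results if row[1] != row[2]]
```

Both cases land in `misses`: the headline alone carries the wrong signal, which is why the triage prompt also passes the impact description.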
+ +### Controls + +- **use-llm?**: Toggle between LLM dispatchers and naive heuristic +- **memory-mode**: How dispatcher memory works across episodes + - *persistent*: Full conversation history retained + - *per-episode*: History cleared each episode, persona re-injected + - *none*: History cleared each episode, no persona +- **reflection-interval**: How often dispatchers reflect on their performance (0 = never) +- **incident-rate**: Probability (%) of a new incident each tick +- **episode-length**: Ticks per episode (0 = no episodes) + +### What to Observe + +- **Triage Acc%**: How often dispatchers match ground-truth severity +- **Misleading%**: Accuracy specifically on misleading incidents (the key metric) +- **Route Acc%**: How often incidents go to the correct response tier +- **Per-persona differences**: Veteran vs Rookie vs Analyst performance +- **Reflection output**: Watch dispatchers reason about their own performance in the log +- **Memory effects**: Compare persistent vs per-episode vs none over multiple episodes + +### Design Rationale + +**Why dispatchers (not responders) use the LLM**: Triage and routing are judgment calls where context matters. Responders' case processing is mechanical and does not benefit from language understanding. + +**Why no thinking/reasoning models**: Three reasons: speed (3 dispatchers x 2 calls/tick would take minutes with thinking enabled), cost (300+ calls per session), and the fact that extended reasoning is overkill for classification tasks. + +**Why `llm:choose` for routing**: It guarantees that the output is one of the valid tiers, avoiding parsing failures. The extension handles fuzzy matching and falls back to random choice if the LLM response can't be parsed.
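
The bounded-choice idea behind `llm:choose` can be sketched as follows. This is an illustration of the concept only, not the extension's implementation: `bounded_choice` is a hypothetical helper, and plain case-insensitive substring matching stands in for whatever fuzzy matching the extension actually performs. The tier list is the one quoted in `dispatcher-template.yaml`.

```python
import random

TIERS = ["BASIC", "EXPERT", "COORDINATOR", "HOLD"]


def bounded_choice(reply, choices, rng=random):
    """Map a free-text reply onto one of `choices`.

    Assumption: a case-insensitive substring match approximates the
    extension's fuzzy matching. If several choices appear, the one that
    occurs earliest in the reply wins. If none appears, pick at random.
    """
    text = reply.upper()
    hits = [(text.find(c.upper()), c) for c in choices if c.upper() in text]
    if hits:
        return min(hits)[1]
    return rng.choice(choices)
```

Whatever the model replies, the caller always receives a member of `choices`, so downstream routing code never has to handle an unparseable tier.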
+ [turtleShapes and linkShapes XML omitted] + setup repeat 30 [ go ] + diff --git a/demos/crisis-triage/dispatcher-template.yaml b/demos/crisis-triage/dispatcher-template.yaml new file mode 100644 index 0000000..f018c6d --- /dev/null +++ b/demos/crisis-triage/dispatcher-template.yaml @@ -0,0 +1,17 @@ +# ABOUTME: Documentation stub for the dispatcher routing step. +# ABOUTME: Routing now uses llm:choose for bounded tier selection instead of template parsing. +# +# This file is kept for reference. The actual routing in crisis-triage.nlogox +# uses llm:choose with choices ["BASIC" "EXPERT" "COORDINATOR" "HOLD"], +# which guarantees the response is one of the valid tiers. +# +# The dispatcher's conversational context (persona, history) is maintained +# via llm:set-history and accumulated through llm:chat-with-template calls. +system: "You are a crisis operations dispatcher. Route incidents to the appropriate response tier." +template: | + Severity: {severity} + Incident: {incident} + Current load — BASIC: {basic_load}, EXPERT: {expert_load}, COORDINATOR: {coordinator_load} + + Choose the best response tier considering severity and current workload.
+ Respond with EXACTLY ONE of: BASIC, EXPERT, COORDINATOR, HOLD diff --git a/demos/crisis-triage/tests/README.md b/demos/crisis-triage/tests/README.md new file mode 100644 index 0000000..edd3703 --- /dev/null +++ b/demos/crisis-triage/tests/README.md @@ -0,0 +1,20 @@ +# Crisis Triage Demo Tests + +Run from repository root: + +```bash +python -m unittest discover -s demos/crisis-triage/tests -p "test_*.py" -v +``` + +These tests validate (29 tests, no API calls): + +- Presence of all required demo files +- Breed declarations (dispatchers, incidents, responders) +- Required procedures (setup, triage, routing, reflection, episode boundary) +- All 8 LLM primitives present in code +- Template placeholder consistency with model substitutions +- Config key completeness and max_tokens=200 +- README documentation sections +- XML structure (widgets, shapes, plots, CDATA) +- Incident bank has 30 entries (10 misleading + 10 clear + 10 borderline) +- Procedure block matching (every `to` has an `end`) diff --git a/demos/crisis-triage/tests/__pycache__/test_crisis_triage.cpython-312.pyc b/demos/crisis-triage/tests/__pycache__/test_crisis_triage.cpython-312.pyc new file mode 100644 index 0000000..78a8e71 Binary files /dev/null and b/demos/crisis-triage/tests/__pycache__/test_crisis_triage.cpython-312.pyc differ diff --git a/demos/crisis-triage/tests/test_crisis_triage.py b/demos/crisis-triage/tests/test_crisis_triage.py new file mode 100644 index 0000000..183920d --- /dev/null +++ b/demos/crisis-triage/tests/test_crisis_triage.py @@ -0,0 +1,287 @@ +# ABOUTME: Static validation tests for the crisis triage demo. +# ABOUTME: Tests file structure, XML format, code structure, and template consistency. 
+ +import re +import unittest +import xml.etree.ElementTree as ET +from pathlib import Path + + +DEMO_DIR = Path(__file__).resolve().parents[1] +MODEL_PATH = DEMO_DIR / "crisis-triage.nlogox" +TRIAGE_TEMPLATE_PATH = DEMO_DIR / "triage-template.yaml" +DISPATCHER_TEMPLATE_PATH = DEMO_DIR / "dispatcher-template.yaml" +CONFIG_PATH = DEMO_DIR / "config.txt" +README_PATH = DEMO_DIR / "README.md" + + +def read(path: Path) -> str: + return path.read_text(encoding="utf-8") + + +def parse_model() -> ET.Element: + return ET.parse(MODEL_PATH).getroot() + + +def model_code_only() -> str: + root = parse_model() + code_elem = root.find("code") + if code_elem is None or code_elem.text is None: + raise AssertionError("unable to extract content from model XML") + return code_elem.text + + +def parse_config(path: Path) -> dict[str, str]: + data: dict[str, str] = {} + for raw in read(path).splitlines(): + line = raw.strip() + if not line or line.startswith("#"): + continue + if "=" not in line: + continue + key, value = line.split("=", 1) + data[key.strip()] = value.strip() + return data + + +class TestCrisisTriageArtifacts(unittest.TestCase): + def test_required_files_exist(self) -> None: + required = [ + MODEL_PATH, + TRIAGE_TEMPLATE_PATH, + DISPATCHER_TEMPLATE_PATH, + CONFIG_PATH, + README_PATH, + ] + for path in required: + self.assertTrue(path.exists(), f"missing file: {path}") + + def test_model_declares_breeds(self) -> None: + code = model_code_only() + self.assertIn("breed [ dispatchers dispatcher ]", code) + self.assertIn("breed [ incidents incident ]", code) + self.assertIn("breed [ responders responder ]", code) + + def test_model_contains_required_procedures(self) -> None: + code = model_code_only() + procedures = [ + "to setup", + "to setup-llm", + "to setup-dispatchers", + "to setup-responders", + "to go", + "to triage-my-incidents", + "to route-my-incidents", + "to process-active-cases", + "to dispatcher-reflect", + "to handle-episode-boundary", + ] + for proc in 
procedures: + self.assertIn(proc, code, f"missing procedure: {proc}") + + def test_model_uses_llm_config_and_template(self) -> None: + code = model_code_only() + self.assertIn('set config-path "demos/crisis-triage/config.txt"', code) + self.assertIn('set triage-template-path "demos/crisis-triage/triage-template.yaml"', code) + self.assertIn("llm:chat-with-template triage-template-path", code) + + def test_model_uses_all_eight_primitives(self) -> None: + code = model_code_only() + primitives = [ + "llm:load-config", + "llm:set-history", + "llm:chat-with-template", + "llm:choose", + "llm:history", + "llm:chat", + "llm:clear-history", + "llm:active", + ] + for prim in primitives: + self.assertIn(prim, code, f"missing LLM primitive: {prim}") + + def test_triage_template_placeholders_match_model(self) -> None: + template = read(TRIAGE_TEMPLATE_PATH) + placeholders = set(re.findall(r"\{([a-zA-Z_][a-zA-Z0-9_]*)\}", template)) + self.assertEqual( + placeholders, + {"persona", "episode", "tick", "incident", "impact"}, + ) + + def test_config_has_required_keys(self) -> None: + config = parse_config(CONFIG_PATH) + for key in ["provider", "model", "temperature", "max_tokens", "timeout_seconds"]: + self.assertIn(key, config, f"missing key in config: {key}") + + def test_config_max_tokens_is_200(self) -> None: + config = parse_config(CONFIG_PATH) + self.assertEqual(config["max_tokens"], "200") + + def test_readme_has_core_sections(self) -> None: + readme = read(README_PATH) + for text in [ + "Quick Start", + "A/B Experiment", + "Design Rationale", + "Paper Connection", + ]: + self.assertIn(text, readme) + + +class TestModelXmlParsing(unittest.TestCase): + def setUp(self) -> None: + self.root = parse_model() + + def test_model_parses_as_valid_xml(self) -> None: + self.assertEqual(self.root.tag, "model") + + def test_code_element_contains_cdata_content(self) -> None: + code_elem = self.root.find("code") + self.assertIsNotNone(code_elem, "missing element") + 
self.assertIsNotNone(code_elem.text, "<code> element has no text content") + self.assertIn("extensions [ llm ]", code_elem.text) + + def test_raw_file_preserves_cdata_wrapping(self) -> None: + raw = read(MODEL_PATH) + self.assertIn("<![CDATA[", raw) + self.assertIn("]]>", raw) + + def test_widgets_section_has_expected_children(self) -> None: + widgets = self.root.find("widgets") + self.assertIsNotNone(widgets, "missing <widgets> section") + child_tags = [child.tag for child in widgets] + self.assertIn("view", child_tags) + self.assertIn("button", child_tags) + self.assertIn("monitor", child_tags) + self.assertIn("switch", child_tags) + self.assertIn("chooser", child_tags) + self.assertIn("slider", child_tags) + self.assertIn("plot", child_tags) + + def test_widgets_button_count(self) -> None: + widgets = self.root.find("widgets") + buttons = widgets.findall("button") + self.assertEqual(len(buttons), 4, "expected 4 buttons: setup, go, add-incident, force-reflect") + + def test_widgets_monitor_count(self) -> None: + widgets = self.root.find("widgets") + monitors = widgets.findall("monitor") + self.assertGreaterEqual(len(monitors), 12, "expected at least 12 monitors") + + def test_widgets_plot_count(self) -> None: + widgets = self.root.find("widgets") + plots = widgets.findall("plot") + self.assertEqual(len(plots), 2, "expected 2 plots: Accuracy Over Time, Case Flow") + + def test_turtle_shapes_defined(self) -> None: + shapes = self.root.find("turtleShapes") + self.assertIsNotNone(shapes, "missing <turtleShapes> section") + shape_names = [s.get("name") for s in shapes.findall("shape")] + self.assertIn("default", shape_names) + self.assertIn("circle", shape_names) + self.assertIn("person", shape_names) + + +class TestModelStructure(unittest.TestCase): + def setUp(self) -> None: + self.root = parse_model() + + def test_netlogo_version_is_7_0_3(self) -> None: + version = self.root.get("version") + self.assertEqual(version, "NetLogo 7.0.3") + + def test_required_top_level_sections_exist(self) -> None: + required_sections = [ + "code",
"widgets", "info", "turtleShapes", "linkShapes", + "previewCommands", + ] + present = {child.tag for child in self.root} + for section in required_sections: + self.assertIn(section, present, f"missing top-level section: {section}") + + def test_info_section_not_empty(self) -> None: + info = self.root.find("info") + self.assertIsNotNone(info, "missing <info> section") + self.assertTrue( + info.text and len(info.text.strip()) > 0, + "<info> section is empty", + ) + + def test_preview_commands_present(self) -> None: + preview = self.root.find("previewCommands") + self.assertIsNotNone(preview) + self.assertIn("setup", preview.text) + + def test_link_shapes_has_default(self) -> None: + link_shapes = self.root.find("linkShapes") + self.assertIsNotNone(link_shapes, "missing <linkShapes>") + names = [s.get("name") for s in link_shapes.findall("shape")] + self.assertIn("default", names) + + +class TestBehaviorRegression(unittest.TestCase): + def setUp(self) -> None: + self.code = model_code_only() + + def test_extensions_declaration_present(self) -> None: + self.assertIn("extensions [ llm ]", self.code) + + def test_chat_with_template_uses_list_syntax(self) -> None: + lines = self.code.splitlines() + for line in lines: + stripped = line.strip() + if "llm:chat-with-template" not in stripped: + continue + self.assertNotRegex( + stripped, + r'llm:chat-with-template\s+\S+\s+\[\[', + f"bracket syntax found instead of (list ...): {stripped}", + ) + + def test_no_inline_provider_setup_in_procedures(self) -> None: + for deprecated in ["llm:set-provider", "llm:set-api-key", "llm:set-model"]: + self.assertNotIn( + deprecated, + self.code, + f"deprecated inline primitive found: {deprecated}", + ) + + def test_all_procedure_blocks_are_closed(self) -> None: + opens = len(re.findall(r"^to(?:-report)?\s", self.code, re.MULTILINE)) + closes = len(re.findall(r"^end\s*$", self.code, re.MULTILINE)) + self.assertEqual( + opens, + closes, + f"mismatched procedure blocks: {opens} opens vs {closes} ends", + ) + + def
test_no_deprecated_primitives(self) -> None: + deprecated = [ + "llm:ask", + "llm:send", + "llm:query", + "llm:prompt", + ] + for prim in deprecated: + self.assertNotIn(prim, self.code, f"deprecated primitive: {prim}") + + def test_globals_declared(self) -> None: + self.assertIn("globals [", self.code) + for g in ["llm-ready?", "config-path", "triage-template-path", + "incident-bank", "total-triaged", "correct-triage"]: + self.assertIn(g, self.code, f"missing global: {g}") + + def test_incident_bank_has_30_entries(self) -> None: + """The incident bank should contain 30 incidents (10 misleading + 10 clear + 10 borderline).""" + code = self.code + # Count (list " patterns inside build-incident-bank — each incident starts with (list " + bank_start = code.find("to build-incident-bank") + bank_end = code.find("\nend", bank_start) + bank_code = code[bank_start:bank_end] + incident_count = bank_code.count('(list "') + # The outer (list wrapping all incidents doesn't start with (list " + self.assertEqual(incident_count, 30, f"expected 30 incidents, found {incident_count}") + + +if __name__ == "__main__": + unittest.main() diff --git a/demos/crisis-triage/triage-template.yaml b/demos/crisis-triage/triage-template.yaml new file mode 100644 index 0000000..cd9745d --- /dev/null +++ b/demos/crisis-triage/triage-template.yaml @@ -0,0 +1,24 @@ +# ABOUTME: Triage template for crisis severity assessment with calibration anchors. +# ABOUTME: Used by dispatchers via llm:chat-with-template to classify incident severity. +system: | + You are a crisis triage specialist with this background: {persona} + This is episode {episode}, tick {tick} of a municipal emergency simulation. + + IMPORTANT: Do NOT rely on scary-sounding keywords alone. A "fire alarm" in a + server room may be a sensor malfunction. A "data center cooling loss" may threaten + lives if hospitals depend on it. Assess the ACTUAL described impact, not the + surface-level vocabulary. 
+ + Severity definitions: + - LOW: No injuries, no infrastructure at risk, routine response adequate. + - MODERATE: Minor injuries or limited disruption, single-agency response sufficient. + - HIGH: Significant injuries, infrastructure at risk, or time-sensitive escalation potential. + - CRITICAL: Life-threatening, multi-agency coordination needed, cascading failures, or large population affected. + + Classify severity as exactly one of: LOW, MODERATE, HIGH, CRITICAL. +template: | + Incident: {incident} + Impact: {impact} + + Based on the described impact (not keywords), classify this incident severity. + Reply with the severity level first (LOW, MODERATE, HIGH, or CRITICAL), then a brief reason.
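
Because the template asks for the severity level first and a brief reason after it, the caller has to pull the label out of a free-text reply. The sketch below shows one way that could be done. It is an assumption about the parsing step, not the code in `crisis-triage.nlogox` (which is NetLogo), and `parse_severity` is a hypothetical name.

```python
import re

# The four labels defined in the template's system prompt
SEVERITY_PATTERN = re.compile(r"\b(LOW|MODERATE|HIGH|CRITICAL)\b")


def parse_severity(reply: str):
    """Return the first severity label found in the reply, or None.

    Word boundaries keep "highway" from matching HIGH and "below" from
    matching LOW. Upper-casing makes the match case-insensitive.
    """
    match = SEVERITY_PATTERN.search(reply.upper())
    return match.group(1) if match else None
```

A `None` result signals an unparseable reply, which the caller can count as a triage failure or retry rather than silently guessing a severity.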