
Dynamic per-stage think control for Ollama reasoning models (?think=true|false) #88

Description

@acato

Summary

The current codebase lets any Ollama reasoning model (Qwen 3.x, gpt-oss, DeepSeek-R1, etc.) generate <think>...</think> blocks, then strips them in post-processing via RemoveThinkTagFromAssistantMessages in Wrapper.py. The cleanup is correct, but it comes too late to save anything: the framework pays the latency cost of thinking even when the call's purpose doesn't benefit from it.

A small dynamic affordance, a per-stage ?think=true|false in the model URL, would let users (and the framework) keep thinking where it helps and skip it where it doesn't, with no breaking change.

Background

Reasoning models think by default unless the chat request explicitly passes think: false. With the framework's URL parameter system (ollama://model?param=value) this should be a natural fit, except for one constraint: the existing URL parameter parser hard-casts every value through float():

QueryParams = parse_qs(parsed.query)
# Flatten QueryParams
for key in QueryParams:
    QueryParams[key] = float(QueryParams[key][0])

So ollama://qwen3.5:122b-a10b?think=false raises ValueError: could not convert string to float: 'false' before the value ever reaches the chat call. Booleans (and strings) need a small parser upgrade before this feature is possible.
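The failure can be reproduced in isolation. The sketch below mirrors the float-only flattening using only the standard library; it does not import Wrapper.py, and the lowercase variable names are illustrative rather than the framework's own.

```python
from urllib.parse import urlparse, parse_qs

url = "ollama://qwen3.5:122b-a10b?think=false"
parsed = urlparse(url)
query_params = parse_qs(parsed.query)  # {'think': ['false']}

error_message = None
try:
    # Same float-only cast as the current parser.
    for key in query_params:
        query_params[key] = float(query_params[key][0])
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

The cast fails on the first non-numeric value, so no parameter from such a URL reaches the chat call.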

Motivation — why dynamic rather than a single Config flag

Different pipeline stages benefit from different think settings:

| Stage | Want thinking? | Why |
| --- | --- | --- |
| Outline generation, story elements, per-chapter outlines | off | Wastes tokens on internal reasoning the framework cannot see; output is the artifact. |
| Scene-by-scene chapter writing (Stage 1) | off | Prose generation. Tokens should produce prose, not silent thinking. |
| Character / dialogue passes (Stage 2/3) | off | Same: refinement, prose-quality. |
| LLMSummaryCheck (chapter ↔ outline alignment verdict) | on | Structured judgment, true/false outcome. Thinking improves calibration. |
| GetFeedbackOnChapter, GetChapterRating (Stage 5 critique loop) | on | Critique is the canonical reasoning task; thinking lifts quality. |
| Eval / Info / Scrub structured-output passes | on | Same: better verdicts when the model is allowed to reason internally. |

A single global flag misses this. A per-stage flag in Config.py (one per *_MODEL) bloats the config surface. A per-URL ?think=true|false rides on the existing URL-as-config mechanism, defaults to model-native behavior when omitted, and lets users mix-and-match across the 13 model knobs already in Config.py.

This also dovetails with an architecture many users land on for their first dual-model setup: a fast generation model with thinking off (Qwen 3.x 122B, Mistral Medium 3.5, etc.) for the writing knobs, and a thinking-capable critic model (gpt-oss 120b, DeepSeek-R1 distills) for EVAL_MODEL / REVISION_MODEL / CHECKER_MODEL. The framework already supports per-stage model selection; per-stage ?think completes the picture.
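As an illustration of the mix-and-match idea, a Config.py excerpt might look like the sketch below. Only EVAL_MODEL, REVISION_MODEL and CHECKER_MODEL are names taken from this issue; CHAPTER_WRITER_MODEL and the deepseek-r1:70b tag are placeholders chosen for the example, and these URLs would only parse once the parameter parser accepts booleans.

```python
# Hypothetical Config.py excerpt (not the framework's actual defaults).

# Writing knob: fast generation model, thinking suppressed.
CHAPTER_WRITER_MODEL = "ollama://qwen3.5:122b-a10b?think=false"  # hypothetical knob name

# Critique / verdict knobs: thinking-capable model, thinking requested explicitly.
EVAL_MODEL = "ollama://deepseek-r1:70b?think=true"
REVISION_MODEL = "ollama://deepseek-r1:70b?think=true"
CHECKER_MODEL = "ollama://deepseek-r1:70b?think=true"
```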

Proposed implementation

Two small edits in Writer/Interface/Wrapper.py:

1. Smarter URL query parameter parser

Replace the float-only cast with bool → float → raw-string coercion so ?think=true, ?think=false, ?temperature=0.8, and (future) string parameters all work:

for key in QueryParams:
    raw = QueryParams[key][0]
    if raw.lower() == 'true':
        QueryParams[key] = True
    elif raw.lower() == 'false':
        QueryParams[key] = False
    else:
        try:
            QueryParams[key] = float(raw)
        except ValueError:
            QueryParams[key] = raw
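For trying the coercion order outside the framework, here is a minimal standalone sketch of the same bool → float → raw-string logic. The helper names coerce_param and parse_model_url_params are hypothetical, and format=json is just a stand-in for a future string parameter.

```python
from urllib.parse import urlparse, parse_qs


def coerce_param(raw):
    """Coerce one query-string value: bool first, then float, else the raw string."""
    lowered = raw.lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    try:
        return float(raw)
    except ValueError:
        return raw


def parse_model_url_params(url):
    """Return {name: coerced value} for the query part of a model URL."""
    query = parse_qs(urlparse(url).query)
    # parse_qs yields a list per key; keep only the first value, as the loop above does.
    return {key: coerce_param(values[0]) for key, values in query.items()}


params = parse_model_url_params(
    "ollama://qwen3.5:122b-a10b?think=false&temperature=0.8&format=json"
)
print(params)
```

Note that under this ordering ?think=1 and ?think=0 come out as the floats 1.0 and 0.0 rather than booleans; only the literal words true and false (in any letter case) map to bool.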

2. Pop think from ModelOptions and pass it to the Ollama chat call

think is a chat-level kwarg in the ollama Python client, not a model option, so it has to live outside the options= dict and the ValidParameters allowlist:

# Extract chat-level kwargs (separate from Ollama model options)
ChatExtras = {}
if "think" in ModelOptions:
    ChatExtras["think"] = bool(ModelOptions.pop("think"))

# ... existing ValidParameters check / num_ctx default / JSON-mode handling ...

Stream = self.Clients[_Model].chat(
    model=ProviderModel,
    messages=_Messages,
    stream=True,
    options=ModelOptions,
    **ChatExtras,
)
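The separation of chat-level kwargs from model options can be exercised without an Ollama server. In the sketch below, split_chat_extras and FakeClient are hypothetical names: the fake client only records the keyword arguments it receives, standing in for self.Clients[_Model].

```python
def split_chat_extras(model_options):
    """Return (options, extras) without mutating the caller's dict."""
    options = dict(model_options)
    extras = {}
    if "think" in options:
        # think is a chat-level kwarg, so it must not stay in options.
        extras["think"] = bool(options.pop("think"))
    return options, extras


class FakeClient:
    """Stand-in for the Ollama client; records the kwargs chat() receives."""

    def __init__(self):
        self.calls = []

    def chat(self, **kwargs):
        self.calls.append(kwargs)
        return iter(())  # empty stream


client = FakeClient()
options, extras = split_chat_extras({"think": False, "temperature": 0.8})
client.chat(
    model="qwen3.5:122b-a10b",
    messages=[],
    stream=True,
    options=options,
    **extras,
)

# Without ?think= in the URL, no extra kwarg is produced at all.
plain_options, plain_extras = split_chat_extras({"temperature": 0.8})
```

Because ChatExtras stays empty when think is absent, the call is identical to today's call in that case.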

Behavioral contract

  • No ?think= in URL: behavior unchanged. Default Ollama / model behavior preserved. No breaking change for anyone.
  • ?think=false: thinking suppressed; <think> blocks not generated; post-hoc strip becomes a no-op. Faster generation, no quality loss for non-reasoning tasks.
  • ?think=true: thinking forced on (useful when overriding a model whose default is off).

Compatibility

  • Requires ollama Python client ≥ 0.5 (introduced the think kwarg). Current requirements.txt is unpinned at ollama; latest stable already supports this.
  • No change to any non-Ollama provider path. The smarter parameter parser is provider-agnostic (it runs before the provider branch) but is strictly more permissive than the current parser — any URL that worked before still works.
  • No change to Config.py defaults. Users opt in by adding ?think=true|false to whichever model URLs they want.
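Since requirements.txt is unpinned, an already-installed client could predate the think kwarg. One defensive option, which is not part of the proposal above, would be to inspect the chat callable's signature before adding think to ChatExtras. The sketch below uses stand-in functions (old_chat, new_chat) so it runs without ollama installed; accepts_kwarg is a hypothetical helper name.

```python
import inspect


def accepts_kwarg(func, name):
    """True if func declares `name` as a parameter or takes arbitrary **kwargs."""
    params = inspect.signature(func).parameters
    if name in params:
        return True
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Stand-ins for an older and a newer client's chat method (assumed shapes).
def old_chat(model, messages, stream=False, options=None):
    return None


def new_chat(model, messages, stream=False, think=None, options=None):
    return None


supports_old = accepts_kwarg(old_chat, "think")
supports_new = accepts_kwarg(new_chat, "think")
```

In the wrapper, the same check could be applied to the real client's chat method, skipping think (with a warning) when it is unsupported.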

PR to follow.
