Summary
The current codebase generates <think>...</think> blocks with any Ollama reasoning model (Qwen 3.x, gpt-oss, DeepSeek-R1, etc.), then post-processes them out via RemoveThinkTagFromAssistantMessages in Wrapper.py. This is correct for cleanup but expensive at generation time — the framework pays the latency cost of thinking even when the call's purpose doesn't benefit from it.
A small dynamic affordance — per-stage ?think=true|false in the model URL — would let users (and the framework) get the best of both behaviors with no breaking change.
Background
Reasoning models default-on to thinking mode unless the chat API explicitly passes think: false. With the framework's URL parameter system (ollama://model?param=value) this should be a natural fit, except for one constraint: the existing URL parameter parser hard-casts every value through float():
|
QueryParams = parse_qs(parsed.query) |
|
|
|
# Flatten QueryParams |
# Flatten QueryParams
for key in QueryParams:
QueryParams[key] = float(QueryParams[key][0])
So ollama://qwen3.5:122b-a10b?think=false raises ValueError: could not convert string to float: 'false' before the value ever reaches the chat call. Booleans (and strings) need a small parser upgrade before this feature is possible.
Motivation — why dynamic rather than a single Config flag
Different pipeline stages benefit from different think settings:
| Stage |
Want thinking? |
Why |
| Outline generation, story elements, per-chapter outlines |
off |
Wastes tokens on internal reasoning the framework cannot see; output is the artifact. |
| Scene-by-scene chapter writing (Stage 1) |
off |
Prose generation. Tokens should produce prose, not silent thinking. |
| Character / dialogue passes (Stage 2/3) |
off |
Same — refinement, prose-quality. |
LLMSummaryCheck (chapter ↔ outline alignment verdict) |
on |
Structured judgment, true/false outcome. Thinking improves calibration. |
GetFeedbackOnChapter, GetChapterRating (Stage 5 critique loop) |
on |
Critique is the canonical reasoning task; thinking lifts quality. |
| Eval / Info / Scrub structured-output passes |
on |
Same — better verdicts when the model is allowed to reason internally. |
A single global flag misses this. A per-stage flag in Config.py (one per *_MODEL) bloats the config surface. A per-URL ?think=true|false rides on the existing URL-as-config mechanism, defaults to model-native behavior when omitted, and lets users mix-and-match across the 13 model knobs already in Config.py.
This also dovetails with a related architecture choice many users land on for v1 of dual-model setups: a fast generation model with thinking off (Qwen 3.x 122B, Mistral Medium 3.5, etc.) for the writing knobs, and a thinking-capable critic model (gpt-oss 120b, DeepSeek-R1 distills) for EVAL_MODEL / REVISION_MODEL / CHECKER_MODEL. The framework already supports per-stage model selection; per-stage ?think completes the picture.
Proposed implementation
Two small edits in Writer/Interface/Wrapper.py:
1. Smarter URL query parameter parser
Replace the float-only cast with bool → float → raw-string coercion so ?think=true, ?think=false, ?temperature=0.8, and (future) string parameters all work:
for key in QueryParams:
raw = QueryParams[key][0]
if raw.lower() == 'true':
QueryParams[key] = True
elif raw.lower() == 'false':
QueryParams[key] = False
else:
try:
QueryParams[key] = float(raw)
except ValueError:
QueryParams[key] = raw
2. Pop think from ModelOptions and pass it to the Ollama chat call
think is a chat-level kwarg in the ollama Python client, not a model option, so it has to live outside the options= dict and the ValidParameters allowlist:
# Extract chat-level kwargs (separate from Ollama model options)
ChatExtras = {}
if "think" in ModelOptions:
ChatExtras["think"] = bool(ModelOptions.pop("think"))
# ... existing ValidParameters check / num_ctx default / JSON-mode handling ...
Stream = self.Clients[_Model].chat(
model=ProviderModel,
messages=_Messages,
stream=True,
options=ModelOptions,
**ChatExtras,
)
Behavioral contract
- No
?think= in URL: behavior unchanged. Default Ollama / model behavior preserved. No breaking change for anyone.
?think=false: thinking suppressed; <think> blocks not generated; post-hoc strip becomes a no-op. Faster generation, no quality loss for non-reasoning tasks.
?think=true: thinking forced on (useful when overriding a model that defaults off, e.g., gpt-oss when used as a critic).
Compatibility
- Requires
ollama Python client ≥ 0.4 (introduced the think kwarg). Current requirements.txt is unpinned at ollama; latest stable already supports this.
- No change to any non-Ollama provider path. The smarter parameter parser is provider-agnostic (it runs before the provider branch) but is strictly more permissive than the current parser — any URL that worked before still works.
- No change to
Config.py defaults. Users opt in by adding ?think=true|false to whichever model URLs they want.
PR to follow.
Summary
The current codebase generates
<think>...</think>blocks with any Ollama reasoning model (Qwen 3.x, gpt-oss, DeepSeek-R1, etc.), then post-processes them out viaRemoveThinkTagFromAssistantMessagesinWrapper.py. This is correct for cleanup but expensive at generation time — the framework pays the latency cost of thinking even when the call's purpose doesn't benefit from it.A small dynamic affordance — per-stage
?think=true|falsein the model URL — would let users (and the framework) get the best of both behaviors with no breaking change.Background
Reasoning models default-on to thinking mode unless the chat API explicitly passes
think: false. With the framework's URL parameter system (ollama://model?param=value) this should be a natural fit, except for one constraint: the existing URL parameter parser hard-casts every value throughfloat():AIStoryWriter/Writer/Interface/Wrapper.py
Lines 506 to 508 in 161b712
So
ollama://qwen3.5:122b-a10b?think=falseraisesValueError: could not convert string to float: 'false'before the value ever reaches the chat call. Booleans (and strings) need a small parser upgrade before this feature is possible.Motivation — why dynamic rather than a single Config flag
Different pipeline stages benefit from different
thinksettings:LLMSummaryCheck(chapter ↔ outline alignment verdict)GetFeedbackOnChapter,GetChapterRating(Stage 5 critique loop)A single global flag misses this. A per-stage flag in
Config.py(one per*_MODEL) bloats the config surface. A per-URL?think=true|falserides on the existing URL-as-config mechanism, defaults to model-native behavior when omitted, and lets users mix-and-match across the 13 model knobs already inConfig.py.This also dovetails with a related architecture choice many users land on for v1 of dual-model setups: a fast generation model with thinking off (Qwen 3.x 122B, Mistral Medium 3.5, etc.) for the writing knobs, and a thinking-capable critic model (gpt-oss 120b, DeepSeek-R1 distills) for
EVAL_MODEL/REVISION_MODEL/CHECKER_MODEL. The framework already supports per-stage model selection; per-stage?thinkcompletes the picture.Proposed implementation
Two small edits in
Writer/Interface/Wrapper.py:1. Smarter URL query parameter parser
Replace the
float-only cast with bool → float → raw-string coercion so?think=true,?think=false,?temperature=0.8, and (future) string parameters all work:2. Pop
thinkfromModelOptionsand pass it to the Ollama chat callthinkis a chat-level kwarg in theollamaPython client, not a model option, so it has to live outside theoptions=dict and theValidParametersallowlist:Behavioral contract
?think=in URL: behavior unchanged. Default Ollama / model behavior preserved. No breaking change for anyone.?think=false: thinking suppressed;<think>blocks not generated; post-hoc strip becomes a no-op. Faster generation, no quality loss for non-reasoning tasks.?think=true: thinking forced on (useful when overriding a model that defaults off, e.g., gpt-oss when used as a critic).Compatibility
ollamaPython client ≥ 0.4 (introduced thethinkkwarg). Currentrequirements.txtis unpinned atollama; latest stable already supports this.Config.pydefaults. Users opt in by adding?think=true|falseto whichever model URLs they want.PR to follow.