Skip to content

feat(provider): adopt Gemma4 vMLX decode stack#470

Merged
Gajesh2007 merged 9 commits into
masterfrom
feat/gemma4-vmlx-decode-port
Jun 26, 2026
Merged

feat(provider): adopt Gemma4 vMLX decode stack#470
Gajesh2007 merged 9 commits into
masterfrom
feat/gemma4-vmlx-decode-port

Conversation

@Gajesh2007

@Gajesh2007 Gajesh2007 commented Jun 26, 2026

Copy link
Copy Markdown
Member

Summary

  • Points libs/mlx-swift-lm at Layr-Labs/mlx-swift-lm@main after feat: inherit Gemma4 vMLX decode stack mlx-swift-lm#56.
  • Points libs/mlx-swift at e20ea3d (feat/inherit-upstream-2026-06) intentionally, not main, because mlx-swift@main currently declares swift-tools-version: 6.3;(experimentalCGen) while d-inference is pinned to Swift 6.1.
  • Adds a gated provider-side Gemma4DecodeProfileTests live benchmark for Gemma-4-26B-A4B raw B=1 decode and optional text sample output.

Results

  • Integrated local package release benchmark: 79.5 tok/s on mlx-community/gemma-4-26b-a4b-it-8bit (M4 Max, B=1, 128 generated tokens).
  • 1k/1k real-content benchmark: 75.16 tok/s decode, 0.9315 ms/token prefill, coherent 4.7k-char output.
  • Prior local provider stack raw baseline: ~39 TPS.
  • Python mlx-vlm on the same machine/model: 69.7 TPS.
  • Real text sample was coherent with DARKBLOOM_GEMMA_PRINT_TEXT=1.

Before

flowchart LR
  A[Provider loads Gemma4 VLM] --> B[old submodule commits]
  B --> C[old Gemma4 VLM/MoE/RoPE path]
  C --> D[raw B=1 decode]
  D --> E[~39 TPS]
Loading

After

flowchart LR
  A[Provider loads Gemma4 VLM] --> B[mlx-swift e20ea3d + mlx-swift-lm main]
  B --> C[vMLX-derived Gemma4 VLM hot path]
  C --> D[compiled GLU helpers + weighted expert combine + KV compile foundation]
  D --> E[raw B=1 decode]
  E --> F[79.5 TPS release raw bench]
Loading

Verification

  • cd /tmp/dbval/gemma4rawbench && swift run -c release Gemma4RawBench against this repo's local packages: 79.5 TPS.
  • 1k/1k benchmark artifacts: /tmp/dbval/gemma4_1k_output.txt, /tmp/dbval/gemma4_1k_prompt_ids.json, /tmp/dbval/gemma4_1k_output_ids.json.
  • cd /tmp/dbval/d-inference-pr-main/provider-swift && swift build --build-tests: passed after cleaning stale SwiftPM cache.
  • Gated provider test added: DARKBLOOM_LIVE_MLX_TESTS=1 DARKBLOOM_LIVE_MLX_GEMMA=1 swift test --filter Gemma4DecodeProfileTests.

Notes

  • mlx-swift@main should be adopted later only after either d-inference moves to Swift 6.3 or mlx-swift main restores a Swift-6.1-compatible manifest. This PR keeps the documented Swift 6.1 toolchain path working.
  • Release provider swift test -c release remains blocked by an unrelated kv-se-harness compile issue, so the fair release throughput number was measured with a small temp executable depending on the same local packages.

@vercel

vercel Bot commented Jun 26, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
d-inference Ready Ready Preview Jun 26, 2026 8:24pm
d-inference-console-ui-dev Ready Ready Preview Jun 26, 2026 8:24pm
d-inference-landing Ready Ready Preview Jun 26, 2026 8:24pm

Request Review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 269e7fc751

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +25 to +27
if LiveInferenceFixtures.ensureMetallibColocated() == nil {
Issue.record("mlx.metallib not found near test bundle or in MLX_METALLIB_PATH/SOURCE")
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Return when metallib bootstrap fails

When this live profile is enabled on a machine without mlx.metallib colocated, the test records an issue but then continues into MLX setup/model execution. LiveInferenceFixtures.ensureMetallibColocated() is documented and used elsewhere as a guard because MLX can crash on the first GPU call without the metallib, so this intended skip/failure path can abort the whole filtered benchmark run instead of exiting cleanly.

Useful? React with 👍 / 👎.

Comment thread libs/mlx-swift-lm Outdated
@@ -1 +1 @@
Subproject commit 2b4b0d8dd123b7fde4f8f5fbcfd4fe8e3e4fe242
Subproject commit 461a0ab01c42fdae4d0b400259203ae3c7009224

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Restore Gemma4 video preprocessing

For Gemma4 video_url requests, this submodule bump drops the processor path that consumed input.videos: the previous Gemma4.swift built LMInput.ProcessedVideo and expanded <|video|> placeholders, while the new Gemma4.swift has no input.videos/ProcessedVideo handling. VLMRequestInference.buildUserInput still attaches videos and routes media requests through the container's processor, so advertised Gemma4 video inputs will be sent to generation without frame features (or hit placeholder/feature mismatches) rather than being answered from the video.

Useful? React with 👍 / 👎.

@ethenotethan ethenotethan left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated Code Review — Layr-Labs/d-inference#

Verdict: COMMENT

Security — ✅ No issues found

Performance — 1 finding(s)

  • 🔵 [INFO] provider-swift/Tests/ProviderCoreTests/Gemma4DecodeProfileTests.swift:93 — Array grows in loop without pre-allocation despite reserveCapacity
    • Suggestion: Move reserveCapacity(64) before the loop or use a fixed-size array since the loop count is known (64 iterations)

Type_diligence — ✅ No issues found

Additive_complexity — ✅ No issues found

1 finding(s) total, 0 blocking. Verdict: COMMENT.

🤖 Automated review by Centaur · DAR-186

@ethenotethan ethenotethan left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated Code Review — Layr-Labs/d-inference#

Verdict: COMMENT

Security — ✅ No issues found

Performance — 1 finding(s)

  • 🔵 [INFO] provider-swift/Tests/ProviderCoreTests/Gemma4DecodeProfileTests.swift:93 — Array grows in loop without pre-allocation despite known capacity
    • Suggestion: Move reserveCapacity(64) before the loop or initialize with capacity: var generated = [Int](reserving: 64)

Type_diligence — ✅ No issues found

Additive_complexity — ✅ No issues found

1 finding(s) total, 0 blocking. Verdict: COMMENT.

🤖 Automated review by Centaur · DAR-186

@Gajesh2007 Gajesh2007 force-pushed the feat/gemma4-vmlx-decode-port branch from 182b719 to e0d0c75 Compare June 26, 2026 06:35
@Gajesh2007 Gajesh2007 changed the base branch from feat/add-libs-mlx-submodule to master June 26, 2026 06:35
@github-actions

github-actions Bot commented Jun 26, 2026

Copy link
Copy Markdown

This PR introduces a performance-oriented B=1 greedy fast-path for single exclusive inference requests; the security posture is neutral-to-slightly-positive with one new attack surface item worth tracking.


Trust Boundaries Touched

  • TB-002 (Coordinator → Provider WebSocket) — billing-safe admission accounting, KV budget
  • TB-003 (Provider Operator vs Process) — new GPU execution path, cancellation/teardown
  • TB-007 (Provider Inference Engine) — primary surface: new code path for token generation, KV reservation, cancellation, model-teardown fencing

Threat Analysis

T-007 / T-027 — Model weight substitution / manipulated outputs
ℹ️ Neutral. The fast path calls ModelContainer.generate — the same container the VLM path already uses. It doesn't bypass weight loading, model identity, or the hash advertised to the coordinator. No change to the weight-hash advertisement or coordinator enforcement (SEC-007 remains open).

T-008 — Provider sends plaintext SSE chunks on encryption failure (SEC-016)
ℹ️ Neutral. The fast path produces GenerationEvent values that flow through the same downstream encryption layer the batched engine uses. The truncated diff doesn't show runGreedyFastPath's terminal event handling in full, but the continuation.yield(.error(...)) rejection patterns are consistent with fail-closed semantics. Reviewers should confirm that runGreedyFastPath (the body not shown in the truncated portion of BatchScheduler+B1FastPath.swift) emits .error — not a plaintext data chunk — on any encryption or generation failure, and does not populate a plaintext data field (mirroring the existing SEC-016 risk in ProviderLoop.swift:436–451).

T-028 — Residual inference data in GPU memory
ℹ️ Neutral, with a minor note. stopCurrentEngine now calls await waitForFastPathTasks() before MLX.Memory.clearCache() (BatchScheduler.swift, lines ~414–428). This fence is correct and security-relevant: without it, a fast-path task still holding a live GPU KV cache could race the clearCache() call, leaving residual prompt tensors in an inconsistent state. The fence tightens teardown relative to the status quo. The underlying open issue (GPU buffers not explicitly zeroed between requests at the Metal level) is unchanged.

T-041 — Cross-tenant prefix-cache sharing / TTFT timing oracle
ℹ️ Neutral. The fast path is Gemma-only and temperature-0 (greedy), so prefix-cache interactions are scoped. KV reservation/release goes through reserveKVForRequest / releaseRequestResources — the same accounting path — so the budget tracking that guards cache capacity is preserved. The fast path does not introduce new sharing semantics.

T-010 — Cancellation not propagated to inference engine
✅ Strengthens. cancel(requestId:) now checks cancelFastPathTask first and returns early (BatchScheduler+Submit.swift:491–496). cancelAll() calls cancelAllFastPathTasks() before the engine abort (BatchScheduler+Submit.swift:530–535). Fast-path tasks are explicitly torn down in stopCurrentEngine via waitForFastPathTasks() (BatchScheduler.swift:414–428). This is a meaningful improvement in cancellation coverage for the new execution path.


New Attack Surface Not Covered by an Existing Threat

Env-flag side-channel for operational state inference (BatchScheduler+B1FastPath.swift:45–48)

b1GreedyFastPathEnabled() reads ProcessInfo.processInfo.environment on every eligibility check ("cheap" per the comment). This is fine for a provider-operator–controlled process. However, the rejection path for non-fast-path requests while a fast-path task is active surfaces a new observable timing/error channel to the consumer:

"token_budget_exhausted: a single-request fast path is active; retry shortly"

(BatchScheduler+Submit.swift:107–111 and BatchScheduler+Submit.swift:300–306)

This error string leaks to the consumer that the provider is currently processing exactly one exclusive greedy request. Under the threat model (ADV-002, malicious consumer), a consumer who can probe retry behavior could infer:

  • Whether the fast path is enabled on this provider (feature fingerprinting).
  • Whether another consumer is actively mid-inference (presence oracle, similar in kind to the TTFT timing channel in T-041).

Recommendation: use a generic retryable error string (e.g. the existing "token_budget_exhausted" without the explanatory suffix) or the same opaque message the KV-budget rejection uses. The informational suffix has no consumer-facing value and narrows the ambiguity consumers currently have about provider internal state.


Open Findings Resolved

None. SEC-007, SEC-016, SEC-035 remain open and are untouched by this PR.


Summary

The GPU teardown fence (waitForFastPathTasks before clearCache) and the explicit fast-path cancellation wiring are the two concrete security improvements here. The one actionable item is the overly descriptive rejection error string that leaks fast-path execution state to consumers — suggest trimming it to a generic token_budget_exhausted before merge.


🔐 Threat model: docs/threat-model.yaml · Updates on each push to this PR

@ethenotethan ethenotethan left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated Code Review — Layr-Labs/d-inference#

Verdict: COMMENT

Security — ✅ No issues found

Performance — 1 finding(s)

  • 🔵 [INFO] provider-swift/Tests/ProviderCoreTests/Gemma4DecodeProfileTests.swift:94 — Array grows in loop without pre-allocation despite reserveCapacity
    • Suggestion: Move reserveCapacity(64) before the loop or use a fixed-size array since the loop count is known (64 iterations)

Type_diligence — ✅ No issues found

Additive_complexity — ✅ No issues found

1 finding(s) total, 0 blocking. Verdict: COMMENT.

🤖 Automated review by Centaur · DAR-186

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e0d0c75412

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread libs/mlx-swift Outdated
@@ -1 +1 @@
Subproject commit 3c50ad693a7a3fbfbcc1d7ddf834a37e9ee8cf14
Subproject commit 3b6af6e26b009995e127c9c1921f185c54aed379

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep mlx-swift compatible with the pinned Swift toolchain

This gitlink now points the provider's local ../libs/mlx-swift dependency at a commit whose Package.swift declares // swift-tools-version: 6.3;(experimentalCGen), while the repo still documents and pins Swift 6.1 in mise.toml and provider-swift/Package.swift. Developers following the documented mise install && make provider-build path, and any CI/release runner with Swift <6.3, will fail during package loading with a minimum tools-version error before any provider code compiles; either keep this submodule on a Swift-6.1-compatible commit or update the repo/toolchain/workflows together.

Useful? React with 👍 / 👎.

Comment thread libs/mlx-swift-lm Outdated
@@ -1 +1 @@
Subproject commit b1df0f22424f94005685baf79192ff08e9c43eb9
Subproject commit 7ce9e45b40add0d6e66bc69aeed6bfb896009e04

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve Gemma4's bidirectional vision mask

For Gemma4 image_url requests, this submodule bump removes the VLM text path that built a visualTokenMask and passed it as imageTokenMask so use_bidirectional_attention == "vision" could overlay bidirectional attention within image-token spans; the new Gemma4 path has no imageTokenMask/useBidirectionalAttention handling and always builds ordinary causal masks. That means earlier soft image tokens cannot attend to later patches during prefill, corrupting the visual features the language model consumes even though image inputs are still accepted.

Useful? React with 👍 / 👎.

Comment thread libs/mlx-swift-lm Outdated
@@ -1 +1 @@
Subproject commit b1df0f22424f94005685baf79192ff08e9c43eb9
Subproject commit 7ce9e45b40add0d6e66bc69aeed6bfb896009e04

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep Gemma4 visual delimiter tokens

For Gemma4 image prompts, the previous processor expanded each <|image|> placeholder as boi + image_token * count + eoi, matching the model config's boi_token_id/eoi_token_id; this submodule bump replaces that with only repeated image tokens and the new config no longer reads those delimiter ids. Requests that include image_url are therefore still routed to Gemma4 but the visual span is missing the boundary tokens the model was trained to see, so image understanding can degrade or fail even when the feature count matches.

Useful? React with 👍 / 👎.

@Gajesh2007

Copy link
Copy Markdown
Member Author

Provider-stack Gemma4 benchmark

Ran a release executable depending on ProviderCore and the PR's local packages, using the real BatchScheduler path (loadModel + submitTokenized) with Gemma-4-26B-A4B 8-bit.

Metric Value
Prompt tokens 973
Requested output 512 tokens
TTFT (submit → first chunk) 2180.0 ms
Total request time 10586.8 ms
Completion chunks 512
Decode window 8389.6 ms
Measured decode TPS 60.91 tok/s
Scheduler-reported TPS 60.90 tok/s

Notes:

  • This is the full d-inference provider stack, not the raw model loop.
  • Raw release model loop remains ~79.5 tok/s; provider stack overhead + streaming + ~1k prompt KV growth bring end-to-end decode to ~61 tok/s for this long prompt.
  • Output was coherent long-form technical prose; sample saved at /tmp/dbval/provider_gemma4_stack_output.txt.

Excerpt:

Architectural Optimization of Sparse Mixture-of-Experts Inference on Apple Silicon\n\nThe deployment of Sparse Mixture-of-Experts (SMoE) models, such as Mixtral 8x7B, presents a unique computational paradox on Unified Memory Architecture (UMA) systems like Apple Silicon...

@ethenotethan ethenotethan left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated Code Review — Layr-Labs/d-inference#

Verdict: COMMENT

Security — ✅ No issues found

Performance — 1 finding(s)

  • 🔵 [INFO] provider-swift/Tests/ProviderCoreTests/Gemma4DecodeProfileTests.swift:94 — Array grows in loop without pre-allocation despite reserveCapacity
    • Suggestion: Move reserveCapacity(64) before the loop or use a fixed-size array since the loop count is known (64 iterations)

Type_diligence — ✅ No issues found

Additive_complexity — ✅ No issues found

1 finding(s) total, 0 blocking. Verdict: COMMENT.

🤖 Automated review by Centaur · DAR-186

@ethenotethan ethenotethan left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated Code Review — Layr-Labs/d-inference#

Verdict: REQUEST_CHANGES

Security — ✅ No issues found

Performance — 2 finding(s) (1 blocking)

  • 🔵 [INFO] provider-swift/Tests/ProviderCoreTests/ContinuousBatchingLiveTests.swift:235 — Array.prefix(while:) creates new array copies in hot path
    • Suggestion: Use iterators or indices to avoid copying arrays when checking for EOS tokens and computing prefix matches
  • 🟡 [MEDIUM] provider-swift/Tests/ProviderCoreTests/Gemma4DecodeProfileTests.swift:94-95 — Array grows in loop without pre-allocation despite reserveCapacity
    • Suggestion: Use generated.append(token) directly since capacity is already reserved, or pre-allocate with Array(repeating:count:) and use index-based assignment

Type_diligence — ✅ No issues found

Additive_complexity — ✅ No issues found

2 finding(s) total, 1 blocking. Verdict: REQUEST_CHANGES.

🤖 Automated review by Centaur · DAR-186

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e0b67c5f45

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread libs/mlx-swift-lm Outdated
@@ -1 +1 @@
Subproject commit b1df0f22424f94005685baf79192ff08e9c43eb9
Subproject commit e49e0ff0e43aee0e9e554e1a9120f62f00bcf09c

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reserve Gemma video KV for every sampled frame

With this submodule bump, Gemma4 video preprocessing now samples up to 32 frames and expands each <|video|> to config.imageSeqLength soft tokens per frame, but the provider still reserves a flat 4,096 KV tokens per video in VLMRequestInference.projectedKVTokens (visionTokensPerVideo and the .videoURL branch). For Gemma configs with ~256/280 soft tokens per frame, a full 32-frame clip needs roughly 8–9k vision tokens before text/output, so concurrent video requests can pass the 90% KV cap with about half their actual prompt KV unreserved and then OOM or destabilize the provider during generation.

Useful? React with 👍 / 👎.

)
)
func gemma4VLMLongContextMixedB3() async throws {
try ensureMetallibAvailable()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Stop after missing metallib in live tests

When these new Gemma live tests are enabled on a machine without a colocated mlx.metallib, ensureMetallibAvailable() only records an issue and returns; execution then immediately proceeds into MLX.GPU.set and model loading. LiveInferenceFixtures.ensureMetallibColocated() documents that callers should skip instead of reaching the first GPU call, and the new Gemma4DecodeProfileTests path does return on nil, so these added ContinuousBatching tests can still abort a filtered benchmark run rather than exiting cleanly.

Useful? React with 👍 / 👎.

@ethenotethan ethenotethan left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated Code Review — Layr-Labs/d-inference#

Verdict: COMMENT

Security — ✅ No issues found

Performance — ✅ No issues found

Type_diligence — ✅ No issues found

Additive_complexity — ✅ No issues found

✅ All four passes clean. No issues found.

🤖 Automated review by Centaur · DAR-186

@Gajesh2007

Copy link
Copy Markdown
Member Author

Provider B=1 fast path benchmark

Added an env-gated B=1 greedy fast path (DARKBLOOM_B1_GREEDY_FAST_PATH=1 or DARKBLOOM_GEMMA_B1_FAST_PATH=1) that bypasses continuous batching for one exclusive greedy request and uses the proven single-stream ModelContainer.generate path.

Corrected PR pins during this run:

  • libs/mlx-swift: e20ea3d
  • libs/mlx-swift-lm: 791c17f

Release provider-stack result (Gemma-4-26B-A4B 8-bit)

Mode Decode TPS Decode window Prompt Output
Batched engine + fast BatchRotatingKV ~63 tok/s ~8.1s 973 512
B=1 greedy fast path 76.48 tok/s 6694.8 ms 973 512

Other details:

  • TTFT: 811.2 ms
  • Total: 7509.0 ms
  • Scheduler-reported TPS: 76.45 tok/s
  • Output remained coherent.
  • Log: /tmp/dbval/provider_gemma4_b1fast_clean_rerun.log

This closes the B=1 provider path gap to the raw model loop (~75-79 tok/s) while keeping the normal batched engine path as the default unless explicitly enabled by env.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6819a45854

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +116 to +120
let kvOutcome = await reserveKVForRequest(
requestId: id,
requestTokens: requestBudget,
reservationTokens: requestBudget,
restorePlanned: false

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reserve fp16 KV for the fast path

When the B=1 path is enabled on a KV-quantized model, this branch reserves through reserveKVForRequest, which charges kvBytesPerToken for the batched/quantized cache. The request then runs through ModelContainer.generate, the same non-batched path that reserveVisionRequest explicitly accounts with fp16KVBytesPerToken; as a result the fast-path request can hold substantially more KV than it reserved and let later admissions pass the 90% cap, causing OOMs under concurrent traffic. Use the fp16 byte reservation or disable this path when KV quantization is active.

Useful? React with 👍 / 👎.

Task { await scheduler.cancel(requestId: id) }
}
}
return stream

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep fast-path requests in the concurrency gate

This returns before planner.admit, so the planner never counts the direct ModelContainer.generate request as active. While that request is decoding, a later non-fast-path submit can still be queued and scheduled because BatchQueuePlanner sees active.count == 0 even when maxConcurrentRequests == 1, which breaks the branch's “single exclusive” assumption and can overlap engine work with the fast path. Either register the fast-path request with the planner or block new engine admissions until it finishes.

Useful? React with 👍 / 👎.

Comment on lines +422 to +423
cancelAllFastPathTasks()
fastPathTasks.removeAll()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Wait for fast-path tasks before clearing MLX state

On unload/reload or a liveness self-restart during a fast-path generation, this only cancels the task and immediately drops the handle before teardown continues to nil the container and clear MLX memory. Swift cancellation is cooperative, so the task may still be inside container.generate or waiting for its next event while the old model/KV memory is being released, risking crashes or use-after-teardown; keep the handles and await/fence those tasks before clearing model resources.

Useful? React with 👍 / 👎.

let lmInput = LMInput(tokens: MLXArray(promptTokens))
// temperature 0 ⇒ ArgMaxSampler. topP/topK/penalties left at their
// defaults are inert under greedy. maxTokens bounds the decode.
let params = GenerateParameters(maxTokens: maxTokens, temperature: 0)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Pass effective EOS tokens through the fast path

When the fast path is enabled for models whose stop tokens are augmented at load time, such as GPT-OSS/Harmony via effectiveEOSTokenIds, this direct ModelContainer.generate call does not receive the same EOS set that the batched Scheduler is built with. Those requests can therefore continue past model-specific return/call/end tokens until maxTokens or leak control tokens, while the engine path would stop correctly; thread the effective EOS set into the fast path or restrict it to models whose container config already contains the full stop set.

Useful? React with 👍 / 👎.

Comment on lines +183 to +187
case .chunk(let text):
if !sawFirstToken {
sawFirstToken = true
await scheduler.recordFirstToken(requestId: id, at: .now)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Count streamed tokens before cancellation

If a client disconnects after receiving chunks but before ModelContainer.generate emits its terminal .info, the cancellation path breaks out with completionTokens still at 0 because chunk handling only records first-token timing. recordFinish then reports/bills zero completion tokens even though text was already delivered; update progress as chunks arrive or otherwise preserve delivered token counts before honoring cancellation.

Useful? React with 👍 / 👎.

Comment on lines +99 to +103
topK: topK,
seed: seed,
maxTokens: maxTokens,
cacheScope: cacheScope
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve context-window checks on the fast path

When the fast path is enabled, eligibility is decided without the prompt length or maxContextLength, and this branch returns before the planner's maxTokensPerBatch/context-window rejection runs. A greedy request whose prompt is longer than the model context but still fits the memory token budget can therefore be sent directly to ModelContainer.generate instead of producing the deterministic context error the engine path emits, risking runtime failures or malformed output for over-context prompts.

Useful? React with 👍 / 👎.

@ethenotethan ethenotethan left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated Code Review — Layr-Labs/d-inference#

Verdict: REQUEST_CHANGES

Security — 1 finding(s)

  • 🔵 [INFO] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:47-49 — Environment variable values read without validation
    • Suggestion: Validate environment variable values are exactly "1" rather than using == comparison which could match unintended values

Performance — 2 finding(s) (1 blocking)

  • 🟡 [MEDIUM] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:139 — Synchronous container.generate call blocks the actor on hot path
    • Suggestion: Move the container.generate call outside the actor context or use async/await properly to avoid blocking the scheduler actor
  • 🔵 [INFO] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:155-175 — Unbounded string accumulation in generation loop
    • Suggestion: Pre-allocate string buffer capacity or use streaming approach to avoid repeated string reallocations during text generation

Type_diligence — ✅ No issues found

Additive_complexity — 1 finding(s) (1 blocking)

  • 🟡 [MEDIUM] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:1-279 — New 279-line fast path implementation duplicates existing functionality
    • Suggestion: Consider if this optimization justifies the added complexity. The fast path reimplements generation logic that already exists in the batched engine, adding ~300 lines of code and new state tracking for a performance optimization that only applies to single greedy requests.

4 finding(s) total, 2 blocking. Verdict: REQUEST_CHANGES.

🤖 Automated review by Centaur · DAR-186

@Gajesh2007

Copy link
Copy Markdown
Member Author

Review fixes: media KV reservation + live metallib guards

Addressed two provider-side review findings:

  1. Reserve Gemma video KV for every sampled frame

    • VLMRequestInference now reserves for maxVideoFramesSampled (32) * visionTokensPerImage instead of a flat 4096 tokens.
    • This safely covers Gemma4's video path where each sampled frame expands into a visual soft-token block.
    • Added VLMCapReservationTests.projectedKVTokensReservesEverySampledVideoFrame.
  2. Stop after missing metallib in live tests

    • ContinuousBatchingLiveTests.ensureMetallibAvailable() now returns false on missing metallib, and all call sites return before GPU setup/model loading.

Verification:

  • swift build --build-tests: passed.
  • swift test --filter VLMCapReservationTests: 14/14 passed.
  • swift test --filter VLMRequestInferenceTests: 44/44 passed.

@ethenotethan ethenotethan left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated Code Review — Layr-Labs/d-inference#

Verdict: REQUEST_CHANGES

Security — 1 finding(s)

  • 🔵 [INFO] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:47-49 — Environment variable values read without validation
    • Suggestion: Validate environment variable values are exactly "1" rather than using string equality check that could match unintended values

Performance — 2 finding(s) (1 blocking)

  • 🟡 [MEDIUM] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:158 — Synchronous container.generate call blocks the actor on hot path
    • Suggestion: Use async/await pattern or move the blocking operation off the actor's execution context
  • 🔵 [INFO] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:225 — Dictionary grows without bounds as fast path tasks accumulate
    • Suggestion: Add periodic cleanup or size limits to fastPathTasks dictionary to prevent memory growth

Type_diligence — ✅ No issues found

Additive_complexity — 3 finding(s) (1 blocking)

  • 🟡 [MEDIUM] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:1-279 — New 279-line fast path adds complexity for marginal performance gain
    • Suggestion: Consider if the 20% TPS improvement justifies adding a parallel execution path with its own task tracking, eligibility logic, and test overrides. The batched engine already handles B=1 cases.
  • 🔵 [INFO] provider-swift/Sources/ProviderCore/Inference/BatchScheduler.swift:264-278 — Three new state fields for fast path tracking
    • Suggestion: The fastPathTasks dictionary and test override field add state management burden. Consider if the performance benefit outweighs the maintenance cost of parallel execution paths.
  • 🔵 [INFO] provider-swift/Tests/ProviderCoreTests/B1GreedyFastPathTests.swift:1-294 — 294-line test file for eligibility policy that could be simpler
    • Suggestion: The eligibility logic is a pure function with 9 parameters. Consider if this level of parameterization and testing is proportional to the feature's complexity.

6 finding(s) total, 2 blocking. Verdict: REQUEST_CHANGES.

🤖 Automated review by Centaur · DAR-186

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8e4f59a49d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

case .info(let info):
reportedPrompt = info.promptTokenCount
completionTokens = info.generationTokenCount
case .toolCall:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Forward fast-path tool calls

When the env-gated fast path handles a default-greedy request that includes tools, this branch drops the .toolCall events produced by ModelContainer.generate. MultiModelBatchSchedulerEngine.streamChatCompletion only builds downstream tool calls from raw .chunk text via BatchedToolStreamHandler, and submitTokenized does not pass a “has tools” flag into eligibility, so tool requests can enter this path and complete with no tool call/content. Either disqualify tool requests from the fast path or translate these events into the caller’s expected stream.

Useful? React with 👍 / 👎.

@ethenotethan ethenotethan left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated Code Review — Layr-Labs/d-inference#

Verdict: REQUEST_CHANGES

Security — 1 finding(s)

  • 🔵 [INFO] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:47-49 — Environment variable values not validated
    • Suggestion: Validate environment variable values are exactly "1" rather than using string equality check that could match unintended values

Performance — 3 finding(s) (1 blocking)

  • 🟡 [MEDIUM] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:139 — Synchronous container.generate call blocks the actor on hot path
    • Suggestion: Use async/await pattern or move the blocking operation off the actor's execution context
  • 🔵 [INFO] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:155 — MLXArray(promptTokens) creates unbounded host-side copy
    • Suggestion: Pre-allocate or stream tokens to avoid large memory copies for very long prompts
  • 🔵 [INFO] provider-swift/Sources/ProviderCore/Inference/VLMRequestInference.swift:133 — Multiplication computed repeatedly in hot path
    • Suggestion: Pre-compute VLMRequestInference.maxVideoFramesSampled * VLMRequestInference.visionTokensPerImage as a static constant

Type_diligence — ✅ No issues found

Additive_complexity — 2 finding(s) (1 blocking)

  • 🟡 [MEDIUM] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:1-279 — New 279-line fast path implementation duplicates existing functionality
    • Suggestion: Consider if this optimization justifies the added complexity. The fast path reimplements generation logic that already exists in the batched engine, adding ~300 lines of code and a parallel execution path that must be maintained alongside the existing engine. The performance gain (75 vs 63 TPS) may not justify doubling the inference code paths.
  • 🔵 [INFO] provider-swift/Sources/ProviderCore/Inference/BatchScheduler.swift:264-278 — Additional state tracking for parallel execution path
    • Suggestion: The fastPathTasks dictionary and _forceB1FastPathForTest flag add state management complexity. Consider if the performance benefit justifies maintaining two separate execution paths with their own lifecycle management.

6 finding(s) total, 2 blocking. Verdict: REQUEST_CHANGES.

🤖 Automated review by Centaur · DAR-186

@Gajesh2007

Copy link
Copy Markdown
Member Author

B=1 fast-path safety hardening

Addressed the B1 fast-path review findings conservatively:

  • Fast path is now ineligible when KV quantization is active.
  • Fast path is Gemma-family only.
  • Engine-path submissions are blocked while a fast-path task is active, preserving the single-exclusive assumption.
  • stopCurrentEngine now cancels and awaits fast-path tasks before model/MLX teardown.
  • Context-window eligibility added.
  • Tool-call events are not silently dropped: unexpected .toolCall fails the fast path.
  • Cancellation accounting now preserves delivered chunk count instead of billing zero after streamed output.
  • MultiModelBatchSchedulerEngine threads allowFastPath: toolHandler == nil so tool requests stay on the engine path.

Verification:

  • swift build --build-tests: passed.
  • swift test --filter B1GreedyFastPathTests: 15/15 passed.
  • Release Gemma4 provider fast-path benchmark after hardening: 76.06 tok/s (973 prompt / 512 output), coherent output.
    • Log: /tmp/dbval/provider_gemma4_b1fast_safety.log

@Gajesh2007 Gajesh2007 merged commit 9ee24d9 into master Jun 26, 2026
13 of 14 checks passed

@ethenotethan ethenotethan left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated Code Review — Layr-Labs/d-inference#

Verdict: REQUEST_CHANGES

Security — ✅ No issues found

Performance — 2 finding(s) (1 blocking)

  • 🟡 [MEDIUM] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:206 — Synchronous container.generate call blocks actor on hot path
    • Suggestion: Use async/await pattern or move the blocking call off the actor's execution context
  • 🔵 [INFO] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:370 — Task spawned without explicit cancellation handling in parent context
    • Suggestion: Consider using TaskGroup or structured concurrency to ensure proper task lifecycle management

Type_diligence — ✅ No issues found

Additive_complexity — 1 finding(s) (1 blocking)

  • 🔴 [CRITICAL] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:1-370 — 370-line fast path implementation duplicates engine functionality with single consumer
    • Suggestion: Consider if this optimization justifies the maintenance burden of duplicating request lifecycle, KV management, and error handling that already exists in the batched engine

3 finding(s) total, 2 blocking. Verdict: REQUEST_CHANGES.

🤖 Automated review by Centaur · DAR-186

@ethenotethan ethenotethan left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated Code Review — Layr-Labs/d-inference#

Verdict: REQUEST_CHANGES

Security — ✅ No issues found

Performance — 3 finding(s) (1 blocking)

  • 🟡 [MEDIUM] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:206 — Synchronous container.generate call blocks actor on hot path
    • Suggestion: Use async/await pattern or move the blocking call off the actor's execution context
  • 🔵 [INFO] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:190 — MLXArray(promptTokens) creates unbounded allocation in hot path
    • Suggestion: Pre-allocate or reuse MLXArray instances, especially for large prompt token arrays
  • 🔵 [INFO] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:340-350 — Task spawned without proper resource limits or cancellation timeout
    • Suggestion: Add timeout or resource limits to the spawned Task to prevent unbounded resource usage

Type_diligence — ✅ No issues found

Additive_complexity — 1 finding(s) (1 blocking)

  • 🔴 [CRITICAL] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:1-370 — 370-line fast path implementation with single consumer adds significant complexity
    • Suggestion: Consider if this optimization justifies the maintenance burden - the fast path duplicates engine functionality, adds new failure modes, and requires careful coordination with the existing batched engine

4 finding(s) total, 2 blocking. Verdict: REQUEST_CHANGES.

🤖 Automated review by Centaur · DAR-186

await scheduler.recordAdmission(requestId: id, at: .now)

let genStream: AsyncStream<Generation>
do {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 [MEDIUM] ⚡ Synchronous container.generate call blocks actor on hot path

💡 Suggestion: Use async/await pattern or move the blocking call off the actor's execution context

📊 Score: 3×4 = 12 · Category: blocking_io

promptTokens: [Int],
maxTokens: Int,
continuation: AsyncStream<GenerationEvent>.Continuation
) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 [INFO] ⚡ MLXArray(promptTokens) creates unbounded allocation in hot path

💡 Suggestion: Pre-allocate or reuse MLXArray instances, especially for large prompt token arrays

📊 Score: 2×3 = 6 · Category: unbounded_allocations

Comment on lines +340 to +350
/// The handles are snapshotted first so a self-removing `clearFastPathTask`
/// during the awaits cannot mutate the collection being iterated. The await
/// suspends the actor so those actor-isolated callbacks make progress; no NEW
/// fast path can start meanwhile because `stopCurrentEngine` has already
/// nil'd `engine` (every submit path short-circuits on a nil engine).
/// Idempotent: a no-op when nothing is in flight.
func waitForFastPathTasks() async {
let inflight = Array(fastPathTasks.values)
guard !inflight.isEmpty else { return }
for task in inflight { task.cancel() }
for task in inflight { await task.value }

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 [INFO] ⚡ Task spawned without proper resource limits or cancellation timeout

💡 Suggestion: Add timeout or resource limits to the spawned Task to prevent unbounded resource usage

📊 Score: 2×3 = 6 · Category: unbounded_goroutines

Comment on lines +1 to +370
// Copyright © 2026 Eigen Labs.
//
// BatchScheduler B=1 greedy fast path: an env-gated bypass of the
// continuous-batching engine for a single exclusive, greedy (temperature == 0)
// request.
//
// The batched engine carries continuous-batching overhead a single-row decode
// does not need: batch tensor (de)allocation per step, the scheduler step loop,
// the output collector, and cross-thread `RequestOutput` streaming. On Gemma-4
// that overhead shows up as ~63 TPS through `BatchedEngine` vs ~75 TPS for the
// raw single-sequence loop (see `Tests/.../Gemma4DecodeProfileTests.swift`).
//
// When exactly one request is in flight and it is pure greedy, we run that
// single-sequence decode through `ModelContainer.generate` — the SAME
// concurrency-safe path the VLM media path (`VLMRequestInference`) already uses
// alongside the engine. `ModelContainer.generate` holds the container
// exclusively only for the prefill, then streams the decode (asyncEval
// pipelined) on its own task. We translate its `Generation` events to our
// `GenerationEvent` stream.
//
// Safety posture:
// * OFF by default; opt in with an env flag.
// * Conservative gate — anything that isn't a single exclusive greedy text
// request falls back to the batched engine, so the engine path's behavior
// is never altered.
// * KV byte budget is still reserved/released, and bridge bookkeeping
// (heartbeats, decode/prefill EWMA, billing-safe usage) is preserved via
// the SAME `recordAdmission` / `recordFirstToken` / `recordFinish` methods
// the engine bridge uses.
// * The in-flight task is tracked so `cancel` / `cancelAll` /
// `stopCurrentEngine` tear it down deterministically.

import Foundation
import MLX
import MLXLMCommon

extension BatchScheduler {

// MARK: - Env gate

/// True when the operator opted into the B=1 greedy fast path. Two flags are
/// accepted: `DARKBLOOM_B1_GREEDY_FAST_PATH` (generic) and
/// `DARKBLOOM_GEMMA_B1_FAST_PATH` (Gemma-targeted alias). Either set to `"1"`
/// enables it. Read per-call (cheap) so tests can toggle it via the
/// environment without restarting the scheduler.
static func b1GreedyFastPathEnabled() -> Bool {
let env = ProcessInfo.processInfo.environment
return env["DARKBLOOM_B1_GREEDY_FAST_PATH"] == "1"
|| env["DARKBLOOM_GEMMA_B1_FAST_PATH"] == "1"
}

// MARK: - Eligibility

/// Whether this request can take the single-exclusive greedy fast path.
///
/// MUST be evaluated BEFORE the request's own bridge is inserted into
/// `activeBridges` — the exclusivity check reads `activeBridges.count`.
/// Every condition is conservative: a miss simply defers to the batched
/// engine, so this can only shrink the set of requests the fast path serves,
/// never change the engine path's correctness. The decision itself is a pure
/// function (`b1FastPathEligiblePure`) so it can be unit-tested exhaustively
/// without a loaded model.
func b1FastPathEligible(
temperature: Float,
topP: Float?,
topK: Int?,
seed: UInt64?,
promptTokenCount: Int,
maxTokens: Int,
cacheScope: String,
allowFastPath: Bool
) -> Bool {
Self.b1FastPathEligiblePure(
// Test override wins when set; otherwise consult the env flags.
enabled: _forceB1FastPathForTest ?? Self.b1GreedyFastPathEnabled(),
allowFastPath: allowFastPath,
modelId: modelId,
kvQuantEnabled: kvQuantEnabled,
temperature: temperature,
topP: topP,
topK: topK,
seed: seed,
promptTokenCount: promptTokenCount,
maxTokens: maxTokens,
maxContextLength: maxContextLength,
cacheScope: cacheScope,
activeBridgeCount: activeBridges.count,
pendingRequestCount: pendingRequestCount,
fastPathActive: !fastPathTasks.isEmpty,
hasContainer: modelContainer != nil
)
}

/// Pure eligibility policy for the B=1 greedy fast path. No actor state — all
/// inputs are parameters — so it is fully unit-testable. Order is irrelevant
/// to the result (all conditions must hold), but kept cheapest-first.
static func b1FastPathEligiblePure(
enabled: Bool,
allowFastPath: Bool,
modelId: String,
kvQuantEnabled: Bool,
temperature: Float,
topP: Float?,
topK: Int?,
seed: UInt64?,
promptTokenCount: Int,
maxTokens: Int,
maxContextLength: Int,
cacheScope: String,
activeBridgeCount: Int,
pendingRequestCount: Int,
fastPathActive: Bool,
hasContainer: Bool
) -> Bool {
guard enabled else { return false }
// Caller opt-in. The engine consumer clears this for tool-bearing
// requests: the fast path is greedy text-only and cannot reproduce the
// engine's raw-text tool-call contract (`container.generate` may parse a
// call into a `.toolCall` event, CONSUMING the text — see the runner's
// `.toolCall` handling), so tool requests must stay on the engine path.
guard allowFastPath else { return false }
// Family gate: only Gemma-4 is profiled + validated for this bypass, and
// its greedy / EOS behavior is only known-good there. Every other family
// (different EOS sets, tool/stop conventions) defers to the batched engine.
guard modelId.lowercased().contains("gemma") else { return false }
// KV quantization: batched-engine admission reserves at the REDUCED
// (quantized) per-token KV rate, but `ModelContainer.generate` allocates a
// full fp16 KV cache. A fast-path reservation sized at the quantized rate
// would under-count ~2x and risk a unified-memory OOM, so whenever KV
// quant is active we defer to the engine (which owns the quantized cache).
guard !kvQuantEnabled else { return false }
// Pure greedy only: temperature 0 and no nucleus / top-k truncation.
// (minP / repetition / presence / frequency penalties are not part of
// the tokenized submit surface, so temperature + topP + topK fully
// characterize "greedy" here.)
guard temperature == 0 else { return false }
guard topP == nil || topP == 0 else { return false }
guard topK == nil || topK == 0 else { return false }
// A seed implies sampling intent; greedy ignores it, but treat its
// presence as "not the simple greedy case" and defer to the engine.
guard seed == nil else { return false }
guard maxTokens > 0 else { return false }
// Need a real prompt to prefill (a 0-token prompt has no greedy seed).
guard promptTokenCount > 0 else { return false }
// Context window: the fast path runs a cold prefill of the WHOLE prompt
// and decodes up to `maxTokens` against one fresh cache. If that span
// exceeds the model's context window, defer to the engine path — it
// enforces context limits and emits the precise context-overflow
// rejection. `maxContextLength == 0` ⇒ context unknown ⇒ skip this gate
// (the remaining gates, incl. the token-budget guard upstream, still apply).
if maxContextLength > 0 {
guard promptTokenCount + maxTokens <= maxContextLength else { return false }
}
// No prefix-cache scope: the fast path runs a cold prefill against a
// fresh cache and does not participate in the checkpoint / engine prefix
// tiers, so a scoped request keeps the engine path to retain cache reuse.
guard cacheScope.isEmpty else { return false }
// Exclusive: no other in-flight or queued work. Concurrent batched work
// would defeat the single-row assumption (shared GPU + KV headroom).
guard activeBridgeCount == 0 else { return false }
guard pendingRequestCount == 0 else { return false }
// And no OTHER fast-path task already running (explicit single-row gate;
// belt-and-suspenders with the activeBridgeCount check, since a running
// fast path also holds a bridge).
guard !fastPathActive else { return false }
// Need a live container to generate against.
guard hasContainer else { return false }
return true
}

// MARK: - Runner

/// Drive a single greedy request through `ModelContainer.generate` and
/// translate its `Generation` events onto the scheduler's `GenerationEvent`
/// stream. Mirrors `runBridge`'s lifecycle (admission / first-token / finish
/// bookkeeping and terminal `.info` / `.error` mapping) but sources tokens
/// from the single-sequence generator instead of `engine.core.streamOutputs`.
///
/// The spawned task is tracked in `fastPathTasks[id]` so `cancel` /
/// `cancelAll` / `stopCurrentEngine` can tear it down; it removes its own
/// handle on completion. The caller (`submitTokenized`) is responsible for
/// having inserted the bridge and reserved KV before this runs, and for
/// wiring `continuation.onTermination`.
func runGreedyFastPath(
requestId id: String,
container: ModelContainer,
promptTokens: [Int],
maxTokens: Int,
continuation: AsyncStream<GenerationEvent>.Continuation
) {
let scheduler = self
let promptCount = promptTokens.count
let task = Task {
// Token-only input (no media). `MLXArray(promptTokens)` is a cheap
// host-side copy; the GPU work happens inside `generate`.
let lmInput = LMInput(tokens: MLXArray(promptTokens))
// temperature 0 ⇒ ArgMaxSampler. topP/topK/penalties left at their
// defaults are inert under greedy. maxTokens bounds the decode.
let params = GenerateParameters(maxTokens: maxTokens, temperature: 0)

// Admission ≈ now: prefill is about to begin. Drives the
// pending-timeout predicate and starts the prefill-EWMA window.
await scheduler.recordAdmission(requestId: id, at: .now)

let genStream: AsyncStream<Generation>
do {
genStream = try await container.generate(input: lmInput, parameters: params)
} catch {
_ = await scheduler.recordFinish(
requestId: id, promptTokens: promptCount,
completionTokens: 0, success: false)
continuation.yield(.error(
"fast path generation failed: \(error.localizedDescription)"))
continuation.finish()
await scheduler.clearFastPathTask(id)
return
}

var sawFirstToken = false
// Count every streamed chunk as >= 1 completion token. The terminal
// `.info` carries the EXACT generation count, but it only arrives on a
// clean finish; on cancellation the loop breaks before it, so without
// this running tally `recordFinish` would settle at 0 completion tokens
// and the coordinator would bill $0 for work already streamed to the
// client. `recordFinish` takes max(observed, terminal), so a clean
// finish still uses the exact `.info` count (>= the chunk tally).
var streamedTokens = 0
var terminalCompletion: Int? = nil
var reportedPrompt = promptCount
// Defensive: the greedy text-only fast path should never see a parsed
// tool call (tool requests are kept on the engine path by the caller's
// `allowFastPath` gate). If one is surfaced anyway we cannot faithfully
// reproduce the engine's raw-text behavior, so we FAIL rather than drop.
var sawToolCall = false

for await gen in genStream {
// Cooperative cancellation: a client cancel / model reload cancels
// this task; break and let the finish bookkeeping below run.
if Task.isCancelled { break }
switch gen {
case .chunk(let text):
if !sawFirstToken {
sawFirstToken = true
await scheduler.recordFirstToken(requestId: id, at: .now)
}
streamedTokens += 1
if !text.isEmpty {
continuation.yield(.chunk(text))
}
case .info(let info):
reportedPrompt = info.promptTokenCount
terminalCompletion = info.generationTokenCount
case .toolCall:
// `container.generate` parsed a tool call (and may have
// CONSUMED its text rather than emitting it as `.chunk`s).
// Silently dropping it would lose the call; the engine path
// emits raw text and never `.toolCall`, so we cannot match it
// here. Mark failure and stop.
sawToolCall = true
}
if sawToolCall { break }
}

let cancelled = Task.isCancelled
// Billing-safe completion count: terminal exact count when present,
// otherwise the streamed-chunk lower bound (covers cancel + tool-call
// failure, where no `.info` arrived).
let completionTokens = max(terminalCompletion ?? 0, streamedTokens)
let succeeded = !cancelled && !sawToolCall
// Reuse the engine bridge's finish bookkeeping: removes the bridge,
// updates the decode + prefill EWMA, releases the KV reservation, and
// returns billing-safe usage counts (max of observed vs. terminal).
let usage = await scheduler.recordFinish(
requestId: id,
promptTokens: reportedPrompt,
completionTokens: completionTokens,
success: succeeded)

// Emit delivered usage (so a listener can bill partial work) before
// any terminal error, mirroring the engine bridge.
if !succeeded, usage.promptTokens > 0 || usage.completionTokens > 0 {
continuation.yield(.info(
promptTokens: usage.promptTokens,
completionTokens: usage.completionTokens,
tokensPerSecond: usage.tps))
}
if cancelled {
continuation.yield(.error("request cancelled"))
} else if sawToolCall {
continuation.yield(.error(
"fast path does not support tool calls; please retry"))
} else {
continuation.yield(.info(
promptTokens: usage.promptTokens,
completionTokens: usage.completionTokens,
tokensPerSecond: usage.tps))
}
continuation.finish()
await scheduler.clearFastPathTask(id)
}
fastPathTasks[id] = task
}

// MARK: - Task tracking / teardown

/// Remove a finished fast-path task handle. Called by the task itself on
/// completion. Safe for an unknown id.
func clearFastPathTask(_ id: String) {
fastPathTasks.removeValue(forKey: id)
}

/// Cancel the in-flight fast-path task for `id`, if any. Returns true when a
/// task existed and was cancelled. The task observes `Task.isCancelled`,
/// runs its finish bookkeeping (KV release, bridge removal, terminal events)
/// and clears its own handle.
@discardableResult
func cancelFastPathTask(_ id: String) -> Bool {
guard let task = fastPathTasks[id] else { return false }
task.cancel()
return true
}

/// Cancel every in-flight fast-path task (model reload / `cancelAll`). Each
/// task self-removes; callers that also clear `fastPathTasks` (e.g.
/// `stopCurrentEngine`) make late `clearFastPathTask` calls harmless no-ops.
func cancelAllFastPathTasks() {
for task in fastPathTasks.values { task.cancel() }
}

/// Cancel AND fence every in-flight fast-path task — used by
/// `stopCurrentEngine` before it nil's `modelContainer` and clears the MLX
/// cache. Unlike the engine (which is fenced by `stopAndWait`), a fast-path
/// task runs off-engine inside `ModelContainer.generate`, holding and running
/// GPU work against the model + its KV cache. If teardown freed that state
/// while a task were still mid-`generate`, it could touch released model/MLX
/// state. Awaiting each task's value blocks until it has observed
/// cancellation, run its finish bookkeeping (KV release + bridge removal +
/// terminal events) and dropped its model/iterator references.
///
/// The handles are snapshotted first so a self-removing `clearFastPathTask`
/// during the awaits cannot mutate the collection being iterated. The await
/// suspends the actor so those actor-isolated callbacks make progress; no NEW
/// fast path can start meanwhile because `stopCurrentEngine` has already
/// nil'd `engine` (every submit path short-circuits on a nil engine).
/// Idempotent: a no-op when nothing is in flight.
func waitForFastPathTasks() async {
let inflight = Array(fastPathTasks.values)
guard !inflight.isEmpty else { return }
for task in inflight { task.cancel() }
for task in inflight { await task.value }
fastPathTasks.removeAll()
}
}

// MARK: - Test support
//
// Internal + `@testable`-only; dead-code-stripped from production binaries.

extension BatchScheduler {
/// Force the B=1 fast-path enablement gate on/off, bypassing the env flags.
/// `nil` restores env-driven behavior. Lets a benchmark A/B the fast path vs.
/// the batched engine in a single process (mutating `ProcessInfo`'s cached
/// environment mid-run is unreliable).
func _setForceB1FastPathForTest(_ value: Bool?) {
_forceB1FastPathForTest = value
}

/// Test accessor: number of in-flight fast-path tasks currently tracked.
func _fastPathTaskCountForTest() -> Int { fastPathTasks.count }
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 [CRITICAL] 🧩 370-line fast path implementation with single consumer adds significant complexity

💡 Suggestion: Consider if this optimization justifies the maintenance burden - the fast path duplicates engine functionality, adds new failure modes, and requires careful coordination with the existing batched engine

📊 Score: 4×4 = 16 · Category: over-abstraction

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants