feat(provider): adopt Gemma4 vMLX decode stack by Gajesh2007 · Pull Request #470 · Layr-Labs/d-inference

Gajesh2007 · 2026-06-26T06:12:56Z

Summary

Points libs/mlx-swift-lm at Layr-Labs/mlx-swift-lm@main after feat: inherit Gemma4 vMLX decode stack mlx-swift-lm#56.
Points libs/mlx-swift at e20ea3d (feat/inherit-upstream-2026-06) intentionally, not main, because mlx-swift@main currently declares swift-tools-version: 6.3;(experimentalCGen) while d-inference is pinned to Swift 6.1.
Adds a gated provider-side Gemma4DecodeProfileTests live benchmark for Gemma-4-26B-A4B raw B=1 decode and optional text sample output.

Results

Integrated local package release benchmark: 79.5 tok/s on mlx-community/gemma-4-26b-a4b-it-8bit (M4 Max, B=1, 128 generated tokens).
1k/1k real-content benchmark: 75.16 tok/s decode, 0.9315 ms/token prefill, coherent 4.7k-char output.
Prior local provider stack raw baseline: ~39 TPS.
Python mlx-vlm on the same machine/model: 69.7 TPS.
Real text sample was coherent with DARKBLOOM_GEMMA_PRINT_TEXT=1.

Before

flowchart LR
  A[Provider loads Gemma4 VLM] --> B[old submodule commits]
  B --> C[old Gemma4 VLM/MoE/RoPE path]
  C --> D[raw B=1 decode]
  D --> E[~39 TPS]

After

flowchart LR
  A[Provider loads Gemma4 VLM] --> B[mlx-swift e20ea3d + mlx-swift-lm main]
  B --> C[vMLX-derived Gemma4 VLM hot path]
  C --> D[compiled GLU helpers + weighted expert combine + KV compile foundation]
  D --> E[raw B=1 decode]
  E --> F[79.5 TPS release raw bench]

Verification

cd /tmp/dbval/gemma4rawbench && swift run -c release Gemma4RawBench against this repo's local packages: 79.5 TPS.
1k/1k benchmark artifacts: /tmp/dbval/gemma4_1k_output.txt, /tmp/dbval/gemma4_1k_prompt_ids.json, /tmp/dbval/gemma4_1k_output_ids.json.
cd /tmp/dbval/d-inference-pr-main/provider-swift && swift build --build-tests: passed after cleaning stale SwiftPM cache.
Gated provider test added: DARKBLOOM_LIVE_MLX_TESTS=1 DARKBLOOM_LIVE_MLX_GEMMA=1 swift test --filter Gemma4DecodeProfileTests.

Notes

mlx-swift@main should be adopted later only after either d-inference moves to Swift 6.3 or mlx-swift main restores a Swift-6.1-compatible manifest. This PR keeps the documented Swift 6.1 toolchain path working.
Release provider swift test -c release remains blocked by an unrelated kv-se-harness compile issue, so the fair release throughput number was measured with a small temp executable depending on the same local packages.

vercel · 2026-06-26T06:13:01Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
d-inference	Ready	Preview	Jun 26, 2026 8:24pm
d-inference-console-ui-dev	Ready	Preview	Jun 26, 2026 8:24pm
d-inference-landing	Ready	Preview	Jun 26, 2026 8:24pm

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 269e7fc751

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-26T06:23:12Z

+        if LiveInferenceFixtures.ensureMetallibColocated() == nil {
+            Issue.record("mlx.metallib not found near test bundle or in MLX_METALLIB_PATH/SOURCE")
+        }


Return when metallib bootstrap fails

When this live profile is enabled on a machine without mlx.metallib colocated, the test records an issue but then continues into MLX setup/model execution. LiveInferenceFixtures.ensureMetallibColocated() is documented and used elsewhere as a guard because MLX can crash on the first GPU call without the metallib, so this intended skip/failure path can abort the whole filtered benchmark run instead of exiting cleanly.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-26T06:23:12Z

@@ -1 +1 @@
-Subproject commit 2b4b0d8dd123b7fde4f8f5fbcfd4fe8e3e4fe242
+Subproject commit 461a0ab01c42fdae4d0b400259203ae3c7009224


Restore Gemma4 video preprocessing

For Gemma4 video_url requests, this submodule bump drops the processor path that consumed input.videos: the previous Gemma4.swift built LMInput.ProcessedVideo and expanded <|video|> placeholders, while the new Gemma4.swift has no input.videos/ProcessedVideo handling. VLMRequestInference.buildUserInput still attaches videos and routes media requests through the container's processor, so advertised Gemma4 video inputs will be sent to generation without frame features (or hit placeholder/feature mismatches) rather than being answered from the video.

Useful? React with 👍 / 👎.

ethenotethan

Automated Code Review — Layr-Labs/d-inference#

Verdict: COMMENT

Security — ✅ No issues found

Performance — 1 finding(s)

🔵 [INFO] provider-swift/Tests/ProviderCoreTests/Gemma4DecodeProfileTests.swift:93 — Array grows in loop without pre-allocation despite reserveCapacity
- Suggestion: Move reserveCapacity(64) before the loop or use a fixed-size array since the loop count is known (64 iterations)

Type_diligence — ✅ No issues found

Additive_complexity — ✅ No issues found

1 finding(s) total, 0 blocking. Verdict: COMMENT.

🤖 Automated review by Centaur · DAR-186

ethenotethan

Automated Code Review — Layr-Labs/d-inference#

Verdict: COMMENT

Security — ✅ No issues found

Performance — 1 finding(s)

🔵 [INFO] provider-swift/Tests/ProviderCoreTests/Gemma4DecodeProfileTests.swift:93 — Array grows in loop without pre-allocation despite known capacity
- Suggestion: Move reserveCapacity(64) before the loop or initialize with capacity: var generated = [Int](reserving: 64)

Type_diligence — ✅ No issues found

Additive_complexity — ✅ No issues found

1 finding(s) total, 0 blocking. Verdict: COMMENT.

🤖 Automated review by Centaur · DAR-186

github-actions · 2026-06-26T06:36:21Z

This PR introduces a performance-oriented B=1 greedy fast-path for single exclusive inference requests; the security posture is neutral-to-slightly-positive with one new attack surface item worth tracking.

Trust Boundaries Touched

TB-002 (Coordinator → Provider WebSocket) — billing-safe admission accounting, KV budget
TB-003 (Provider Operator vs Process) — new GPU execution path, cancellation/teardown
TB-007 (Provider Inference Engine) — primary surface: new code path for token generation, KV reservation, cancellation, model-teardown fencing

Threat Analysis

T-007 / T-027 — Model weight substitution / manipulated outputs
ℹ️ Neutral. The fast path calls ModelContainer.generate — the same container the VLM path already uses. It doesn't bypass weight loading, model identity, or the hash advertised to the coordinator. No change to the weight-hash advertisement or coordinator enforcement (SEC-007 remains open).

T-008 — Provider sends plaintext SSE chunks on encryption failure (SEC-016)
ℹ️ Neutral. The fast path produces GenerationEvent values that flow through the same downstream encryption layer the batched engine uses. The truncated diff doesn't show runGreedyFastPath's terminal event handling in full, but the continuation.yield(.error(...)) rejection patterns are consistent with fail-closed semantics. Reviewers should confirm that runGreedyFastPath (the body not shown in the truncated portion of BatchScheduler+B1FastPath.swift) emits .error — not a plaintext data chunk — on any encryption or generation failure, and does not populate a plaintext data field (mirroring the existing SEC-016 risk in ProviderLoop.swift:436–451).

T-028 — Residual inference data in GPU memory
ℹ️ Neutral, with a minor note. stopCurrentEngine now calls await waitForFastPathTasks() before MLX.Memory.clearCache() (BatchScheduler.swift, lines ~414–428). This fence is correct and security-relevant: without it, a fast-path task still holding a live GPU KV cache could race the clearCache() call, leaving residual prompt tensors in an inconsistent state. The fence tightens teardown relative to the status quo. The underlying open issue (GPU buffers not explicitly zeroed between requests at the Metal level) is unchanged.

T-041 — Cross-tenant prefix-cache sharing / TTFT timing oracle
ℹ️ Neutral. The fast path is Gemma-only and temperature-0 (greedy), so prefix-cache interactions are scoped. KV reservation/release goes through reserveKVForRequest / releaseRequestResources — the same accounting path — so the budget tracking that guards cache capacity is preserved. The fast path does not introduce new sharing semantics.

T-010 — Cancellation not propagated to inference engine
✅ Strengthens. cancel(requestId:) now checks cancelFastPathTask first and returns early (BatchScheduler+Submit.swift:491–496). cancelAll() calls cancelAllFastPathTasks() before the engine abort (BatchScheduler+Submit.swift:530–535). Fast-path tasks are explicitly torn down in stopCurrentEngine via waitForFastPathTasks() (BatchScheduler.swift:414–428). This is a meaningful improvement in cancellation coverage for the new execution path.

New Attack Surface Not Covered by an Existing Threat

Env-flag side-channel for operational state inference (BatchScheduler+B1FastPath.swift:45–48)

b1GreedyFastPathEnabled() reads ProcessInfo.processInfo.environment on every eligibility check ("cheap" per the comment). This is fine for a provider-operator–controlled process. However, the rejection path for non-fast-path requests while a fast-path task is active surfaces a new observable timing/error channel to the consumer:

"token_budget_exhausted: a single-request fast path is active; retry shortly"

(BatchScheduler+Submit.swift:107–111 and BatchScheduler+Submit.swift:300–306)

This error string leaks to the consumer that the provider is currently processing exactly one exclusive greedy request. Under the threat model (ADV-002, malicious consumer), a consumer who can probe retry behavior could infer:

Whether the fast path is enabled on this provider (feature fingerprinting).
Whether another consumer is actively mid-inference (presence oracle, similar in kind to the TTFT timing channel in T-041).

Recommendation: use a generic retryable error string (e.g. the existing "token_budget_exhausted" without the explanatory suffix) or the same opaque message the KV-budget rejection uses. The informational suffix has no consumer-facing value and narrows the ambiguity consumers currently have about provider internal state.

Open Findings Resolved

None. SEC-007, SEC-016, SEC-035 remain open and are untouched by this PR.

Summary

The GPU teardown fence (waitForFastPathTasks before clearCache) and the explicit fast-path cancellation wiring are the two concrete security improvements here. The one actionable item is the overly descriptive rejection error string that leaks fast-path execution state to consumers — suggest trimming it to a generic token_budget_exhausted before merge.

🔐 Threat model: docs/threat-model.yaml · Updates on each push to this PR

ethenotethan

Automated Code Review — Layr-Labs/d-inference#

Verdict: COMMENT

Security — ✅ No issues found

Performance — 1 finding(s)

🔵 [INFO] provider-swift/Tests/ProviderCoreTests/Gemma4DecodeProfileTests.swift:94 — Array grows in loop without pre-allocation despite reserveCapacity
- Suggestion: Move reserveCapacity(64) before the loop or use a fixed-size array since the loop count is known (64 iterations)

Type_diligence — ✅ No issues found

Additive_complexity — ✅ No issues found

1 finding(s) total, 0 blocking. Verdict: COMMENT.

🤖 Automated review by Centaur · DAR-186

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e0d0c75412

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-26T06:46:14Z

@@ -1 +1 @@
-Subproject commit 3c50ad693a7a3fbfbcc1d7ddf834a37e9ee8cf14
+Subproject commit 3b6af6e26b009995e127c9c1921f185c54aed379


Keep mlx-swift compatible with the pinned Swift toolchain

This gitlink now points the provider's local ../libs/mlx-swift dependency at a commit whose Package.swift declares // swift-tools-version: 6.3;(experimentalCGen), while the repo still documents and pins Swift 6.1 in mise.toml and provider-swift/Package.swift. Developers following the documented mise install && make provider-build path, and any CI/release runner with Swift <6.3, will fail during package loading with a minimum tools-version error before any provider code compiles; either keep this submodule on a Swift-6.1-compatible commit or update the repo/toolchain/workflows together.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-26T06:46:14Z

@@ -1 +1 @@
-Subproject commit b1df0f22424f94005685baf79192ff08e9c43eb9
+Subproject commit 7ce9e45b40add0d6e66bc69aeed6bfb896009e04


Preserve Gemma4's bidirectional vision mask

For Gemma4 image_url requests, this submodule bump removes the VLM text path that built a visualTokenMask and passed it as imageTokenMask so use_bidirectional_attention == "vision" could overlay bidirectional attention within image-token spans; the new Gemma4 path has no imageTokenMask/useBidirectionalAttention handling and always builds ordinary causal masks. That means earlier soft image tokens cannot attend to later patches during prefill, corrupting the visual features the language model consumes even though image inputs are still accepted.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-26T06:46:14Z

@@ -1 +1 @@
-Subproject commit b1df0f22424f94005685baf79192ff08e9c43eb9
+Subproject commit 7ce9e45b40add0d6e66bc69aeed6bfb896009e04


Keep Gemma4 visual delimiter tokens

For Gemma4 image prompts, the previous processor expanded each <|image|> placeholder as boi + image_token * count + eoi, matching the model config's boi_token_id/eoi_token_id; this submodule bump replaces that with only repeated image tokens and the new config no longer reads those delimiter ids. Requests that include image_url are therefore still routed to Gemma4 but the visual span is missing the boundary tokens the model was trained to see, so image understanding can degrade or fail even when the feature count matches.

Useful? React with 👍 / 👎.

Gajesh2007 · 2026-06-26T06:51:13Z

Provider-stack Gemma4 benchmark

Ran a release executable depending on ProviderCore and the PR's local packages, using the real BatchScheduler path (loadModel + submitTokenized) with Gemma-4-26B-A4B 8-bit.

Metric	Value
Prompt tokens	973
Requested output	512 tokens
TTFT (submit → first chunk)	2180.0 ms
Total request time	10586.8 ms
Completion chunks	512
Decode window	8389.6 ms
Measured decode TPS	60.91 tok/s
Scheduler-reported TPS	60.90 tok/s

Notes:

This is the full d-inference provider stack, not the raw model loop.
Raw release model loop remains ~79.5 tok/s; provider stack overhead + streaming + ~1k prompt KV growth bring end-to-end decode to ~61 tok/s for this long prompt.
Output was coherent long-form technical prose; sample saved at /tmp/dbval/provider_gemma4_stack_output.txt.

Excerpt:

Architectural Optimization of Sparse Mixture-of-Experts Inference on Apple Silicon\n\nThe deployment of Sparse Mixture-of-Experts (SMoE) models, such as Mixtral 8x7B, presents a unique computational paradox on Unified Memory Architecture (UMA) systems like Apple Silicon...

ethenotethan

Automated Code Review — Layr-Labs/d-inference#

Verdict: COMMENT

Security — ✅ No issues found

Performance — 1 finding(s)

🔵 [INFO] provider-swift/Tests/ProviderCoreTests/Gemma4DecodeProfileTests.swift:94 — Array grows in loop without pre-allocation despite reserveCapacity
- Suggestion: Move reserveCapacity(64) before the loop or use a fixed-size array since the loop count is known (64 iterations)

Type_diligence — ✅ No issues found

Additive_complexity — ✅ No issues found

1 finding(s) total, 0 blocking. Verdict: COMMENT.

🤖 Automated review by Centaur · DAR-186

ethenotethan

Automated Code Review — Layr-Labs/d-inference#

Verdict: REQUEST_CHANGES

Security — ✅ No issues found

Performance — 2 finding(s) (1 blocking)

🔵 [INFO] provider-swift/Tests/ProviderCoreTests/ContinuousBatchingLiveTests.swift:235 — Array.prefix(while:) creates new array copies in hot path
- Suggestion: Use iterators or indices to avoid copying arrays when checking for EOS tokens and computing prefix matches
🟡 [MEDIUM] provider-swift/Tests/ProviderCoreTests/Gemma4DecodeProfileTests.swift:94-95 — Array grows in loop without pre-allocation despite reserveCapacity
- Suggestion: Use generated.append(token) directly since capacity is already reserved, or pre-allocate with Array(repeating:count:) and use index-based assignment

Type_diligence — ✅ No issues found

Additive_complexity — ✅ No issues found

2 finding(s) total, 1 blocking. Verdict: REQUEST_CHANGES.

🤖 Automated review by Centaur · DAR-186

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e0b67c5f45

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-26T17:51:32Z

@@ -1 +1 @@
-Subproject commit b1df0f22424f94005685baf79192ff08e9c43eb9
+Subproject commit e49e0ff0e43aee0e9e554e1a9120f62f00bcf09c


Reserve Gemma video KV for every sampled frame

With this submodule bump, Gemma4 video preprocessing now samples up to 32 frames and expands each <|video|> to config.imageSeqLength soft tokens per frame, but the provider still reserves a flat 4,096 KV tokens per video in VLMRequestInference.projectedKVTokens (visionTokensPerVideo and the .videoURL branch). For Gemma configs with ~256/280 soft tokens per frame, a full 32-frame clip needs roughly 8–9k vision tokens before text/output, so concurrent video requests can pass the 90% KV cap with about half their actual prompt KV unreserved and then OOM or destabilize the provider during generation.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-26T17:51:32Z

+        )
+    )
+    func gemma4VLMLongContextMixedB3() async throws {
+        try ensureMetallibAvailable()


Stop after missing metallib in live tests

When these new Gemma live tests are enabled on a machine without a colocated mlx.metallib, ensureMetallibAvailable() only records an issue and returns; execution then immediately proceeds into MLX.GPU.set and model loading. LiveInferenceFixtures.ensureMetallibColocated() documents that callers should skip instead of reaching the first GPU call, and the new Gemma4DecodeProfileTests path does return on nil, so these added ContinuousBatching tests can still abort a filtered benchmark run rather than exiting cleanly.

Useful? React with 👍 / 👎.

ethenotethan

Automated Code Review — Layr-Labs/d-inference#

Verdict: COMMENT

Security — ✅ No issues found

Performance — ✅ No issues found

Type_diligence — ✅ No issues found

Additive_complexity — ✅ No issues found

✅ All four passes clean. No issues found.

🤖 Automated review by Centaur · DAR-186

Gajesh2007 · 2026-06-26T18:25:16Z

Provider B=1 fast path benchmark

Added an env-gated B=1 greedy fast path (DARKBLOOM_B1_GREEDY_FAST_PATH=1 or DARKBLOOM_GEMMA_B1_FAST_PATH=1) that bypasses continuous batching for one exclusive greedy request and uses the proven single-stream ModelContainer.generate path.

Corrected PR pins during this run:

libs/mlx-swift: e20ea3d
libs/mlx-swift-lm: 791c17f

Release provider-stack result (Gemma-4-26B-A4B 8-bit)

Mode	Decode TPS	Decode window	Prompt	Output
Batched engine + fast BatchRotatingKV	~63 tok/s	~8.1s	973	512
B=1 greedy fast path	76.48 tok/s	6694.8 ms	973	512

Other details:

TTFT: 811.2 ms
Total: 7509.0 ms
Scheduler-reported TPS: 76.45 tok/s
Output remained coherent.
Log: /tmp/dbval/provider_gemma4_b1fast_clean_rerun.log

This closes the B=1 provider path gap to the raw model loop (~75-79 tok/s) while keeping the normal batched engine path as the default unless explicitly enabled by env.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6819a45854

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-26T18:36:54Z

+            let kvOutcome = await reserveKVForRequest(
+                requestId: id,
+                requestTokens: requestBudget,
+                reservationTokens: requestBudget,
+                restorePlanned: false


Reserve fp16 KV for the fast path

When the B=1 path is enabled on a KV-quantized model, this branch reserves through reserveKVForRequest, which charges kvBytesPerToken for the batched/quantized cache. The request then runs through ModelContainer.generate, the same non-batched path that reserveVisionRequest explicitly accounts with fp16KVBytesPerToken; as a result the fast-path request can hold substantially more KV than it reserved and let later admissions pass the 90% cap, causing OOMs under concurrent traffic. Use the fp16 byte reservation or disable this path when KV quantization is active.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-26T18:36:54Z

+                    Task { await scheduler.cancel(requestId: id) }
+                }
+            }
+            return stream


Keep fast-path requests in the concurrency gate

This returns before planner.admit, so the planner never counts the direct ModelContainer.generate request as active. While that request is decoding, a later non-fast-path submit can still be queued and scheduled because BatchQueuePlanner sees active.count == 0 even when maxConcurrentRequests == 1, which breaks the branch's “single exclusive” assumption and can overlap engine work with the fast path. Either register the fast-path request with the planner or block new engine admissions until it finishes.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-26T18:36:54Z

+        cancelAllFastPathTasks()
+        fastPathTasks.removeAll()


Wait for fast-path tasks before clearing MLX state

On unload/reload or a liveness self-restart during a fast-path generation, this only cancels the task and immediately drops the handle before teardown continues to nil the container and clear MLX memory. Swift cancellation is cooperative, so the task may still be inside container.generate or waiting for its next event while the old model/KV memory is being released, risking crashes or use-after-teardown; keep the handles and await/fence those tasks before clearing model resources.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-26T18:36:54Z

+            let lmInput = LMInput(tokens: MLXArray(promptTokens))
+            // temperature 0 ⇒ ArgMaxSampler. topP/topK/penalties left at their
+            // defaults are inert under greedy. maxTokens bounds the decode.
+            let params = GenerateParameters(maxTokens: maxTokens, temperature: 0)


Pass effective EOS tokens through the fast path

When the fast path is enabled for models whose stop tokens are augmented at load time, such as GPT-OSS/Harmony via effectiveEOSTokenIds, this direct ModelContainer.generate call does not receive the same EOS set that the batched Scheduler is built with. Those requests can therefore continue past model-specific return/call/end tokens until maxTokens or leak control tokens, while the engine path would stop correctly; thread the effective EOS set into the fast path or restrict it to models whose container config already contains the full stop set.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-26T18:36:54Z

+                case .chunk(let text):
+                    if !sawFirstToken {
+                        sawFirstToken = true
+                        await scheduler.recordFirstToken(requestId: id, at: .now)
+                    }


Count streamed tokens before cancellation

If a client disconnects after receiving chunks but before ModelContainer.generate emits its terminal .info, the cancellation path breaks out with completionTokens still at 0 because chunk handling only records first-token timing. recordFinish then reports/bills zero completion tokens even though text was already delivered; update progress as chunks arrive or otherwise preserve delivered token counts before honoring cancellation.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-26T18:36:54Z

+            topK: topK,
+            seed: seed,
+            maxTokens: maxTokens,
+            cacheScope: cacheScope
+        )


Preserve context-window checks on the fast path

When the fast path is enabled, eligibility is decided without the prompt length or maxContextLength, and this branch returns before the planner's maxTokensPerBatch/context-window rejection runs. A greedy request whose prompt is longer than the model context but still fits the memory token budget can therefore be sent directly to ModelContainer.generate instead of producing the deterministic context error the engine path emits, risking runtime failures or malformed output for over-context prompts.

Useful? React with 👍 / 👎.

ethenotethan

Automated Code Review — Layr-Labs/d-inference#

Verdict: REQUEST_CHANGES

Security — 1 finding(s)

🔵 [INFO] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:47-49 — Environment variable values read without validation
- Suggestion: Validate environment variable values are exactly "1" rather than using == comparison which could match unintended values

Performance — 2 finding(s) (1 blocking)

🟡 [MEDIUM] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:139 — Synchronous container.generate call blocks the actor on hot path
- Suggestion: Move the container.generate call outside the actor context or use async/await properly to avoid blocking the scheduler actor
🔵 [INFO] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:155-175 — Unbounded string accumulation in generation loop
- Suggestion: Pre-allocate string buffer capacity or use streaming approach to avoid repeated string reallocations during text generation

Type_diligence — ✅ No issues found

Additive_complexity — 1 finding(s) (1 blocking)

🟡 [MEDIUM] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:1-279 — New 279-line fast path implementation duplicates existing functionality
- Suggestion: Consider if this optimization justifies the added complexity. The fast path reimplements generation logic that already exists in the batched engine, adding ~300 lines of code and new state tracking for a performance optimization that only applies to single greedy requests.

4 finding(s) total, 2 blocking. Verdict: REQUEST_CHANGES.

🤖 Automated review by Centaur · DAR-186

Gajesh2007 · 2026-06-26T19:14:15Z

Review fixes: media KV reservation + live metallib guards

Addressed two provider-side review findings:

Reserve Gemma video KV for every sampled frame
- VLMRequestInference now reserves for maxVideoFramesSampled (32) * visionTokensPerImage instead of a flat 4096 tokens.
- This safely covers Gemma4's video path where each sampled frame expands into a visual soft-token block.
- Added VLMCapReservationTests.projectedKVTokensReservesEverySampledVideoFrame.
Stop after missing metallib in live tests
- ContinuousBatchingLiveTests.ensureMetallibAvailable() now returns false on missing metallib, and all call sites return before GPU setup/model loading.

Verification:

swift build --build-tests: passed.
swift test --filter VLMCapReservationTests: 14/14 passed.
swift test --filter VLMRequestInferenceTests: 44/44 passed.

ethenotethan

Automated Code Review — Layr-Labs/d-inference#

Verdict: REQUEST_CHANGES

Security — 1 finding(s)

🔵 [INFO] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:47-49 — Environment variable values read without validation
- Suggestion: Validate environment variable values are exactly "1" rather than using string equality check that could match unintended values

Performance — 2 finding(s) (1 blocking)

🟡 [MEDIUM] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:158 — Synchronous container.generate call blocks the actor on hot path
- Suggestion: Use async/await pattern or move the blocking operation off the actor's execution context
🔵 [INFO] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:225 — Dictionary grows without bounds as fast path tasks accumulate
- Suggestion: Add periodic cleanup or size limits to fastPathTasks dictionary to prevent memory growth

Type_diligence — ✅ No issues found

Additive_complexity — 3 finding(s) (1 blocking)

🟡 [MEDIUM] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:1-279 — New 279-line fast path adds complexity for marginal performance gain
- Suggestion: Consider if the 20% TPS improvement justifies adding a parallel execution path with its own task tracking, eligibility logic, and test overrides. The batched engine already handles B=1 cases.
🔵 [INFO] provider-swift/Sources/ProviderCore/Inference/BatchScheduler.swift:264-278 — Three new state fields for fast path tracking
- Suggestion: The fastPathTasks dictionary and test override field add state management burden. Consider if the performance benefit outweighs the maintenance cost of parallel execution paths.
🔵 [INFO] provider-swift/Tests/ProviderCoreTests/B1GreedyFastPathTests.swift:1-294 — 294-line test file for eligibility policy that could be simpler
- Suggestion: The eligibility logic is a pure function with 9 parameters. Consider if this level of parameterization and testing is proportional to the feature's complexity.

6 finding(s) total, 2 blocking. Verdict: REQUEST_CHANGES.

🤖 Automated review by Centaur · DAR-186

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8e4f59a49d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-26T19:23:05Z

+                case .info(let info):
+                    reportedPrompt = info.promptTokenCount
+                    completionTokens = info.generationTokenCount
+                case .toolCall:


Forward fast-path tool calls

When the env-gated fast path handles a default-greedy request that includes tools, this branch drops the .toolCall events produced by ModelContainer.generate. MultiModelBatchSchedulerEngine.streamChatCompletion only builds downstream tool calls from raw .chunk text via BatchedToolStreamHandler, and submitTokenized does not pass a “has tools” flag into eligibility, so tool requests can enter this path and complete with no tool call/content. Either disqualify tool requests from the fast path or translate these events into the caller’s expected stream.

Useful? React with 👍 / 👎.

ethenotethan

Automated Code Review — Layr-Labs/d-inference#

Verdict: REQUEST_CHANGES

Security — 1 finding(s)

🔵 [INFO] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:47-49 — Environment variable values not validated
- Suggestion: Validate environment variable values are exactly "1" rather than using string equality check that could match unintended values

Performance — 3 finding(s) (1 blocking)

🟡 [MEDIUM] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:139 — Synchronous container.generate call blocks the actor on hot path
- Suggestion: Use async/await pattern or move the blocking operation off the actor's execution context
🔵 [INFO] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:155 — MLXArray(promptTokens) creates unbounded host-side copy
- Suggestion: Pre-allocate or stream tokens to avoid large memory copies for very long prompts
🔵 [INFO] provider-swift/Sources/ProviderCore/Inference/VLMRequestInference.swift:133 — Multiplication computed repeatedly in hot path
- Suggestion: Pre-compute VLMRequestInference.maxVideoFramesSampled * VLMRequestInference.visionTokensPerImage as a static constant

Type_diligence — ✅ No issues found

Additive_complexity — 2 finding(s) (1 blocking)

🟡 [MEDIUM] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:1-279 — New 279-line fast path implementation duplicates existing functionality
- Suggestion: Consider if this optimization justifies the added complexity. The fast path reimplements generation logic that already exists in the batched engine, adding ~300 lines of code and a parallel execution path that must be maintained alongside the existing engine. The performance gain (75 vs 63 TPS) may not justify doubling the inference code paths.
🔵 [INFO] provider-swift/Sources/ProviderCore/Inference/BatchScheduler.swift:264-278 — Additional state tracking for parallel execution path
- Suggestion: The fastPathTasks dictionary and _forceB1FastPathForTest flag add state management complexity. Consider if the performance benefit justifies maintaining two separate execution paths with their own lifecycle management.

6 finding(s) total, 2 blocking. Verdict: REQUEST_CHANGES.

🤖 Automated review by Centaur · DAR-186

Gajesh2007 · 2026-06-26T20:24:18Z

B=1 fast-path safety hardening

Addressed the B1 fast-path review findings conservatively:

Fast path is now ineligible when KV quantization is active.
Fast path is Gemma-family only.
Engine-path submissions are blocked while a fast-path task is active, preserving the single-exclusive assumption.
stopCurrentEngine now cancels and awaits fast-path tasks before model/MLX teardown.
Context-window eligibility added.
Tool-call events are not silently dropped: unexpected .toolCall fails the fast path.
Cancellation accounting now preserves delivered chunk count instead of billing zero after streamed output.
MultiModelBatchSchedulerEngine threads allowFastPath: toolHandler == nil so tool requests stay on the engine path.

Verification:

swift build --build-tests: passed.
swift test --filter B1GreedyFastPathTests: 15/15 passed.
Release Gemma4 provider fast-path benchmark after hardening: 76.06 tok/s (973 prompt / 512 output), coherent output.
- Log: /tmp/dbval/provider_gemma4_b1fast_safety.log

ethenotethan

Automated Code Review — Layr-Labs/d-inference#

Verdict: REQUEST_CHANGES

Security — ✅ No issues found

Performance — 2 finding(s) (1 blocking)

🟡 [MEDIUM] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:206 — Synchronous container.generate call blocks actor on hot path
- Suggestion: Use async/await pattern or move the blocking call off the actor's execution context
🔵 [INFO] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:370 — Task spawned without explicit cancellation handling in parent context
- Suggestion: Consider using TaskGroup or structured concurrency to ensure proper task lifecycle management

Type_diligence — ✅ No issues found

Additive_complexity — 1 finding(s) (1 blocking)

🔴 [CRITICAL] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:1-370 — 370-line fast path implementation duplicates engine functionality with single consumer
- Suggestion: Consider if this optimization justifies the maintenance burden of duplicating request lifecycle, KV management, and error handling that already exists in the batched engine

3 finding(s) total, 2 blocking. Verdict: REQUEST_CHANGES.

🤖 Automated review by Centaur · DAR-186

ethenotethan

Automated Code Review — Layr-Labs/d-inference#

Verdict: REQUEST_CHANGES

Security — ✅ No issues found

Performance — 3 finding(s) (1 blocking)

🟡 [MEDIUM] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:206 — Synchronous container.generate call blocks actor on hot path
- Suggestion: Use async/await pattern or move the blocking call off the actor's execution context
🔵 [INFO] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:190 — MLXArray(promptTokens) creates unbounded allocation in hot path
- Suggestion: Pre-allocate or reuse MLXArray instances, especially for large prompt token arrays
🔵 [INFO] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:340-350 — Task spawned without proper resource limits or cancellation timeout
- Suggestion: Add timeout or resource limits to the spawned Task to prevent unbounded resource usage

Type_diligence — ✅ No issues found

Additive_complexity — 1 finding(s) (1 blocking)

🔴 [CRITICAL] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:1-370 — 370-line fast path implementation with single consumer adds significant complexity
- Suggestion: Consider if this optimization justifies the maintenance burden - the fast path duplicates engine functionality, adds new failure modes, and requires careful coordination with the existing batched engine

4 finding(s) total, 2 blocking. Verdict: REQUEST_CHANGES.

🤖 Automated review by Centaur · DAR-186

ethenotethan · 2026-06-26T21:18:13Z

+            await scheduler.recordAdmission(requestId: id, at: .now)
+
+            let genStream: AsyncStream<Generation>
+            do {


🟡 [MEDIUM] ⚡ Synchronous container.generate call blocks actor on hot path

💡 Suggestion: Use async/await pattern or move the blocking call off the actor's execution context

📊 Score: 3×4 = 12 · Category: blocking_io

ethenotethan · 2026-06-26T21:18:13Z

+        promptTokens: [Int],
+        maxTokens: Int,
+        continuation: AsyncStream<GenerationEvent>.Continuation
+    ) {


🔵 [INFO] ⚡ MLXArray(promptTokens) creates unbounded allocation in hot path

💡 Suggestion: Pre-allocate or reuse MLXArray instances, especially for large prompt token arrays

📊 Score: 2×3 = 6 · Category: unbounded_allocations

ethenotethan · 2026-06-26T21:18:13Z

+    /// The handles are snapshotted first so a self-removing `clearFastPathTask`
+    /// during the awaits cannot mutate the collection being iterated. The await
+    /// suspends the actor so those actor-isolated callbacks make progress; no NEW
+    /// fast path can start meanwhile because `stopCurrentEngine` has already
+    /// nil'd `engine` (every submit path short-circuits on a nil engine).
+    /// Idempotent: a no-op when nothing is in flight.
+    func waitForFastPathTasks() async {
+        let inflight = Array(fastPathTasks.values)
+        guard !inflight.isEmpty else { return }
+        for task in inflight { task.cancel() }
+        for task in inflight { await task.value }


🔵 [INFO] ⚡ Task spawned without proper resource limits or cancellation timeout

💡 Suggestion: Add timeout or resource limits to the spawned Task to prevent unbounded resource usage

📊 Score: 2×3 = 6 · Category: unbounded_goroutines

ethenotethan · 2026-06-26T21:18:13Z

+// Copyright © 2026 Eigen Labs.
+//
+// BatchScheduler B=1 greedy fast path: an env-gated bypass of the
+// continuous-batching engine for a single exclusive, greedy (temperature == 0)
+// request.
+//
+// The batched engine carries continuous-batching overhead a single-row decode
+// does not need: batch tensor (de)allocation per step, the scheduler step loop,
+// the output collector, and cross-thread `RequestOutput` streaming. On Gemma-4
+// that overhead shows up as ~63 TPS through `BatchedEngine` vs ~75 TPS for the
+// raw single-sequence loop (see `Tests/.../Gemma4DecodeProfileTests.swift`).
+//
+// When exactly one request is in flight and it is pure greedy, we run that
+// single-sequence decode through `ModelContainer.generate` — the SAME
+// concurrency-safe path the VLM media path (`VLMRequestInference`) already uses
+// alongside the engine. `ModelContainer.generate` holds the container
+// exclusively only for the prefill, then streams the decode (asyncEval
+// pipelined) on its own task. We translate its `Generation` events to our
+// `GenerationEvent` stream.
+//
+// Safety posture:
+//   * OFF by default; opt in with an env flag.
+//   * Conservative gate — anything that isn't a single exclusive greedy text
+//     request falls back to the batched engine, so the engine path's behavior
+//     is never altered.
+//   * KV byte budget is still reserved/released, and bridge bookkeeping
+//     (heartbeats, decode/prefill EWMA, billing-safe usage) is preserved via
+//     the SAME `recordAdmission` / `recordFirstToken` / `recordFinish` methods
+//     the engine bridge uses.
+//   * The in-flight task is tracked so `cancel` / `cancelAll` /
+//     `stopCurrentEngine` tear it down deterministically.
+
+import Foundation
+import MLX
+import MLXLMCommon
+
+extension BatchScheduler {
+
+    // MARK: - Env gate
+
+    /// True when the operator opted into the B=1 greedy fast path. Two flags are
+    /// accepted: `DARKBLOOM_B1_GREEDY_FAST_PATH` (generic) and
+    /// `DARKBLOOM_GEMMA_B1_FAST_PATH` (Gemma-targeted alias). Either set to `"1"`
+    /// enables it. Read per-call (cheap) so tests can toggle it via the
+    /// environment without restarting the scheduler.
+    static func b1GreedyFastPathEnabled() -> Bool {
+        let env = ProcessInfo.processInfo.environment
+        return env["DARKBLOOM_B1_GREEDY_FAST_PATH"] == "1"
+            || env["DARKBLOOM_GEMMA_B1_FAST_PATH"] == "1"
+    }
+
+    // MARK: - Eligibility
+
+    /// Whether this request can take the single-exclusive greedy fast path.
+    ///
+    /// MUST be evaluated BEFORE the request's own bridge is inserted into
+    /// `activeBridges` — the exclusivity check reads `activeBridges.count`.
+    /// Every condition is conservative: a miss simply defers to the batched
+    /// engine, so this can only shrink the set of requests the fast path serves,
+    /// never change the engine path's correctness. The decision itself is a pure
+    /// function (`b1FastPathEligiblePure`) so it can be unit-tested exhaustively
+    /// without a loaded model.
+    func b1FastPathEligible(
+        temperature: Float,
+        topP: Float?,
+        topK: Int?,
+        seed: UInt64?,
+        promptTokenCount: Int,
+        maxTokens: Int,
+        cacheScope: String,
+        allowFastPath: Bool
+    ) -> Bool {
+        Self.b1FastPathEligiblePure(
+            // Test override wins when set; otherwise consult the env flags.
+            enabled: _forceB1FastPathForTest ?? Self.b1GreedyFastPathEnabled(),
+            allowFastPath: allowFastPath,
+            modelId: modelId,
+            kvQuantEnabled: kvQuantEnabled,
+            temperature: temperature,
+            topP: topP,
+            topK: topK,
+            seed: seed,
+            promptTokenCount: promptTokenCount,
+            maxTokens: maxTokens,
+            maxContextLength: maxContextLength,
+            cacheScope: cacheScope,
+            activeBridgeCount: activeBridges.count,
+            pendingRequestCount: pendingRequestCount,
+            fastPathActive: !fastPathTasks.isEmpty,
+            hasContainer: modelContainer != nil
+        )
+    }
+
+    /// Pure eligibility policy for the B=1 greedy fast path. No actor state — all
+    /// inputs are parameters — so it is fully unit-testable. Order is irrelevant
+    /// to the result (all conditions must hold), but kept cheapest-first.
+    static func b1FastPathEligiblePure(
+        enabled: Bool,
+        allowFastPath: Bool,
+        modelId: String,
+        kvQuantEnabled: Bool,
+        temperature: Float,
+        topP: Float?,
+        topK: Int?,
+        seed: UInt64?,
+        promptTokenCount: Int,
+        maxTokens: Int,
+        maxContextLength: Int,
+        cacheScope: String,
+        activeBridgeCount: Int,
+        pendingRequestCount: Int,
+        fastPathActive: Bool,
+        hasContainer: Bool
+    ) -> Bool {
+        guard enabled else { return false }
+        // Caller opt-in. The engine consumer clears this for tool-bearing
+        // requests: the fast path is greedy text-only and cannot reproduce the
+        // engine's raw-text tool-call contract (`container.generate` may parse a
+        // call into a `.toolCall` event, CONSUMING the text — see the runner's
+        // `.toolCall` handling), so tool requests must stay on the engine path.
+        guard allowFastPath else { return false }
+        // Family gate: only Gemma-4 is profiled + validated for this bypass, and
+        // its greedy / EOS behavior is only known-good there. Every other family
+        // (different EOS sets, tool/stop conventions) defers to the batched engine.
+        guard modelId.lowercased().contains("gemma") else { return false }
+        // KV quantization: batched-engine admission reserves at the REDUCED
+        // (quantized) per-token KV rate, but `ModelContainer.generate` allocates a
+        // full fp16 KV cache. A fast-path reservation sized at the quantized rate
+        // would under-count ~2x and risk a unified-memory OOM, so whenever KV
+        // quant is active we defer to the engine (which owns the quantized cache).
+        guard !kvQuantEnabled else { return false }
+        // Pure greedy only: temperature 0 and no nucleus / top-k truncation.
+        // (minP / repetition / presence / frequency penalties are not part of
+        // the tokenized submit surface, so temperature + topP + topK fully
+        // characterize "greedy" here.)
+        guard temperature == 0 else { return false }
+        guard topP == nil || topP == 0 else { return false }
+        guard topK == nil || topK == 0 else { return false }
+        // A seed implies sampling intent; greedy ignores it, but treat its
+        // presence as "not the simple greedy case" and defer to the engine.
+        guard seed == nil else { return false }
+        guard maxTokens > 0 else { return false }
+        // Need a real prompt to prefill (a 0-token prompt has no greedy seed).
+        guard promptTokenCount > 0 else { return false }
+        // Context window: the fast path runs a cold prefill of the WHOLE prompt
+        // and decodes up to `maxTokens` against one fresh cache. If that span
+        // exceeds the model's context window, defer to the engine path — it
+        // enforces context limits and emits the precise context-overflow
+        // rejection. `maxContextLength == 0` ⇒ context unknown ⇒ skip this gate
+        // (the remaining gates, incl. the token-budget guard upstream, still apply).
+        if maxContextLength > 0 {
+            guard promptTokenCount + maxTokens <= maxContextLength else { return false }
+        }
+        // No prefix-cache scope: the fast path runs a cold prefill against a
+        // fresh cache and does not participate in the checkpoint / engine prefix
+        // tiers, so a scoped request keeps the engine path to retain cache reuse.
+        guard cacheScope.isEmpty else { return false }
+        // Exclusive: no other in-flight or queued work. Concurrent batched work
+        // would defeat the single-row assumption (shared GPU + KV headroom).
+        guard activeBridgeCount == 0 else { return false }
+        guard pendingRequestCount == 0 else { return false }
+        // And no OTHER fast-path task already running (explicit single-row gate;
+        // belt-and-suspenders with the activeBridgeCount check, since a running
+        // fast path also holds a bridge).
+        guard !fastPathActive else { return false }
+        // Need a live container to generate against.
+        guard hasContainer else { return false }
+        return true
+    }
+
+    // MARK: - Runner
+
+    /// Drive a single greedy request through `ModelContainer.generate` and
+    /// translate its `Generation` events onto the scheduler's `GenerationEvent`
+    /// stream. Mirrors `runBridge`'s lifecycle (admission / first-token / finish
+    /// bookkeeping and terminal `.info` / `.error` mapping) but sources tokens
+    /// from the single-sequence generator instead of `engine.core.streamOutputs`.
+    ///
+    /// The spawned task is tracked in `fastPathTasks[id]` so `cancel` /
+    /// `cancelAll` / `stopCurrentEngine` can tear it down; it removes its own
+    /// handle on completion. The caller (`submitTokenized`) is responsible for
+    /// having inserted the bridge and reserved KV before this runs, and for
+    /// wiring `continuation.onTermination`.
+    func runGreedyFastPath(
+        requestId id: String,
+        container: ModelContainer,
+        promptTokens: [Int],
+        maxTokens: Int,
+        continuation: AsyncStream<GenerationEvent>.Continuation
+    ) {
+        let scheduler = self
+        let promptCount = promptTokens.count
+        let task = Task {
+            // Token-only input (no media). `MLXArray(promptTokens)` is a cheap
+            // host-side copy; the GPU work happens inside `generate`.
+            let lmInput = LMInput(tokens: MLXArray(promptTokens))
+            // temperature 0 ⇒ ArgMaxSampler. topP/topK/penalties left at their
+            // defaults are inert under greedy. maxTokens bounds the decode.
+            let params = GenerateParameters(maxTokens: maxTokens, temperature: 0)
+
+            // Admission ≈ now: prefill is about to begin. Drives the
+            // pending-timeout predicate and starts the prefill-EWMA window.
+            await scheduler.recordAdmission(requestId: id, at: .now)
+
+            let genStream: AsyncStream<Generation>
+            do {
+                genStream = try await container.generate(input: lmInput, parameters: params)
+            } catch {
+                _ = await scheduler.recordFinish(
+                    requestId: id, promptTokens: promptCount,
+                    completionTokens: 0, success: false)
+                continuation.yield(.error(
+                    "fast path generation failed: \(error.localizedDescription)"))
+                continuation.finish()
+                await scheduler.clearFastPathTask(id)
+                return
+            }
+
+            var sawFirstToken = false
+            // Count every streamed chunk as >= 1 completion token. The terminal
+            // `.info` carries the EXACT generation count, but it only arrives on a
+            // clean finish; on cancellation the loop breaks before it, so without
+            // this running tally `recordFinish` would settle at 0 completion tokens
+            // and the coordinator would bill $0 for work already streamed to the
+            // client. `recordFinish` takes max(observed, terminal), so a clean
+            // finish still uses the exact `.info` count (>= the chunk tally).
+            var streamedTokens = 0
+            var terminalCompletion: Int? = nil
+            var reportedPrompt = promptCount
+            // Defensive: the greedy text-only fast path should never see a parsed
+            // tool call (tool requests are kept on the engine path by the caller's
+            // `allowFastPath` gate). If one is surfaced anyway we cannot faithfully
+            // reproduce the engine's raw-text behavior, so we FAIL rather than drop.
+            var sawToolCall = false
+
+            for await gen in genStream {
+                // Cooperative cancellation: a client cancel / model reload cancels
+                // this task; break and let the finish bookkeeping below run.
+                if Task.isCancelled { break }
+                switch gen {
+                case .chunk(let text):
+                    if !sawFirstToken {
+                        sawFirstToken = true
+                        await scheduler.recordFirstToken(requestId: id, at: .now)
+                    }
+                    streamedTokens += 1
+                    if !text.isEmpty {
+                        continuation.yield(.chunk(text))
+                    }
+                case .info(let info):
+                    reportedPrompt = info.promptTokenCount
+                    terminalCompletion = info.generationTokenCount
+                case .toolCall:
+                    // `container.generate` parsed a tool call (and may have
+                    // CONSUMED its text rather than emitting it as `.chunk`s).
+                    // Silently dropping it would lose the call; the engine path
+                    // emits raw text and never `.toolCall`, so we cannot match it
+                    // here. Mark failure and stop.
+                    sawToolCall = true
+                }
+                if sawToolCall { break }
+            }
+
+            let cancelled = Task.isCancelled
+            // Billing-safe completion count: terminal exact count when present,
+            // otherwise the streamed-chunk lower bound (covers cancel + tool-call
+            // failure, where no `.info` arrived).
+            let completionTokens = max(terminalCompletion ?? 0, streamedTokens)
+            let succeeded = !cancelled && !sawToolCall
+            // Reuse the engine bridge's finish bookkeeping: removes the bridge,
+            // updates the decode + prefill EWMA, releases the KV reservation, and
+            // returns billing-safe usage counts (max of observed vs. terminal).
+            let usage = await scheduler.recordFinish(
+                requestId: id,
+                promptTokens: reportedPrompt,
+                completionTokens: completionTokens,
+                success: succeeded)
+
+            // Emit delivered usage (so a listener can bill partial work) before
+            // any terminal error, mirroring the engine bridge.
+            if !succeeded, usage.promptTokens > 0 || usage.completionTokens > 0 {
+                continuation.yield(.info(
+                    promptTokens: usage.promptTokens,
+                    completionTokens: usage.completionTokens,
+                    tokensPerSecond: usage.tps))
+            }
+            if cancelled {
+                continuation.yield(.error("request cancelled"))
+            } else if sawToolCall {
+                continuation.yield(.error(
+                    "fast path does not support tool calls; please retry"))
+            } else {
+                continuation.yield(.info(
+                    promptTokens: usage.promptTokens,
+                    completionTokens: usage.completionTokens,
+                    tokensPerSecond: usage.tps))
+            }
+            continuation.finish()
+            await scheduler.clearFastPathTask(id)
+        }
+        fastPathTasks[id] = task
+    }
+
+    // MARK: - Task tracking / teardown
+
+    /// Remove a finished fast-path task handle. Called by the task itself on
+    /// completion. Safe for an unknown id.
+    func clearFastPathTask(_ id: String) {
+        fastPathTasks.removeValue(forKey: id)
+    }
+
+    /// Cancel the in-flight fast-path task for `id`, if any. Returns true when a
+    /// task existed and was cancelled. The task observes `Task.isCancelled`,
+    /// runs its finish bookkeeping (KV release, bridge removal, terminal events)
+    /// and clears its own handle.
+    @discardableResult
+    func cancelFastPathTask(_ id: String) -> Bool {
+        guard let task = fastPathTasks[id] else { return false }
+        task.cancel()
+        return true
+    }
+
+    /// Cancel every in-flight fast-path task (model reload / `cancelAll`). Each
+    /// task self-removes; callers that also clear `fastPathTasks` (e.g.
+    /// `stopCurrentEngine`) make late `clearFastPathTask` calls harmless no-ops.
+    func cancelAllFastPathTasks() {
+        for task in fastPathTasks.values { task.cancel() }
+    }
+
+    /// Cancel AND fence every in-flight fast-path task — used by
+    /// `stopCurrentEngine` before it nil's `modelContainer` and clears the MLX
+    /// cache. Unlike the engine (which is fenced by `stopAndWait`), a fast-path
+    /// task runs off-engine inside `ModelContainer.generate`, holding and running
+    /// GPU work against the model + its KV cache. If teardown freed that state
+    /// while a task were still mid-`generate`, it could touch released model/MLX
+    /// state. Awaiting each task's value blocks until it has observed
+    /// cancellation, run its finish bookkeeping (KV release + bridge removal +
+    /// terminal events) and dropped its model/iterator references.
+    ///
+    /// The handles are snapshotted first so a self-removing `clearFastPathTask`
+    /// during the awaits cannot mutate the collection being iterated. The await
+    /// suspends the actor so those actor-isolated callbacks make progress; no NEW
+    /// fast path can start meanwhile because `stopCurrentEngine` has already
+    /// nil'd `engine` (every submit path short-circuits on a nil engine).
+    /// Idempotent: a no-op when nothing is in flight.
+    func waitForFastPathTasks() async {
+        let inflight = Array(fastPathTasks.values)
+        guard !inflight.isEmpty else { return }
+        for task in inflight { task.cancel() }
+        for task in inflight { await task.value }
+        fastPathTasks.removeAll()
+    }
+}
+
+// MARK: - Test support
+//
+// Internal + `@testable`-only; dead-code-stripped from production binaries.
+
+extension BatchScheduler {
+    /// Force the B=1 fast-path enablement gate on/off, bypassing the env flags.
+    /// `nil` restores env-driven behavior. Lets a benchmark A/B the fast path vs.
+    /// the batched engine in a single process (mutating `ProcessInfo`'s cached
+    /// environment mid-run is unreliable).
+    func _setForceB1FastPathForTest(_ value: Bool?) {
+        _forceB1FastPathForTest = value
+    }
+
+    /// Test accessor: number of in-flight fast-path tasks currently tracked.
+    func _fastPathTaskCountForTest() -> Int { fastPathTasks.count }
+}


🔴 [CRITICAL] 🧩 370-line fast path implementation with single consumer adds significant complexity

💡 Suggestion: Consider if this optimization justifies the maintenance burden - the fast path duplicates engine functionality, adds new failure modes, and requires careful coordination with the existing batched engine

📊 Score: 4×4 = 16 · Category: over-abstraction

vercel Bot deployed to Preview – d-inference-console-ui-dev June 26, 2026 06:13 View deployment

vercel Bot deployed to Preview – d-inference June 26, 2026 06:13 View deployment

Gajesh2007 force-pushed the feat/gemma4-vmlx-decode-port branch from 269e7fc to 12c7fed Compare June 26, 2026 06:17

vercel Bot deployed to Preview – d-inference-landing June 26, 2026 06:17 View deployment

vercel Bot deployed to Preview – d-inference-console-ui-dev June 26, 2026 06:17 View deployment

vercel Bot deployed to Preview – d-inference June 26, 2026 06:17 View deployment

chatgpt-codex-connector Bot reviewed Jun 26, 2026

View reviewed changes

Gajesh2007 force-pushed the feat/gemma4-vmlx-decode-port branch from 12c7fed to 182b719 Compare June 26, 2026 06:23

vercel Bot deployed to Preview – d-inference-landing June 26, 2026 06:24 View deployment

vercel Bot deployed to Preview – d-inference-console-ui-dev June 26, 2026 06:24 View deployment

vercel Bot deployed to Preview – d-inference June 26, 2026 06:24 View deployment

ethenotethan reviewed Jun 26, 2026

View reviewed changes

Comment thread provider-swift/Tests/ProviderCoreTests/Gemma4DecodeProfileTests.swift

Gajesh2007 force-pushed the feat/gemma4-vmlx-decode-port branch from 182b719 to e0d0c75 Compare June 26, 2026 06:35

Gajesh2007 changed the base branch from feat/add-libs-mlx-submodule to master June 26, 2026 06:35

Gajesh2007 requested a deployment to benchmarks June 26, 2026 06:35 — with GitHub Actions Waiting

vercel Bot deployed to Preview – d-inference-landing June 26, 2026 06:35 View deployment

vercel Bot deployed to Preview – d-inference June 26, 2026 06:36 View deployment

vercel Bot deployed to Preview – d-inference-console-ui-dev June 26, 2026 06:36 View deployment

ethenotethan reviewed Jun 26, 2026

View reviewed changes

Comment thread provider-swift/Tests/ProviderCoreTests/Gemma4DecodeProfileTests.swift

chatgpt-codex-connector Bot reviewed Jun 26, 2026

View reviewed changes

ethenotethan reviewed Jun 26, 2026

View reviewed changes

Comment thread provider-swift/Tests/ProviderCoreTests/Gemma4DecodeProfileTests.swift

feat(provider): bump Gemma4 decode stack to submodule main

a3c9725

Gajesh2007 force-pushed the feat/gemma4-vmlx-decode-port branch from e0d0c75 to a3c9725 Compare June 26, 2026 16:11

Gajesh2007 requested a deployment to benchmarks June 26, 2026 16:11 — with GitHub Actions Waiting

vercel Bot deployed to Preview – d-inference-landing June 26, 2026 16:11 View deployment

vercel Bot deployed to Preview – d-inference-console-ui-dev June 26, 2026 17:43 View deployment

ethenotethan requested changes Jun 26, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jun 26, 2026

View reviewed changes

ethenotethan reviewed Jun 26, 2026

View reviewed changes

perf(provider): add B1 greedy fast path

6819a45

Gajesh2007 requested a deployment to benchmarks June 26, 2026 18:25 — with GitHub Actions Waiting

vercel Bot deployed to Preview – d-inference-landing June 26, 2026 18:25 View deployment

vercel Bot deployed to Preview – d-inference-console-ui-dev June 26, 2026 18:25 View deployment

vercel Bot deployed to Preview – d-inference June 26, 2026 18:25 View deployment

chatgpt-codex-connector Bot reviewed Jun 26, 2026

View reviewed changes

ethenotethan requested changes Jun 26, 2026

View reviewed changes

fix(provider): reserve Gemma4 video KV and guard live metallib

8e4f59a

Gajesh2007 requested a deployment to benchmarks June 26, 2026 19:13 — with GitHub Actions Waiting

vercel Bot deployed to Preview – d-inference-landing June 26, 2026 19:14 View deployment

vercel Bot deployed to Preview – d-inference-console-ui-dev June 26, 2026 19:14 View deployment

vercel Bot deployed to Preview – d-inference June 26, 2026 19:14 View deployment

ethenotethan requested changes Jun 26, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jun 26, 2026

View reviewed changes

ethenotethan requested changes Jun 26, 2026

View reviewed changes

fix(provider): harden B1 greedy fast path

1231a12

Gajesh2007 requested a deployment to benchmarks June 26, 2026 20:24 — with GitHub Actions Waiting

vercel Bot deployed to Preview – d-inference-landing June 26, 2026 20:24 View deployment

vercel Bot deployed to Preview – d-inference June 26, 2026 20:24 View deployment

vercel Bot deployed to Preview – d-inference-console-ui-dev June 26, 2026 20:24 View deployment

Gajesh2007 merged commit 9ee24d9 into master Jun 26, 2026
13 of 14 checks passed

ethenotethan requested changes Jun 26, 2026

View reviewed changes

		@@ -1 +1 @@
		Subproject commit 2b4b0d8dd123b7fde4f8f5fbcfd4fe8e3e4fe242
		Subproject commit 461a0ab01c42fdae4d0b400259203ae3c7009224

		@@ -1 +1 @@
		Subproject commit 3c50ad693a7a3fbfbcc1d7ddf834a37e9ee8cf14
		Subproject commit 3b6af6e26b009995e127c9c1921f185c54aed379

		@@ -1 +1 @@
		Subproject commit b1df0f22424f94005685baf79192ff08e9c43eb9
		Subproject commit 7ce9e45b40add0d6e66bc69aeed6bfb896009e04

		@@ -1 +1 @@
		Subproject commit b1df0f22424f94005685baf79192ff08e9c43eb9
		Subproject commit e49e0ff0e43aee0e9e554e1a9120f62f00bcf09c

Uh oh!

Conversation

Gajesh2007 commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Results

Before

After

Verification

Notes

Uh oh!

vercel Bot commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

ethenotethan left a comment

Choose a reason for hiding this comment

Automated Code Review — Layr-Labs/d-inference#

Security — ✅ No issues found

Performance — 1 finding(s)

Type_diligence — ✅ No issues found

Additive_complexity — ✅ No issues found

Uh oh!

ethenotethan left a comment

Choose a reason for hiding this comment

Automated Code Review — Layr-Labs/d-inference#

Security — ✅ No issues found

Performance — 1 finding(s)

Type_diligence — ✅ No issues found

Additive_complexity — ✅ No issues found

Uh oh!

Uh oh!

github-actions Bot commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Trust Boundaries Touched

Threat Analysis

New Attack Surface Not Covered by an Existing Threat

Open Findings Resolved

Summary

Uh oh!

ethenotethan left a comment

Choose a reason for hiding this comment

Automated Code Review — Layr-Labs/d-inference#

Security — ✅ No issues found

Performance — 1 finding(s)

Type_diligence — ✅ No issues found

Additive_complexity — ✅ No issues found

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

Gajesh2007 commented Jun 26, 2026

Provider-stack Gemma4 benchmark

Architectural Optimization of Sparse Mixture-of-Experts Inference on Apple Silicon\n\nThe deployment of Sparse Mixture-of-Experts (SMoE) models, such as Mixtral 8x7B, presents a unique computational paradox on Unified Memory Architecture (UMA) systems like Apple Silicon...

Uh oh!

ethenotethan left a comment

Choose a reason for hiding this comment

Automated Code Review — Layr-Labs/d-inference#

Security — ✅ No issues found

Performance — 1 finding(s)

Gajesh2007 commented Jun 26, 2026 •

edited

Loading

vercel Bot commented Jun 26, 2026 •

edited

Loading

github-actions Bot commented Jun 26, 2026 •

edited

Loading