feat(provider): adopt Gemma4 vMLX decode stack #470
269e7fc to
12c7fed
Compare
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 269e7fc751
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if LiveInferenceFixtures.ensureMetallibColocated() == nil { | ||
| Issue.record("mlx.metallib not found near test bundle or in MLX_METALLIB_PATH/SOURCE") | ||
| } |
Return when metallib bootstrap fails
When this live profile is enabled on a machine without mlx.metallib colocated, the test records an issue but then continues into MLX setup/model execution. LiveInferenceFixtures.ensureMetallibColocated() is documented and used elsewhere as a guard because MLX can crash on the first GPU call without the metallib, so this intended skip/failure path can abort the whole filtered benchmark run instead of exiting cleanly.
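The early-return shape this comment asks for can be sketched as follows. This is a minimal sketch, not the test's actual code: `runLiveProfile`, its `record` callback, and the `metallib` parameter are hypothetical stand-ins for the test body, `Issue.record`, and the result of `LiveInferenceFixtures.ensureMetallibColocated()`.

```swift
import Foundation

/// Hypothetical wrapper: runs `body` only when a metallib location is known.
/// `metallib` stands in for the result of the real bootstrap helper, and
/// `record` stands in for the test framework's issue recorder.
func runLiveProfile(metallib: URL?, record: (String) -> Void, body: () -> Void) {
    guard metallib != nil else {
        record("mlx.metallib not found near test bundle or in MLX_METALLIB_PATH/SOURCE")
        return  // stop here so no GPU call is reached without the metallib
    }
    body()
}
```

The point of the `guard` is that recording the issue and returning happen together, so the failure path cannot fall through into model setup.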
| @@ -1 +1 @@ | |||
| Subproject commit 2b4b0d8dd123b7fde4f8f5fbcfd4fe8e3e4fe242 | |||
| Subproject commit 461a0ab01c42fdae4d0b400259203ae3c7009224 | |||
Restore Gemma4 video preprocessing
For Gemma4 video_url requests, this submodule bump drops the processor path that consumed input.videos: the previous Gemma4.swift built LMInput.ProcessedVideo and expanded <|video|> placeholders, while the new Gemma4.swift has no input.videos/ProcessedVideo handling. VLMRequestInference.buildUserInput still attaches videos and routes media requests through the container's processor, so advertised Gemma4 video inputs will be sent to generation without frame features (or hit placeholder/feature mismatches) rather than being answered from the video.
12c7fed to
182b719
Compare
ethenotethan
left a comment
Automated Code Review — Layr-Labs/d-inference#
Verdict: COMMENT
Security — ✅ No issues found
Performance — 1 finding(s)
- 🔵 [INFO] provider-swift/Tests/ProviderCoreTests/Gemma4DecodeProfileTests.swift:93: Array grows in loop without pre-allocation despite reserveCapacity
  - Suggestion: Move reserveCapacity(64) before the loop or use a fixed-size array since the loop count is known (64 iterations)
Type_diligence — ✅ No issues found
Additive_complexity — ✅ No issues found
1 finding(s) total, 0 blocking. Verdict: COMMENT.
🤖 Automated review by Centaur · DAR-186
ethenotethan
left a comment
Automated Code Review — Layr-Labs/d-inference#
Verdict: COMMENT
Security — ✅ No issues found
Performance — 1 finding(s)
- 🔵 [INFO] provider-swift/Tests/ProviderCoreTests/Gemma4DecodeProfileTests.swift:93: Array grows in loop without pre-allocation despite known capacity
  - Suggestion: Move reserveCapacity(64) before the loop or initialize with capacity: var generated = [Int](reserving: 64)
Type_diligence — ✅ No issues found
Additive_complexity — ✅ No issues found
1 finding(s) total, 0 blocking. Verdict: COMMENT.
🤖 Automated review by Centaur · DAR-186
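For reference, the pre-allocation pattern this finding describes can be sketched as below. The function name `collectTokens` and the `next` closure are hypothetical and not taken from `Gemma4DecodeProfileTests.swift`. Note that the Swift standard library's `Array` has no `init(reserving:)` initializer, so calling `reserveCapacity(_:)` on an empty array is the available form.

```swift
/// Collects `count` tokens, reserving storage once before the loop so the
/// appends do not trigger repeated reallocation.
/// `next` is a hypothetical stand-in for the per-step decode call.
func collectTokens(count: Int, next: (Int) -> Int) -> [Int] {
    var generated: [Int] = []
    generated.reserveCapacity(count)  // reserve up front, before appending
    for step in 0..<count {
        generated.append(next(step))
    }
    return generated
}

// Usage with a dummy generator in place of a model step.
let tokens = collectTokens(count: 64) { $0 * 2 }
```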
182b719 to
e0d0c75
Compare
This PR introduces a performance-oriented B=1 greedy fast-path for single exclusive inference requests; the security posture is neutral-to-slightly-positive with one new attack surface item worth tracking.
Trust Boundaries Touched
Threat Analysis
- T-007 / T-027: Model weight substitution / manipulated outputs
- T-008: Provider sends plaintext SSE chunks on encryption failure (SEC-016)
- T-028: Residual inference data in GPU memory
- T-041: Cross-tenant prefix-cache sharing / TTFT timing oracle
- T-010: Cancellation not propagated to inference engine
New Attack Surface Not Covered by an Existing Threat
Env-flag side-channel for operational state inference (
(
This error string leaks to the consumer that the provider is currently processing exactly one exclusive greedy request. Under the threat model (ADV-002, malicious consumer), a consumer who can probe retry behavior could infer:
Recommendation: use a generic retryable error string (e.g. the existing
Open Findings Resolved
None. SEC-007, SEC-016, SEC-035 remain open and are untouched by this PR.
Summary
The GPU teardown fence (
🔐 Threat model:
ethenotethan
left a comment
Automated Code Review — Layr-Labs/d-inference#
Verdict: COMMENT
Security — ✅ No issues found
Performance — 1 finding(s)
- 🔵 [INFO] provider-swift/Tests/ProviderCoreTests/Gemma4DecodeProfileTests.swift:94: Array grows in loop without pre-allocation despite reserveCapacity
  - Suggestion: Move reserveCapacity(64) before the loop or use a fixed-size array since the loop count is known (64 iterations)
Type_diligence — ✅ No issues found
Additive_complexity — ✅ No issues found
1 finding(s) total, 0 blocking. Verdict: COMMENT.
🤖 Automated review by Centaur · DAR-186
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e0d0c75412
| @@ -1 +1 @@ | |||
| Subproject commit 3c50ad693a7a3fbfbcc1d7ddf834a37e9ee8cf14 | |||
| Subproject commit 3b6af6e26b009995e127c9c1921f185c54aed379 | |||
Keep mlx-swift compatible with the pinned Swift toolchain
This gitlink now points the provider's local ../libs/mlx-swift dependency at a commit whose Package.swift declares // swift-tools-version: 6.3;(experimentalCGen), while the repo still documents and pins Swift 6.1 in mise.toml and provider-swift/Package.swift. Developers following the documented mise install && make provider-build path, and any CI/release runner with Swift <6.3, will fail during package loading with a minimum tools-version error before any provider code compiles; either keep this submodule on a Swift-6.1-compatible commit or update the repo/toolchain/workflows together.
| @@ -1 +1 @@ | |||
| Subproject commit b1df0f22424f94005685baf79192ff08e9c43eb9 | |||
| Subproject commit 7ce9e45b40add0d6e66bc69aeed6bfb896009e04 | |||
Preserve Gemma4's bidirectional vision mask
For Gemma4 image_url requests, this submodule bump removes the VLM text path that built a visualTokenMask and passed it as imageTokenMask so use_bidirectional_attention == "vision" could overlay bidirectional attention within image-token spans; the new Gemma4 path has no imageTokenMask/useBidirectionalAttention handling and always builds ordinary causal masks. That means earlier soft image tokens cannot attend to later patches during prefill, corrupting the visual features the language model consumes even though image inputs are still accepted.
| @@ -1 +1 @@ | |||
| Subproject commit b1df0f22424f94005685baf79192ff08e9c43eb9 | |||
| Subproject commit 7ce9e45b40add0d6e66bc69aeed6bfb896009e04 | |||
Keep Gemma4 visual delimiter tokens
For Gemma4 image prompts, the previous processor expanded each <|image|> placeholder as boi + image_token * count + eoi, matching the model config's boi_token_id/eoi_token_id; this submodule bump replaces that with only repeated image tokens and the new config no longer reads those delimiter ids. Requests that include image_url are therefore still routed to Gemma4 but the visual span is missing the boundary tokens the model was trained to see, so image understanding can degrade or fail even when the feature count matches.
Provider-stack Gemma4 benchmark
Ran a release executable depending on
Notes:
Excerpt:
ethenotethan
left a comment
Automated Code Review — Layr-Labs/d-inference#
Verdict: COMMENT
Security — ✅ No issues found
Performance — 1 finding(s)
- 🔵 [INFO] provider-swift/Tests/ProviderCoreTests/Gemma4DecodeProfileTests.swift:94: Array grows in loop without pre-allocation despite reserveCapacity
  - Suggestion: Move reserveCapacity(64) before the loop or use a fixed-size array since the loop count is known (64 iterations)
Type_diligence — ✅ No issues found
Additive_complexity — ✅ No issues found
1 finding(s) total, 0 blocking. Verdict: COMMENT.
🤖 Automated review by Centaur · DAR-186
e0d0c75 to
a3c9725
Compare
ethenotethan
left a comment
Automated Code Review — Layr-Labs/d-inference#
Verdict: REQUEST_CHANGES
Security — ✅ No issues found
Performance — 2 finding(s) (1 blocking)
- 🔵 [INFO] provider-swift/Tests/ProviderCoreTests/ContinuousBatchingLiveTests.swift:235: Array.prefix(while:) creates new array copies in hot path
  - Suggestion: Use iterators or indices to avoid copying arrays when checking for EOS tokens and computing prefix matches
- 🟡 [MEDIUM] provider-swift/Tests/ProviderCoreTests/Gemma4DecodeProfileTests.swift:94-95: Array grows in loop without pre-allocation despite reserveCapacity
  - Suggestion: Use generated.append(token) directly since capacity is already reserved, or pre-allocate with Array(repeating:count:) and use index-based assignment
Type_diligence — ✅ No issues found
Additive_complexity — ✅ No issues found
2 finding(s) total, 1 blocking. Verdict: REQUEST_CHANGES.
🤖 Automated review by Centaur · DAR-186
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e0b67c5f45
| @@ -1 +1 @@ | |||
| Subproject commit b1df0f22424f94005685baf79192ff08e9c43eb9 | |||
| Subproject commit e49e0ff0e43aee0e9e554e1a9120f62f00bcf09c | |||
Reserve Gemma video KV for every sampled frame
With this submodule bump, Gemma4 video preprocessing now samples up to 32 frames and expands each <|video|> to config.imageSeqLength soft tokens per frame, but the provider still reserves a flat 4,096 KV tokens per video in VLMRequestInference.projectedKVTokens (visionTokensPerVideo and the .videoURL branch). For Gemma configs with ~256/280 soft tokens per frame, a full 32-frame clip needs roughly 8–9k vision tokens before text/output, so concurrent video requests can pass the 90% KV cap with about half their actual prompt KV unreserved and then OOM or destabilize the provider during generation.
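The arithmetic behind the "roughly 8–9k" figure can be worked through directly. This is an illustrative calculation using only the numbers quoted in the comment (32 sampled frames, about 256 or 280 soft tokens per frame, a flat 4,096-token reservation); the function name `videoVisionTokens` is hypothetical.

```swift
/// Vision tokens one video contributes: one block of soft tokens per sampled frame.
func videoVisionTokens(sampledFrames: Int, softTokensPerFrame: Int) -> Int {
    sampledFrames * softTokensPerFrame
}

let flatReservation = 4_096   // per-video reservation cited in the comment
let maxFrames = 32            // frame sampling cap cited in the comment

let low = videoVisionTokens(sampledFrames: maxFrames, softTokensPerFrame: 256)
let high = videoVisionTokens(sampledFrames: maxFrames, softTokensPerFrame: 280)

print(low, high)               // 8192 8960
print(low - flatReservation)   // 4096 tokens left unreserved at 256 per frame
```

At 256 soft tokens per frame the flat reservation covers exactly half of a full clip, which matches the "about half" estimate in the comment.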
| ) | ||
| ) | ||
| func gemma4VLMLongContextMixedB3() async throws { | ||
| try ensureMetallibAvailable() |
Stop after missing metallib in live tests
When these new Gemma live tests are enabled on a machine without a colocated mlx.metallib, ensureMetallibAvailable() only records an issue and returns; execution then immediately proceeds into MLX.GPU.set and model loading. LiveInferenceFixtures.ensureMetallibColocated() documents that callers should skip instead of reaching the first GPU call, and the new Gemma4DecodeProfileTests path does return on nil, so these added ContinuousBatching tests can still abort a filtered benchmark run rather than exiting cleanly.
ethenotethan
left a comment
Automated Code Review — Layr-Labs/d-inference#
Verdict: COMMENT
Security — ✅ No issues found
Performance — ✅ No issues found
Type_diligence — ✅ No issues found
Additive_complexity — ✅ No issues found
✅ All four passes clean. No issues found.
🤖 Automated review by Centaur · DAR-186
Provider B=1 fast path benchmark
Added an env-gated B=1 greedy fast path (
Corrected PR pins during this run:
Release provider-stack result (Gemma-4-26B-A4B 8-bit)
Other details:
This closes the B=1 provider path gap to the raw model loop (~75-79 tok/s) while keeping the normal batched engine path as the default unless explicitly enabled by env.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6819a45854
| let kvOutcome = await reserveKVForRequest( | ||
| requestId: id, | ||
| requestTokens: requestBudget, | ||
| reservationTokens: requestBudget, | ||
| restorePlanned: false |
Reserve fp16 KV for the fast path
When the B=1 path is enabled on a KV-quantized model, this branch reserves through reserveKVForRequest, which charges kvBytesPerToken for the batched/quantized cache. The request then runs through ModelContainer.generate, the same non-batched path that reserveVisionRequest explicitly accounts with fp16KVBytesPerToken; as a result the fast-path request can hold substantially more KV than it reserved and let later admissions pass the 90% cap, causing OOMs under concurrent traffic. Use the fp16 byte reservation or disable this path when KV quantization is active.
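One way to picture the mismatch is to charge each request at the per-token rate of the cache it will actually use. The sketch below is an assumption-laden illustration, not the scheduler's code: `KVCost`, `DecodePath`, and `reservationBytes` are hypothetical names, and the byte values in the usage lines are made up purely to show the gap.

```swift
/// Hypothetical per-token KV costs; real values depend on the model and
/// on the quantization settings.
struct KVCost {
    let quantizedBytesPerToken: Int
    let fp16BytesPerToken: Int
}

/// Which decode path will hold the KV cache for this request.
enum DecodePath {
    case batchedEngine    // assumed to use the quantized cache
    case directGenerate   // assumed to use an fp16 cache, per the comment above
}

/// Bytes to reserve, charged at the rate of the cache the request will use.
func reservationBytes(tokens: Int, cost: KVCost, path: DecodePath) -> Int {
    switch path {
    case .batchedEngine:
        return tokens * cost.quantizedBytesPerToken
    case .directGenerate:
        return tokens * cost.fp16BytesPerToken
    }
}

// Made-up numbers: the same request needs 4x the bytes on the direct path.
let cost = KVCost(quantizedBytesPerToken: 100, fp16BytesPerToken: 400)
let engineBytes = reservationBytes(tokens: 1_000, cost: cost, path: .batchedEngine)
let directBytes = reservationBytes(tokens: 1_000, cost: cost, path: .directGenerate)
```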
| Task { await scheduler.cancel(requestId: id) } | ||
| } | ||
| } | ||
| return stream |
Keep fast-path requests in the concurrency gate
This returns before planner.admit, so the planner never counts the direct ModelContainer.generate request as active. While that request is decoding, a later non-fast-path submit can still be queued and scheduled because BatchQueuePlanner sees active.count == 0 even when maxConcurrentRequests == 1, which breaks the branch's “single exclusive” assumption and can overlap engine work with the fast path. Either register the fast-path request with the planner or block new engine admissions until it finishes.
| cancelAllFastPathTasks() | ||
| fastPathTasks.removeAll() |
Wait for fast-path tasks before clearing MLX state
On unload/reload or a liveness self-restart during a fast-path generation, this only cancels the task and immediately drops the handle before teardown continues to nil the container and clear MLX memory. Swift cancellation is cooperative, so the task may still be inside container.generate or waiting for its next event while the old model/KV memory is being released, risking crashes or use-after-teardown; keep the handles and await/fence those tasks before clearing model resources.
| let lmInput = LMInput(tokens: MLXArray(promptTokens)) | ||
| // temperature 0 ⇒ ArgMaxSampler. topP/topK/penalties left at their | ||
| // defaults are inert under greedy. maxTokens bounds the decode. | ||
| let params = GenerateParameters(maxTokens: maxTokens, temperature: 0) |
Pass effective EOS tokens through the fast path
When the fast path is enabled for models whose stop tokens are augmented at load time, such as GPT-OSS/Harmony via effectiveEOSTokenIds, this direct ModelContainer.generate call does not receive the same EOS set that the batched Scheduler is built with. Those requests can therefore continue past model-specific return/call/end tokens until maxTokens or leak control tokens, while the engine path would stop correctly; thread the effective EOS set into the fast path or restrict it to models whose container config already contains the full stop set.
| case .chunk(let text): | ||
| if !sawFirstToken { | ||
| sawFirstToken = true | ||
| await scheduler.recordFirstToken(requestId: id, at: .now) | ||
| } |
Count streamed tokens before cancellation
If a client disconnects after receiving chunks but before ModelContainer.generate emits its terminal .info, the cancellation path breaks out with completionTokens still at 0 because chunk handling only records first-token timing. recordFinish then reports/bills zero completion tokens even though text was already delivered; update progress as chunks arrive or otherwise preserve delivered token counts before honoring cancellation.
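A minimal sketch of the accounting this comment asks for: keep a running count as chunks arrive and prefer the terminal count only when it is actually received. `StreamEvent`, `completionTokens`, and the `tokensIn` counter are hypothetical; the real stream type is `Generation`, and how to count tokens per chunk is left as an assumption here.

```swift
/// Simplified stand-in for the generation event stream.
enum StreamEvent {
    case chunk(String)   // streamed text
    case info(Int)       // terminal event carrying the generation token count
}

/// Completion tokens to report for a possibly cancelled stream.
/// `tokensIn` is a hypothetical per-chunk token counter; if the terminal
/// `.info` never arrives, the delivered count is used instead of zero.
func completionTokens(events: [StreamEvent], tokensIn: (String) -> Int) -> Int {
    var delivered = 0
    var reported: Int? = nil
    for event in events {
        switch event {
        case .chunk(let text):
            delivered += tokensIn(text)   // update progress as chunks arrive
        case .info(let count):
            reported = count              // authoritative when present
        }
    }
    return reported ?? delivered
}
```

With this shape, a stream that ends after two chunks and no `.info` reports two delivered units rather than zero.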
| topK: topK, | ||
| seed: seed, | ||
| maxTokens: maxTokens, | ||
| cacheScope: cacheScope | ||
| ) |
Preserve context-window checks on the fast path
When the fast path is enabled, eligibility is decided without the prompt length or maxContextLength, and this branch returns before the planner's maxTokensPerBatch/context-window rejection runs. A greedy request whose prompt is longer than the model context but still fits the memory token budget can therefore be sent directly to ModelContainer.generate instead of producing the deterministic context error the engine path emits, risking runtime failures or malformed output for over-context prompts.
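The ordering the comment argues for can be sketched as a routing function that applies the context check before either path is chosen. Everything here is hypothetical (`Admission`, `route`, and the exact rejection rule); the engine's real check may also account for `maxTokens` or batch limits.

```swift
/// Hypothetical routing outcomes for a submitted request.
enum Admission: Equatable {
    case rejectContext   // deterministic context-window error
    case fastPath
    case engine
}

/// Applies the context-window check first so both paths reject
/// over-context prompts the same way.
/// Assumption: a prompt longer than `maxContextLength` is rejected.
func route(promptTokens: Int, maxContextLength: Int, fastPathEligible: Bool) -> Admission {
    guard promptTokens <= maxContextLength else { return .rejectContext }
    return fastPathEligible ? .fastPath : .engine
}
```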
ethenotethan
left a comment
Automated Code Review — Layr-Labs/d-inference#
Verdict: REQUEST_CHANGES
Security — 1 finding(s)
- 🔵 [INFO] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:47-49: Environment variable values read without validation
  - Suggestion: Validate environment variable values are exactly "1" rather than using == comparison which could match unintended values
Performance — 2 finding(s) (1 blocking)
- 🟡 [MEDIUM] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:139: Synchronous container.generate call blocks the actor on hot path
  - Suggestion: Move the container.generate call outside the actor context or use async/await properly to avoid blocking the scheduler actor
- 🔵 [INFO] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:155-175: Unbounded string accumulation in generation loop
  - Suggestion: Pre-allocate string buffer capacity or use streaming approach to avoid repeated string reallocations during text generation
Type_diligence — ✅ No issues found
Additive_complexity — 1 finding(s) (1 blocking)
- 🟡 [MEDIUM] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:1-279: New 279-line fast path implementation duplicates existing functionality
  - Suggestion: Consider if this optimization justifies the added complexity. The fast path reimplements generation logic that already exists in the batched engine, adding ~300 lines of code and new state tracking for a performance optimization that only applies to single greedy requests.
4 finding(s) total, 2 blocking. Verdict: REQUEST_CHANGES.
🤖 Automated review by Centaur · DAR-186
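On the environment-variable finding: comparing an optional string to `"1"` with `==` in Swift is true only when the variable is set and its value equals `"1"`, so values such as `"true"` or `" 1"` do not enable the flag. The sketch below shows this with the two flag names from the PR; the function `fastPathEnabled(environment:)` and its injected dictionary are hypothetical, introduced so the behavior can be checked without touching the process environment.

```swift
/// Hypothetical pure variant of the env gate: the environment is passed in
/// so the comparison can be exercised directly.
func fastPathEnabled(environment: [String: String]) -> Bool {
    // A missing key yields nil, which never equals "1".
    environment["DARKBLOOM_B1_GREEDY_FAST_PATH"] == "1"
        || environment["DARKBLOOM_GEMMA_B1_FAST_PATH"] == "1"
}

let on = fastPathEnabled(environment: ["DARKBLOOM_B1_GREEDY_FAST_PATH": "1"])
let looseValue = fastPathEnabled(environment: ["DARKBLOOM_B1_GREEDY_FAST_PATH": "true"])
let padded = fastPathEnabled(environment: ["DARKBLOOM_GEMMA_B1_FAST_PATH": " 1"])
let unset = fastPathEnabled(environment: [:])
```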
Review fixes: media KV reservation + live metallib guards
Addressed two provider-side review findings:
Verification:
ethenotethan
left a comment
Automated Code Review — Layr-Labs/d-inference#
Verdict: REQUEST_CHANGES
Security — 1 finding(s)
- 🔵 [INFO] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:47-49: Environment variable values read without validation
  - Suggestion: Validate environment variable values are exactly "1" rather than using string equality check that could match unintended values
Performance — 2 finding(s) (1 blocking)
- 🟡 [MEDIUM] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:158: Synchronous container.generate call blocks the actor on hot path
  - Suggestion: Use async/await pattern or move the blocking operation off the actor's execution context
- 🔵 [INFO] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:225: Dictionary grows without bounds as fast path tasks accumulate
  - Suggestion: Add periodic cleanup or size limits to fastPathTasks dictionary to prevent memory growth
Type_diligence — ✅ No issues found
Additive_complexity — 3 finding(s) (1 blocking)
- 🟡 [MEDIUM] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:1-279: New 279-line fast path adds complexity for marginal performance gain
  - Suggestion: Consider if the 20% TPS improvement justifies adding a parallel execution path with its own task tracking, eligibility logic, and test overrides. The batched engine already handles B=1 cases.
- 🔵 [INFO] provider-swift/Sources/ProviderCore/Inference/BatchScheduler.swift:264-278: Three new state fields for fast path tracking
  - Suggestion: The fastPathTasks dictionary and test override field add state management burden. Consider if the performance benefit outweighs the maintenance cost of parallel execution paths.
- 🔵 [INFO] provider-swift/Tests/ProviderCoreTests/B1GreedyFastPathTests.swift:1-294: 294-line test file for eligibility policy that could be simpler
  - Suggestion: The eligibility logic is a pure function with 9 parameters. Consider if this level of parameterization and testing is proportional to the feature's complexity.
6 finding(s) total, 2 blocking. Verdict: REQUEST_CHANGES.
🤖 Automated review by Centaur · DAR-186
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8e4f59a49d
| case .info(let info): | ||
| reportedPrompt = info.promptTokenCount | ||
| completionTokens = info.generationTokenCount | ||
| case .toolCall: |
When the env-gated fast path handles a default-greedy request that includes tools, this branch drops the .toolCall events produced by ModelContainer.generate. MultiModelBatchSchedulerEngine.streamChatCompletion only builds downstream tool calls from raw .chunk text via BatchedToolStreamHandler, and submitTokenized does not pass a “has tools” flag into eligibility, so tool requests can enter this path and complete with no tool call/content. Either disqualify tool requests from the fast path or translate these events into the caller’s expected stream.
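The first remedy, disqualifying tool requests, amounts to one more conservative term in the eligibility decision. The sketch below is a simplified, hypothetical version of such a pure predicate: `FastPathCandidate`, its fields, and `isFastPathEligible` are illustrative names, and the real `b1FastPathEligiblePure` takes more parameters than shown.

```swift
/// Hypothetical, reduced view of the inputs to the eligibility decision.
struct FastPathCandidate {
    let temperature: Float
    let hasTools: Bool        // the flag the comment says is not passed today
    let activeRequests: Int   // other requests currently in flight
}

/// Conservative gate: any miss defers to the batched engine.
func isFastPathEligible(_ request: FastPathCandidate, enabled: Bool) -> Bool {
    enabled
        && request.temperature == 0    // pure greedy only
        && !request.hasTools           // tool requests stay on the engine path
        && request.activeRequests == 0 // single exclusive request
}
```

Because a false result only routes the request to the batched engine, adding the `hasTools` term can shrink fast-path coverage but cannot change engine behavior.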
ethenotethan
left a comment
Automated Code Review — Layr-Labs/d-inference#
Verdict: REQUEST_CHANGES
Security — 1 finding(s)
- 🔵 [INFO] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:47-49: Environment variable values not validated
  - Suggestion: Validate environment variable values are exactly "1" rather than using string equality check that could match unintended values
Performance — 3 finding(s) (1 blocking)
- 🟡 [MEDIUM] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:139: Synchronous container.generate call blocks the actor on hot path
  - Suggestion: Use async/await pattern or move the blocking operation off the actor's execution context
- 🔵 [INFO] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:155: MLXArray(promptTokens) creates unbounded host-side copy
  - Suggestion: Pre-allocate or stream tokens to avoid large memory copies for very long prompts
- 🔵 [INFO] provider-swift/Sources/ProviderCore/Inference/VLMRequestInference.swift:133: Multiplication computed repeatedly in hot path
  - Suggestion: Pre-compute VLMRequestInference.maxVideoFramesSampled * VLMRequestInference.visionTokensPerImage as a static constant
Type_diligence — ✅ No issues found
Additive_complexity — 2 finding(s) (1 blocking)
- 🟡 [MEDIUM] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:1-279: New 279-line fast path implementation duplicates existing functionality
  - Suggestion: Consider if this optimization justifies the added complexity. The fast path reimplements generation logic that already exists in the batched engine, adding ~300 lines of code and a parallel execution path that must be maintained alongside the existing engine. The performance gain (75 vs 63 TPS) may not justify doubling the inference code paths.
- 🔵 [INFO] provider-swift/Sources/ProviderCore/Inference/BatchScheduler.swift:264-278: Additional state tracking for parallel execution path
  - Suggestion: The fastPathTasks dictionary and _forceB1FastPathForTest flag add state management complexity. Consider if the performance benefit justifies maintaining two separate execution paths with their own lifecycle management.
6 finding(s) total, 2 blocking. Verdict: REQUEST_CHANGES.
🤖 Automated review by Centaur · DAR-186
B=1 fast-path safety hardening
Addressed the B1 fast-path review findings conservatively:
Verification:
ethenotethan
left a comment
Automated Code Review — Layr-Labs/d-inference#
Verdict: REQUEST_CHANGES
Security — ✅ No issues found
Performance — 2 finding(s) (1 blocking)
- 🟡 [MEDIUM] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:206: Synchronous container.generate call blocks actor on hot path
  - Suggestion: Use async/await pattern or move the blocking call off the actor's execution context
- 🔵 [INFO] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:370: Task spawned without explicit cancellation handling in parent context
  - Suggestion: Consider using TaskGroup or structured concurrency to ensure proper task lifecycle management
Type_diligence — ✅ No issues found
Additive_complexity — 1 finding(s) (1 blocking)
- 🔴 [CRITICAL] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:1-370: 370-line fast path implementation duplicates engine functionality with single consumer
  - Suggestion: Consider if this optimization justifies the maintenance burden of duplicating request lifecycle, KV management, and error handling that already exists in the batched engine
3 finding(s) total, 2 blocking. Verdict: REQUEST_CHANGES.
🤖 Automated review by Centaur · DAR-186
ethenotethan
left a comment
Automated Code Review — Layr-Labs/d-inference#
Verdict: REQUEST_CHANGES
Security — ✅ No issues found
Performance — 3 finding(s) (1 blocking)
- 🟡 [MEDIUM] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:206: Synchronous container.generate call blocks actor on hot path
  - Suggestion: Use async/await pattern or move the blocking call off the actor's execution context
- 🔵 [INFO] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:190: MLXArray(promptTokens) creates unbounded allocation in hot path
  - Suggestion: Pre-allocate or reuse MLXArray instances, especially for large prompt token arrays
- 🔵 [INFO] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:340-350: Task spawned without proper resource limits or cancellation timeout
  - Suggestion: Add timeout or resource limits to the spawned Task to prevent unbounded resource usage
Type_diligence — ✅ No issues found
Additive_complexity — 1 finding(s) (1 blocking)
- 🔴 [CRITICAL] provider-swift/Sources/ProviderCore/Inference/BatchScheduler+B1FastPath.swift:1-370: 370-line fast path implementation with single consumer adds significant complexity
  - Suggestion: Consider if this optimization justifies the maintenance burden: the fast path duplicates engine functionality, adds new failure modes, and requires careful coordination with the existing batched engine
4 finding(s) total, 2 blocking. Verdict: REQUEST_CHANGES.
🤖 Automated review by Centaur · DAR-186
| await scheduler.recordAdmission(requestId: id, at: .now) | ||
|
|
||
| let genStream: AsyncStream<Generation> | ||
| do { |
🟡 [MEDIUM] ⚡ Synchronous container.generate call blocks actor on hot path
💡 Suggestion: Use async/await pattern or move the blocking call off the actor's execution context
📊 Score: 3×4 = 12 · Category: blocking_io
| promptTokens: [Int], | ||
| maxTokens: Int, | ||
| continuation: AsyncStream<GenerationEvent>.Continuation | ||
| ) { |
🔵 [INFO] ⚡ MLXArray(promptTokens) creates unbounded allocation in hot path
💡 Suggestion: Pre-allocate or reuse MLXArray instances, especially for large prompt token arrays
📊 Score: 2×3 = 6 · Category: unbounded_allocations
| /// The handles are snapshotted first so a self-removing `clearFastPathTask` | ||
| /// during the awaits cannot mutate the collection being iterated. The await | ||
| /// suspends the actor so those actor-isolated callbacks make progress; no NEW | ||
| /// fast path can start meanwhile because `stopCurrentEngine` has already | ||
| /// nil'd `engine` (every submit path short-circuits on a nil engine). | ||
| /// Idempotent: a no-op when nothing is in flight. | ||
| func waitForFastPathTasks() async { | ||
| let inflight = Array(fastPathTasks.values) | ||
| guard !inflight.isEmpty else { return } | ||
| for task in inflight { task.cancel() } | ||
| for task in inflight { await task.value } |
🔵 [INFO] ⚡ Task spawned without proper resource limits or cancellation timeout
💡 Suggestion: Add timeout or resource limits to the spawned Task to prevent unbounded resource usage
📊 Score: 2×3 = 6 · Category: unbounded_goroutines
| // Copyright © 2026 Eigen Labs. | ||
| // | ||
| // BatchScheduler B=1 greedy fast path: an env-gated bypass of the | ||
| // continuous-batching engine for a single exclusive, greedy (temperature == 0) | ||
| // request. | ||
| // | ||
| // The batched engine carries continuous-batching overhead a single-row decode | ||
| // does not need: batch tensor (de)allocation per step, the scheduler step loop, | ||
| // the output collector, and cross-thread `RequestOutput` streaming. On Gemma-4 | ||
| // that overhead shows up as ~63 TPS through `BatchedEngine` vs ~75 TPS for the | ||
| // raw single-sequence loop (see `Tests/.../Gemma4DecodeProfileTests.swift`). | ||
| // | ||
| // When exactly one request is in flight and it is pure greedy, we run that | ||
| // single-sequence decode through `ModelContainer.generate` — the SAME | ||
| // concurrency-safe path the VLM media path (`VLMRequestInference`) already uses | ||
| // alongside the engine. `ModelContainer.generate` holds the container | ||
| // exclusively only for the prefill, then streams the decode (asyncEval | ||
| // pipelined) on its own task. We translate its `Generation` events to our | ||
| // `GenerationEvent` stream. | ||
| // | ||
| // Safety posture: | ||
| // * OFF by default; opt in with an env flag. | ||
| // * Conservative gate — anything that isn't a single exclusive greedy text | ||
| // request falls back to the batched engine, so the engine path's behavior | ||
| // is never altered. | ||
| // * KV byte budget is still reserved/released, and bridge bookkeeping | ||
| // (heartbeats, decode/prefill EWMA, billing-safe usage) is preserved via | ||
| // the SAME `recordAdmission` / `recordFirstToken` / `recordFinish` methods | ||
| // the engine bridge uses. | ||
| // * The in-flight task is tracked so `cancel` / `cancelAll` / | ||
| // `stopCurrentEngine` tear it down deterministically. | ||
|
|
||
| import Foundation | ||
| import MLX | ||
| import MLXLMCommon | ||
|
|
||
| extension BatchScheduler { | ||
|
|
||
| // MARK: - Env gate | ||
|
|
||
| /// True when the operator opted into the B=1 greedy fast path. Two flags are | ||
| /// accepted: `DARKBLOOM_B1_GREEDY_FAST_PATH` (generic) and | ||
| /// `DARKBLOOM_GEMMA_B1_FAST_PATH` (Gemma-targeted alias). Either set to `"1"` | ||
| /// enables it. Read per-call (cheap) so tests can toggle it via the | ||
| /// environment without restarting the scheduler. | ||
| static func b1GreedyFastPathEnabled() -> Bool { | ||
| let env = ProcessInfo.processInfo.environment | ||
| return env["DARKBLOOM_B1_GREEDY_FAST_PATH"] == "1" | ||
| || env["DARKBLOOM_GEMMA_B1_FAST_PATH"] == "1" | ||
| } | ||
|
|
||
| // MARK: - Eligibility | ||
|
|
||
| /// Whether this request can take the single-exclusive greedy fast path. | ||
| /// | ||
| /// MUST be evaluated BEFORE the request's own bridge is inserted into | ||
| /// `activeBridges` — the exclusivity check reads `activeBridges.count`. | ||
| /// Every condition is conservative: a miss simply defers to the batched | ||
| /// engine, so this can only shrink the set of requests the fast path serves, | ||
| /// never change the engine path's correctness. The decision itself is a pure | ||
| /// function (`b1FastPathEligiblePure`) so it can be unit-tested exhaustively | ||
| /// without a loaded model. | ||
| func b1FastPathEligible( | ||
| temperature: Float, | ||
| topP: Float?, | ||
| topK: Int?, | ||
| seed: UInt64?, | ||
| promptTokenCount: Int, | ||
| maxTokens: Int, | ||
| cacheScope: String, | ||
| allowFastPath: Bool | ||
| ) -> Bool { | ||
| Self.b1FastPathEligiblePure( | ||
| // Test override wins when set; otherwise consult the env flags. | ||
| enabled: _forceB1FastPathForTest ?? Self.b1GreedyFastPathEnabled(), | ||
| allowFastPath: allowFastPath, | ||
| modelId: modelId, | ||
| kvQuantEnabled: kvQuantEnabled, | ||
| temperature: temperature, | ||
| topP: topP, | ||
| topK: topK, | ||
| seed: seed, | ||
| promptTokenCount: promptTokenCount, | ||
| maxTokens: maxTokens, | ||
| maxContextLength: maxContextLength, | ||
| cacheScope: cacheScope, | ||
| activeBridgeCount: activeBridges.count, | ||
| pendingRequestCount: pendingRequestCount, | ||
| fastPathActive: !fastPathTasks.isEmpty, | ||
| hasContainer: modelContainer != nil | ||
| ) | ||
| } | ||
|
|
||
| /// Pure eligibility policy for the B=1 greedy fast path. No actor state — all | ||
| /// inputs are parameters — so it is fully unit-testable. Order is irrelevant | ||
| /// to the result (all conditions must hold), but kept cheapest-first. | ||
| static func b1FastPathEligiblePure( | ||
| enabled: Bool, | ||
| allowFastPath: Bool, | ||
| modelId: String, | ||
| kvQuantEnabled: Bool, | ||
| temperature: Float, | ||
| topP: Float?, | ||
| topK: Int?, | ||
| seed: UInt64?, | ||
| promptTokenCount: Int, | ||
| maxTokens: Int, | ||
| maxContextLength: Int, | ||
| cacheScope: String, | ||
| activeBridgeCount: Int, | ||
| pendingRequestCount: Int, | ||
| fastPathActive: Bool, | ||
| hasContainer: Bool | ||
| ) -> Bool { | ||
| guard enabled else { return false } | ||
| // Caller opt-in. The engine consumer clears this for tool-bearing | ||
| // requests: the fast path is greedy text-only and cannot reproduce the | ||
| // engine's raw-text tool-call contract (`container.generate` may parse a | ||
| // call into a `.toolCall` event, CONSUMING the text — see the runner's | ||
| // `.toolCall` handling), so tool requests must stay on the engine path. | ||
| guard allowFastPath else { return false } | ||
| // Family gate: only Gemma-4 is profiled + validated for this bypass, and | ||
| // its greedy / EOS behavior is only known-good there. Every other family | ||
| // (different EOS sets, tool/stop conventions) defers to the batched engine. | ||
| // NOTE: this is a substring match on "gemma", so it admits any model id | ||
| // containing "gemma", not only Gemma-4 ids. | ||
| guard modelId.lowercased().contains("gemma") else { return false } | ||
| // KV quantization: batched-engine admission reserves at the REDUCED | ||
| // (quantized) per-token KV rate, but `ModelContainer.generate` allocates a | ||
| // full fp16 KV cache. A fast-path reservation sized at the quantized rate | ||
| // would under-count ~2x and risk a unified-memory OOM, so whenever KV | ||
| // quant is active we defer to the engine (which owns the quantized cache). | ||
| guard !kvQuantEnabled else { return false } | ||
| // Pure greedy only: temperature 0 and no nucleus / top-k truncation. | ||
| // (minP / repetition / presence / frequency penalties are not part of | ||
| // the tokenized submit surface, so temperature + topP + topK fully | ||
| // characterize "greedy" here.) | ||
| guard temperature == 0 else { return false } | ||
| guard topP == nil || topP == 0 else { return false } | ||
| guard topK == nil || topK == 0 else { return false } | ||
| // A seed implies sampling intent; greedy ignores it, but treat its | ||
| // presence as "not the simple greedy case" and defer to the engine. | ||
| guard seed == nil else { return false } | ||
| guard maxTokens > 0 else { return false } | ||
| // Need a real prompt to prefill (a 0-token prompt has no greedy seed). | ||
| guard promptTokenCount > 0 else { return false } | ||
| // Context window: the fast path runs a cold prefill of the WHOLE prompt | ||
| // and decodes up to `maxTokens` against one fresh cache. If that span | ||
| // exceeds the model's context window, defer to the engine path — it | ||
| // enforces context limits and emits the precise context-overflow | ||
| // rejection. `maxContextLength == 0` ⇒ context unknown ⇒ skip this gate | ||
| // (the remaining gates, incl. the token-budget guard upstream, still apply). | ||
| if maxContextLength > 0 { | ||
| guard promptTokenCount + maxTokens <= maxContextLength else { return false } | ||
| } | ||
| // No prefix-cache scope: the fast path runs a cold prefill against a | ||
| // fresh cache and does not participate in the checkpoint / engine prefix | ||
| // tiers, so a scoped request keeps the engine path to retain cache reuse. | ||
| guard cacheScope.isEmpty else { return false } | ||
| // Exclusive: no other in-flight or queued work. Concurrent batched work | ||
| // would defeat the single-row assumption (shared GPU + KV headroom). | ||
| guard activeBridgeCount == 0 else { return false } | ||
| guard pendingRequestCount == 0 else { return false } | ||
| // And no OTHER fast-path task already running (explicit single-row gate; | ||
| // belt-and-suspenders with the activeBridgeCount check, since a running | ||
| // fast path also holds a bridge). | ||
| guard !fastPathActive else { return false } | ||
| // Need a live container to generate against. | ||
| guard hasContainer else { return false } | ||
| return true | ||
| } | ||
|
|
||
| // MARK: - Runner | ||
|
|
||
| /// Drive a single greedy request through `ModelContainer.generate` and | ||
| /// translate its `Generation` events onto the scheduler's `GenerationEvent` | ||
| /// stream. Mirrors `runBridge`'s lifecycle (admission / first-token / finish | ||
| /// bookkeeping and terminal `.info` / `.error` mapping) but sources tokens | ||
| /// from the single-sequence generator instead of `engine.core.streamOutputs`. | ||
| /// | ||
| /// The spawned task is tracked in `fastPathTasks[id]` so `cancel` / | ||
| /// `cancelAll` / `stopCurrentEngine` can tear it down; it removes its own | ||
| /// handle on completion. The caller (`submitTokenized`) is responsible for | ||
| /// having inserted the bridge and reserved KV before this runs, and for | ||
| /// wiring `continuation.onTermination`. | ||
| func runGreedyFastPath( | ||
| requestId id: String, | ||
| container: ModelContainer, | ||
| promptTokens: [Int], | ||
| maxTokens: Int, | ||
| continuation: AsyncStream<GenerationEvent>.Continuation | ||
| ) { | ||
| let scheduler = self | ||
| let promptCount = promptTokens.count | ||
| let task = Task { | ||
| // Token-only input (no media). `MLXArray(promptTokens)` is a cheap | ||
| // host-side copy; the GPU work happens inside `generate`. | ||
| let lmInput = LMInput(tokens: MLXArray(promptTokens)) | ||
| // temperature 0 ⇒ ArgMaxSampler. topP/topK/penalties left at their | ||
| // defaults are inert under greedy. maxTokens bounds the decode. | ||
| let params = GenerateParameters(maxTokens: maxTokens, temperature: 0) | ||
|
|
||
| // Admission ≈ now: prefill is about to begin. Drives the | ||
| // pending-timeout predicate and starts the prefill-EWMA window. | ||
| await scheduler.recordAdmission(requestId: id, at: .now) | ||
|
|
||
| let genStream: AsyncStream<Generation> | ||
| do { | ||
| genStream = try await container.generate(input: lmInput, parameters: params) | ||
| } catch { | ||
| _ = await scheduler.recordFinish( | ||
| requestId: id, promptTokens: promptCount, | ||
| completionTokens: 0, success: false) | ||
| continuation.yield(.error( | ||
| "fast path generation failed: \(error.localizedDescription)")) | ||
| continuation.finish() | ||
| await scheduler.clearFastPathTask(id) | ||
| return | ||
| } | ||
|
|
||
| var sawFirstToken = false | ||
| // Count every streamed chunk as >= 1 completion token. The terminal | ||
| // `.info` carries the EXACT generation count, but it only arrives on a | ||
| // clean finish; on cancellation the loop breaks before it, so without | ||
| // this running tally `recordFinish` would settle at 0 completion tokens | ||
| // and the coordinator would bill $0 for work already streamed to the | ||
| // client. `recordFinish` takes max(observed, terminal), so a clean | ||
| // finish still uses the exact `.info` count (>= the chunk tally). | ||
| var streamedTokens = 0 | ||
| var terminalCompletion: Int? = nil | ||
| var reportedPrompt = promptCount | ||
| // Defensive: the greedy text-only fast path should never see a parsed | ||
| // tool call (tool requests are kept on the engine path by the caller's | ||
| // `allowFastPath` gate). If one is surfaced anyway we cannot faithfully | ||
| // reproduce the engine's raw-text behavior, so we FAIL rather than drop. | ||
| var sawToolCall = false | ||
|
|
||
| for await gen in genStream { | ||
| // Cooperative cancellation: a client cancel / model reload cancels | ||
| // this task; break and let the finish bookkeeping below run. | ||
| if Task.isCancelled { break } | ||
| switch gen { | ||
| case .chunk(let text): | ||
| if !sawFirstToken { | ||
| sawFirstToken = true | ||
| await scheduler.recordFirstToken(requestId: id, at: .now) | ||
| } | ||
| streamedTokens += 1 | ||
| if !text.isEmpty { | ||
| continuation.yield(.chunk(text)) | ||
| } | ||
| case .info(let info): | ||
| reportedPrompt = info.promptTokenCount | ||
| terminalCompletion = info.generationTokenCount | ||
| case .toolCall: | ||
| // `container.generate` parsed a tool call (and may have | ||
| // CONSUMED its text rather than emitting it as `.chunk`s). | ||
| // Silently dropping it would lose the call; the engine path | ||
| // emits raw text and never `.toolCall`, so we cannot match it | ||
| // here. Mark failure and stop. | ||
| sawToolCall = true | ||
| } | ||
| if sawToolCall { break } | ||
| } | ||
|
|
||
| let cancelled = Task.isCancelled | ||
| // Billing-safe completion count: terminal exact count when present, | ||
| // otherwise the streamed-chunk lower bound (covers cancel + tool-call | ||
| // failure, where no `.info` arrived). | ||
| let completionTokens = max(terminalCompletion ?? 0, streamedTokens) | ||
| let succeeded = !cancelled && !sawToolCall | ||
| // Reuse the engine bridge's finish bookkeeping: removes the bridge, | ||
| // updates the decode + prefill EWMA, releases the KV reservation, and | ||
| // returns billing-safe usage counts (max of observed vs. terminal). | ||
| let usage = await scheduler.recordFinish( | ||
| requestId: id, | ||
| promptTokens: reportedPrompt, | ||
| completionTokens: completionTokens, | ||
| success: succeeded) | ||
|
|
||
| // Emit delivered usage (so a listener can bill partial work) before | ||
| // any terminal error, mirroring the engine bridge. | ||
| if !succeeded, usage.promptTokens > 0 || usage.completionTokens > 0 { | ||
| continuation.yield(.info( | ||
| promptTokens: usage.promptTokens, | ||
| completionTokens: usage.completionTokens, | ||
| tokensPerSecond: usage.tps)) | ||
| } | ||
| if cancelled { | ||
| continuation.yield(.error("request cancelled")) | ||
| } else if sawToolCall { | ||
| continuation.yield(.error( | ||
| "fast path does not support tool calls; please retry")) | ||
| } else { | ||
| continuation.yield(.info( | ||
| promptTokens: usage.promptTokens, | ||
| completionTokens: usage.completionTokens, | ||
| tokensPerSecond: usage.tps)) | ||
| } | ||
| continuation.finish() | ||
| await scheduler.clearFastPathTask(id) | ||
| } | ||
| fastPathTasks[id] = task | ||
| } | ||
|
|
||
| // MARK: - Task tracking / teardown | ||
|
|
||
| /// Remove a finished fast-path task handle. Called by the task itself on | ||
| /// completion. Safe for an unknown id. | ||
| func clearFastPathTask(_ id: String) { | ||
| fastPathTasks.removeValue(forKey: id) | ||
| } | ||
|
|
||
| /// Cancel the in-flight fast-path task for `id`, if any. Returns true when a | ||
| /// task existed and was cancelled. The task observes `Task.isCancelled`, | ||
| /// runs its finish bookkeeping (KV release, bridge removal, terminal events) | ||
| /// and clears its own handle. | ||
| @discardableResult | ||
| func cancelFastPathTask(_ id: String) -> Bool { | ||
| guard let task = fastPathTasks[id] else { return false } | ||
| task.cancel() | ||
| return true | ||
| } | ||
|
|
||
| /// Cancel every in-flight fast-path task (model reload / `cancelAll`). Each | ||
| /// task self-removes; callers that also clear `fastPathTasks` (e.g. | ||
| /// `stopCurrentEngine`) make late `clearFastPathTask` calls harmless no-ops. | ||
| func cancelAllFastPathTasks() { | ||
| for task in fastPathTasks.values { task.cancel() } | ||
| } | ||
|
|
||
| /// Cancel AND fence every in-flight fast-path task — used by | ||
| /// `stopCurrentEngine` before it nil's `modelContainer` and clears the MLX | ||
| /// cache. Unlike the engine (which is fenced by `stopAndWait`), a fast-path | ||
| /// task runs off-engine inside `ModelContainer.generate`, holding and running | ||
| /// GPU work against the model + its KV cache. If teardown freed that state | ||
| /// while a task were still mid-`generate`, it could touch released model/MLX | ||
| /// state. Awaiting each task's value blocks until it has observed | ||
| /// cancellation, run its finish bookkeeping (KV release + bridge removal + | ||
| /// terminal events) and dropped its model/iterator references. | ||
| /// | ||
| /// The handles are snapshotted first so a self-removing `clearFastPathTask` | ||
| /// during the awaits cannot mutate the collection being iterated. The await | ||
| /// suspends the actor so those actor-isolated callbacks make progress; no NEW | ||
| /// fast path can start meanwhile because `stopCurrentEngine` has already | ||
| /// nil'd `engine` (every submit path short-circuits on a nil engine). | ||
| /// Idempotent: a no-op when nothing is in flight. | ||
| func waitForFastPathTasks() async { | ||
| let inflight = Array(fastPathTasks.values) | ||
| guard !inflight.isEmpty else { return } | ||
| for task in inflight { task.cancel() } | ||
| for task in inflight { await task.value } | ||
| fastPathTasks.removeAll() | ||
| } | ||
| } | ||
|
|
||
| // MARK: - Test support | ||
| // | ||
| // Internal + `@testable`-only; dead-code-stripped from production binaries. | ||
|
|
||
| extension BatchScheduler { | ||
| /// Force the B=1 fast-path enablement gate on/off, bypassing the env flags. | ||
| /// `nil` restores env-driven behavior. Lets a benchmark A/B the fast path vs. | ||
| /// the batched engine in a single process (mutating `ProcessInfo`'s cached | ||
| /// environment mid-run is unreliable). | ||
| func _setForceB1FastPathForTest(_ value: Bool?) { | ||
| _forceB1FastPathForTest = value | ||
| } | ||
|
|
||
| /// Test accessor: number of in-flight fast-path tasks currently tracked. | ||
| func _fastPathTaskCountForTest() -> Int { fastPathTasks.count } | ||
| } |
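Two of the rules in this file reduce to small pure computations that can be checked in isolation: the billing-safe completion count and the context-window gate. The sketch below restates them as standalone helpers. `billableCompletionTokens` and `fitsContext` are hypothetical names introduced for illustration only; the PR itself keeps this logic inline in `runGreedyFastPath` and `b1FastPathEligiblePure`.

```swift
/// Hypothetical helper mirroring `max(terminalCompletion ?? 0, streamedTokens)`.
/// `terminal` is the exact count from the final `.info` event, if one arrived;
/// `streamed` is the running tally of chunks seen so far.
func billableCompletionTokens(terminal: Int?, streamed: Int) -> Int {
    // On cancellation no terminal count exists, so the chunk tally is used
    // as a lower bound instead of settling at zero.
    max(terminal ?? 0, streamed)
}

/// Hypothetical helper mirroring the context-window gate.
/// A `maxContextLength` of 0 means "unknown", so the check is skipped.
func fitsContext(promptTokens: Int, maxTokens: Int, maxContextLength: Int) -> Bool {
    guard maxContextLength > 0 else { return true }
    return promptTokens + maxTokens <= maxContextLength
}

// Cancelled after 7 chunks, no terminal `.info`: 7 tokens are counted.
let cancelledCount = billableCompletionTokens(terminal: nil, streamed: 7)
// Clean finish: the exact terminal count is used.
let cleanCount = billableCompletionTokens(terminal: 128, streamed: 120)
// 8000 + 500 exceeds an 8192-token window, so the request is not eligible.
let overflow = fitsContext(promptTokens: 8000, maxTokens: 500, maxContextLength: 8192)
```

Keeping such rules as pure functions is the same design choice the diff makes for `b1FastPathEligiblePure`: they can be exercised without a loaded model.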
🔴 [CRITICAL] 🧩 370-line fast path implementation with single consumer adds significant complexity
💡 Suggestion: Consider whether this optimization justifies the maintenance burden: the fast path duplicates engine functionality, adds new failure modes, and requires careful coordination with the existing batched engine
📊 Score: 4×4 = 16 · Category: over-abstraction
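For weighing the coordination cost, it may help to see the teardown fence from `waitForFastPathTasks` reduced to its core pattern: snapshot the handles, cancel them, then await each one. The sketch below is a simplified stand-in under stated assumptions, not the PR's scheduler: `TaskRegistry`, `track`, `clear`, `cancelAndWait`, and `demo` are hypothetical names.

```swift
import Foundation

/// Hypothetical stand-in for the scheduler's `fastPathTasks` bookkeeping.
actor TaskRegistry {
    private var tasks: [String: Task<Void, Never>] = [:]

    var count: Int { tasks.count }

    func track(_ id: String, _ task: Task<Void, Never>) { tasks[id] = task }

    /// Called by a task when it finishes. Safe for an unknown id.
    func clear(_ id: String) { tasks.removeValue(forKey: id) }

    /// Cancel every tracked task and wait until each has fully exited.
    func cancelAndWait() async {
        // Snapshot first: tasks remove themselves via `clear` while we await.
        let inflight = Array(tasks.values)
        guard !inflight.isEmpty else { return }
        for task in inflight { task.cancel() }
        // Each await suspends the actor, so the self-removing `clear` calls
        // from the tasks can run in the meantime.
        for task in inflight { await task.value }
        tasks.removeAll()
    }
}

/// Usage sketch: one long-running task that cleans up after cancellation.
func demo() async -> Int {
    let registry = TaskRegistry()
    let worker = Task {
        // Task.sleep returns early (by throwing) when the task is cancelled.
        try? await Task.sleep(nanoseconds: 60_000_000_000)
        await registry.clear("req-1")
    }
    await registry.track("req-1", worker)
    await registry.cancelAndWait()
    return await registry.count
}
```

Once `cancelAndWait` returns, no tracked task is still running, which is the property the teardown relies on before it releases model state.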
Summary
- `libs/mlx-swift-lm` at `Layr-Labs/mlx-swift-lm@main` after feat: inherit Gemma4 vMLX decode stack mlx-swift-lm#56.
- `libs/mlx-swift` at `e20ea3d` (`feat/inherit-upstream-2026-06`) intentionally, not `main`, because `mlx-swift@main` currently declares `swift-tools-version: 6.3;(experimentalCGen)` while d-inference is pinned to Swift 6.1.
- `Gemma4DecodeProfileTests` live benchmark for Gemma-4-26B-A4B raw B=1 decode and optional text sample output.
Results
- `mlx-community/gemma-4-26b-a4b-it-8bit` (M4 Max, B=1, 128 generated tokens).
- `mlx-vlm` on the same machine/model: 69.7 TPS.
- `DARKBLOOM_GEMMA_PRINT_TEXT=1`.
Before
After
Verification
- `cd /tmp/dbval/gemma4rawbench && swift run -c release Gemma4RawBench` against this repo's local packages: 79.5 TPS.
- `/tmp/dbval/gemma4_1k_output.txt`, `/tmp/dbval/gemma4_1k_prompt_ids.json`, `/tmp/dbval/gemma4_1k_output_ids.json`.
- `cd /tmp/dbval/d-inference-pr-main/provider-swift && swift build --build-tests`: passed after cleaning stale SwiftPM cache.
- `DARKBLOOM_LIVE_MLX_TESTS=1 DARKBLOOM_LIVE_MLX_GEMMA=1 swift test --filter Gemma4DecodeProfileTests`.
Notes
- `mlx-swift@main` should be adopted later only after either d-inference moves to Swift 6.3 or mlx-swift main restores a Swift-6.1-compatible manifest. This PR keeps the documented Swift 6.1 toolchain path working.
- `swift test -c release` remains blocked by an unrelated `kv-se-harness` compile issue, so the fair release throughput number was measured with a small temp executable depending on the same local packages.