Releases: leehack/llamadart
v0.8.10
v0.8.9
v0.8.8
v0.8.7
- Fixed multimodal chat-template rendering so templates that force-open
  reasoning, for example Qwen3.5 VLM prompts ending with `<think>`, preserve
  `enable_thinking` and stream generated reasoning through `delta.thinking`
  instead of `delta.content`.
Validation highlights:
- `dart pub publish --dry-run`
- GitHub CI on PR #240
- Local real-model smokes: Qwen3.5 multimodal macOS repro and Gemma 4 GGUF chat features smoke
- Tag-triggered `Publish to pub.dev` workflow
- Tag-triggered `Docs Version Cut` workflow
- GitHub Pages docs deploy
- `./tool/docs/validate_links.sh`
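
As an illustration of the 0.8.7 fix, here is a minimal Python sketch of how a stream router could behave when the rendered prompt already ends with `<think>`. It is not the llamadart implementation: `route_stream` and the `thinking`/`content` dictionary keys are hypothetical names chosen for this example, and the sketch assumes the closing `</think>` tag arrives whole inside a single chunk.

```python
from typing import Dict, Iterable, Iterator

THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def route_stream(prompt: str, chunks: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield {"thinking": ...} or {"content": ...} deltas for generated chunks."""
    # If the template force-opened reasoning, generation starts inside it,
    # so the first generated tokens are reasoning, not answer text.
    in_thinking = prompt.rstrip().endswith(THINK_OPEN)
    for chunk in chunks:
        if in_thinking and THINK_CLOSE in chunk:
            before, after = chunk.split(THINK_CLOSE, 1)
            if before:
                yield {"thinking": before}
            in_thinking = False
            if after:
                yield {"content": after}
        elif chunk:
            key = "thinking" if in_thinking else "content"
            yield {key: chunk}
```

With a prompt ending in `<think>`, the chunks before `</think>` are emitted as thinking deltas and everything after it as content deltas; with a prompt that does not force-open reasoning, all chunks are content.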
v0.8.6
Patch release for the llama.cpp native runtime sync to leehack/llamadart-native@b9776.
- Updated the default llama.cpp native runtime pin to `leehack/llamadart-native@b9776`.
- Regenerated the matching Dart FFI bindings.
- Refreshed the `llamadart_llama_cpp_flutter` Apple SwiftPM checksum.
- Aligned README and website native override docs for the new native pin and companion install snippets.
Validation highlights:
- `dart pub publish --dry-run`
- GitHub CI on PR #237 and post-merge `main`
- LiteRT-LM Smoke on post-merge `main`
- `./tool/docs/validate_links.sh`
- Tag-triggered `Publish to pub.dev` workflow
- Tag-triggered `Docs Version Cut` workflow
v0.8.5
Patch release for the mtmd split-library fallback ABI used by multimodal llama.cpp loads.
- Fixed the split-library mtmd fallback ABI for image and byte-buffer multimodal inputs so Windows
  `mtmd.dll` and other split mtmd native bundles use the same bitmap helper signature as the generated native binding path.
- This avoids corrupting the first mtmd bitmap-helper call for Gemma 4/MMProj style multimodal loads.
- Added native symbol regression coverage for the fallback ABI.
Validation highlights:
v0.8.3
Patch release for Windows CUDA backend discovery in native-assets builds.
- Fixed Windows CUDA backend discovery when the native asset bundle directory is
  not on the app `PATH`.
- llama.cpp backend modules are now loaded from their resolved bundle path in a
  way that lets colocated CUDA redistributables such as `cudart64_12.dll`,
  `cublas64_12.dll`, and `cublasLt64_12.dll` resolve correctly.
Validation highlights:
- `dart pub publish --dry-run`
- GitHub CI on PR #228 and post-merge `main`
- `Publish to pub.dev`
- `Docs Version Cut`
- `Docs Pages`
v0.8.2
- Updated the default llama.cpp native runtime pin to
  `leehack/llamadart-native@b9694`, regenerated matching Dart FFI bindings,
  refreshed the `llamadart_llama_cpp_flutter` Apple SwiftPM checksum, and
  updated the default WebGPU bridge asset pin to
  `leehack/llama-web-bridge-assets@v0.1.17` (llama.cpp `b9699`). The WebGPU
  backend now caps unset large-model browser batches so Gemma 4 mem64 loads do
  not fall back to context-sized compute buffers.
- Added `BackendGpuEnumeration.listGpuDevices({probeBackends})` (exposed via
  `LlamaEngine.listGpuDevices`) to enumerate GPU-class devices for offload
  selection: backend, per-backend `mainGpu` index, name, description, device
  id, type, and free/total memory per device. With an empty `probeBackends`
  only already-registered backends are inspected, so an unsupported GPU runtime
  cannot crash the process during enumeration; pass specific backends to opt
  into loading just those modules first. Web/WebGPU return an empty list.
- Added Cohere2 MoE / North Code chat-template detection and parsing so
  `<|START_TEXT|>` responses and `<|START_ACTION|>` tool-call arrays are
  handled separately from older Command-R templates.
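
To show how the device list added in v0.8.2 might feed an offload decision, the following Python sketch picks the device with the most free memory that can still fit a model. It does not call llamadart: the `GpuDevice` record is a hypothetical mirror of a few of the fields named above (backend, `mainGpu` index, name, free/total memory), and `pick_device` is an illustrative helper, not a library function.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class GpuDevice:
    # Hypothetical record; field names are assumptions for this sketch.
    backend: str
    main_gpu: int
    name: str
    free_bytes: int
    total_bytes: int


def pick_device(devices: Sequence[GpuDevice], needed_bytes: int) -> Optional[GpuDevice]:
    """Return the device with the most free memory that fits the model."""
    candidates = [d for d in devices if d.free_bytes >= needed_bytes]
    # An empty list (for example on Web/WebGPU) yields None, which a caller
    # could treat as "stay on the CPU".
    return max(candidates, key=lambda d: d.free_bytes, default=None)
```

A caller would pass the chosen entry's backend and `main_gpu` index on to its model-loading parameters; how that is done in llamadart is outside the scope of this sketch.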
v0.8.1
- Fixed docs references that still pointed at `llamadart_litert_lm_flutter`
  `0.0.1` and the pre-`native.1` LiteRT-LM release after the 0.8.0 native pin
  sync moved LiteRT-LM Apple/runtime artifacts to `v0.13.1-native.1`.
- Routed native `.litertlm` image/audio chat parts through LiteRT-LM
  Conversation message JSON so bundles with native media processors can accept
  `LlamaImageContent`/`LlamaAudioContent` path and encoded-byte inputs without
  a separate `mmproj` projector.
v0.8.0
- Flutter Apple runtime packaging:
  - Split SwiftPM-linked Apple runtime packaging out of the core package into
    `llamadart_llama_cpp_flutter` for GGUF/llama.cpp and
    `llamadart_litert_lm_flutter` for `.litertlm`/LiteRT-LM. These companion
    packages live under `packages/` in this repository and publish as separate
    pub.dev packages.
  - Removed Flutter plugin metadata from `llamadart` so pure Dart/native-assets
    consumers can keep using the core package without taking a Flutter SDK
    constraint.
  - Started the companion packages at `0.0.1`; native pin sync bumps only the
    affected companion package patch version. Companion package publishing uses
    package-specific tags after the first manual pub.dev publish, and skips
    companion versions that already exist on pub.dev.
  - Changed unset or empty `llamadart_native_runtimes` to include all available
    runtime families. For Flutter iOS/macOS app builds, installed companion
    packages decide Apple SPM runtimes; for every other build,
    `llamadart_native_runtimes` remains the selector.
  - Updated the default llama.cpp native runtime pin to
    `leehack/llamadart-native@b9587`, regenerated matching Dart FFI bindings,
    and refreshed the `llamadart_llama_cpp_flutter` Apple SwiftPM checksum.
- MTP benchmarking diagnostics:
  - Added llama.cpp speculative decoding perf diagnostics for decode timing,
    draft/accepted token counts, draft verification timing, and acceptance
    rate so MTP benchmarks can separate backend decode cost from drafting
    overhead.
  - Extended local macOS and chat app benchmark outputs with the new
    diagnostics and added focused llama.cpp MTP smoke/benchmark tools for
    baseline-vs-MTP comparisons.
- llama.cpp MTP runtime support:
  - Added `SpeculativeDecodingConfig.mtp(draftModelPath: ...)` for llama.cpp
    external draft-model MTP sessions, with draft model caching and cleanup
    tied to the target model lifetime.
  - Removed the Android Vulkan MTP allow-list dart define and the model-name
    based Android Vulkan acceleration shortcut; Vulkan MTP now runs only when
    callers explicitly request Vulkan plus MTP in runtime parameters.
- Structured output:
  - Added `responseFormat` routing to `LlamaEngine.create(...)` for
    grammar-capable backends, deprecated the legacy `chatTemplate(...)`
    `jsonSchema` shortcut, and made strict response-format requests fail early
    on LiteRT-LM instead of silently degrading to unconstrained generation.
- LiteRT-LM chat parity:
  - Routed eligible native `.litertlm` text chat through LiteRT-LM Conversation
    APIs so structured history, system messages, tool declarations, and
    template extra context reach the runtime without a Dart-rendered prompt.
    Unsupported cases still fall back to the existing Dart chat-template path.
- LiteRT-LM runtime tuning controls:
  - Added opt-in native `.litertlm` `ModelParams` for
    `liteRtLmActivationDataType`, `liteRtLmPrefillChunkSize`,
    `liteRtLmParallelFileSectionLoading`, and `liteRtLmDispatchLibDir`,
    forwarding the pinned LiteRT-LM `v0.13.1` engine-settings C APIs while
    keeping defaults unchanged.
  - Extended the LiteRT-LM engine smoke tool with matching environment
    variables so real-model runs can validate load time, prefill throughput,
    decode throughput, and selected runtime settings.
  - Documented support decisions for each candidate native knob and kept
    LiteRT-LM web rejecting these native-only settings explicitly.
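
For readers interpreting the MTP benchmarking diagnostics listed for v0.8.0, the arithmetic behind an acceptance rate and a baseline-vs-MTP comparison can be sketched as below. This is a standalone Python illustration, not llamadart code: `SpecDecodeStats`, its field names, and both helper functions are hypothetical, and the acceptance rate is assumed to be accepted draft tokens divided by drafted tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecDecodeStats:
    # Hypothetical container for the draft/accepted token counts.
    draft_tokens: int
    accepted_tokens: int


def acceptance_rate(stats: SpecDecodeStats) -> float:
    """Fraction of drafted tokens that the target model accepted."""
    if stats.draft_tokens == 0:
        return 0.0  # nothing was drafted, so nothing could be accepted
    return stats.accepted_tokens / stats.draft_tokens


def speedup(baseline_tokens_per_s: float, mtp_tokens_per_s: float) -> float:
    """Decode throughput of the MTP run relative to the baseline run."""
    if baseline_tokens_per_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return mtp_tokens_per_s / baseline_tokens_per_s
```

A low acceptance rate combined with a speedup below 1.0 would suggest that drafting overhead outweighs the tokens it saves for that model and backend.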