Releases: leehack/llamadart

v0.8.10

26 Jun 21:34
88af21f

What's Changed

  • fix: use platform-appropriate model cache defaults by @leehack in #248
  • docs: restore model cache release note history by @leehack in #250
  • chore: prepare 0.8.10 release by @leehack in #249
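The notes above do not say which directories the model cache fix selects. As a rough illustration of what "platform-appropriate" cache defaults usually mean, the sketch below follows common OS conventions (XDG on Linux, ~/Library/Caches on macOS, %LOCALAPPDATA% on Windows). The function name, the "llamadart" directory name, and the "Cache" subfolder are assumptions made for illustration, not the package's actual behavior.

```python
from pathlib import PurePosixPath, PureWindowsPath


def default_cache_dir(platform: str, env: dict, home: str, app: str = "llamadart") -> str:
    """Return a conventional per-user cache directory for the given platform.

    `platform` is one of "windows", "macos", or anything else (treated as
    Linux/Unix). `env` is a mapping of environment variables and `home` is the
    user's home directory; both are passed in so the logic is easy to test.
    The layout is an assumption based on OS conventions only.
    """
    if platform == "windows":
        # Prefer %LOCALAPPDATA%; fall back to <home>\AppData\Local.
        base = env.get("LOCALAPPDATA") or str(PureWindowsPath(home) / "AppData" / "Local")
        return str(PureWindowsPath(base) / app / "Cache")
    if platform == "macos":
        return str(PurePosixPath(home) / "Library" / "Caches" / app)
    # Linux and other Unix-likes: honor XDG_CACHE_HOME, else ~/.cache.
    base = env.get("XDG_CACHE_HOME") or str(PurePosixPath(home) / ".cache")
    return str(PurePosixPath(base) / app)
```

Passing the environment and home directory as arguments keeps the function free of global state, so each platform branch can be exercised on any host.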

Full Changelog: v0.8.9...v0.8.10

v0.8.9

26 Jun 16:06
291e448

What's Changed

  • chore: restore pub score compatibility by @leehack in #247

Full Changelog: v0.8.8...v0.8.9

v0.8.8

26 Jun 13:34
b7ecf15

What's Changed

  • Add cross-platform model cache strategies by @leehack in #242
  • chore(native): sync native release b9803 by @github-actions[bot] in #244
  • Prepare 0.8.8 release by @leehack in #245
  • docs: bump llama.cpp Flutter companion snippet by @leehack in #246

Full Changelog: v0.8.7...v0.8.8

v0.8.7

25 Jun 12:58
a34ddc6

0.8.7

  • Fixed multimodal chat-template rendering so templates that force-open reasoning, for example Qwen3.5 VLM prompts ending with <think>, preserve enable_thinking and stream generated reasoning through delta.thinking instead of delta.content.
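
The routing described in this fix can be pictured with a small stand-alone sketch. When the rendered prompt already ends with <think>, the model's first tokens are reasoning, so they belong in a thinking field until </think> appears; everything after goes to content. The Delta class and route_stream function below are hypothetical names for illustration and are not the llamadart API; the sketch also assumes the closing tag is the literal string </think>.

```python
from dataclasses import dataclass

THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


@dataclass
class Delta:
    thinking: str = ""
    content: str = ""


def _partial_suffix_len(text: str, tag: str) -> int:
    """Length of the longest proper prefix of `tag` that `text` ends with."""
    for k in range(min(len(tag) - 1, len(text)), 0, -1):
        if text.endswith(tag[:k]):
            return k
    return 0


def route_stream(prompt: str, chunks):
    """Yield Delta objects, routing text to `thinking` until </think> is seen."""
    # A template that force-opens reasoning leaves the prompt ending in <think>.
    in_thinking = prompt.rstrip().endswith(THINK_OPEN)
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        if in_thinking:
            idx = buffer.find(THINK_CLOSE)
            if idx < 0:
                # Hold back a possible partial closing tag split across chunks.
                keep = _partial_suffix_len(buffer, THINK_CLOSE)
                emit = buffer[: len(buffer) - keep]
                if emit:
                    yield Delta(thinking=emit)
                buffer = buffer[len(buffer) - keep:]
                continue
            if idx > 0:
                yield Delta(thinking=buffer[:idx])
            buffer = buffer[idx + len(THINK_CLOSE):]
            in_thinking = False
        if buffer:
            yield Delta(content=buffer)
            buffer = ""
    if buffer:
        # Stream ended mid-tag: flush whatever is left to the active channel.
        yield Delta(thinking=buffer) if in_thinking else Delta(content=buffer)


def collect(prompt: str, chunks):
    """Concatenate a routed stream into (thinking, content) strings."""
    deltas = list(route_stream(prompt, chunks))
    return "".join(d.thinking for d in deltas), "".join(d.content for d in deltas)
```

For example, collect("user: hi\n<think>\n", ["Let me ", "think.</th", "ink>Hello", " there"]) separates the reasoning text from the visible answer even though the closing tag is split across two chunks.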

Validation highlights:

  • dart pub publish --dry-run
  • GitHub CI on PR #240
  • Local real-model smokes: Qwen3.5 multimodal macOS repro and Gemma 4 GGUF chat features smoke
  • Tag-triggered Publish to pub.dev workflow
  • Tag-triggered Docs Version Cut workflow
  • GitHub Pages docs deploy
  • ./tool/docs/validate_links.sh

v0.8.6

24 Jun 09:55
ed6283d

0.8.6

Patch release for the llama.cpp native runtime sync to leehack/llamadart-native@b9776.

  • Updated the default llama.cpp native runtime pin to leehack/llamadart-native@b9776.
  • Regenerated the matching Dart FFI bindings.
  • Refreshed the llamadart_llama_cpp_flutter Apple SwiftPM checksum.
  • Aligned README and website native override docs for the new native pin and companion install snippets.

Validation highlights:

  • dart pub publish --dry-run
  • GitHub CI on PR #237 and post-merge main
  • LiteRT-LM Smoke on post-merge main
  • ./tool/docs/validate_links.sh
  • Tag-triggered Publish to pub.dev workflow
  • Tag-triggered Docs Version Cut workflow

v0.8.5

23 Jun 00:37
39adb93

0.8.5

Patch release for the mtmd split-library fallback ABI used by multimodal llama.cpp loads.

  • Fixed the split-library mtmd fallback ABI for image and byte-buffer multimodal inputs so Windows mtmd.dll and other split mtmd native bundles use the same bitmap helper signature as the generated native binding path.
  • This avoids corrupting the first mtmd bitmap-helper call for Gemma 4/MMProj style multimodal loads.
  • Added native symbol regression coverage for the fallback ABI.

Validation highlights:

  • dart pub publish --dry-run
  • GitHub CI on PR #235
  • LiteRT-LM Smoke on PR #235
  • ./tool/docs/validate_links.sh

v0.8.3

19 Jun 13:03
ad48b13

0.8.3

Patch release for Windows CUDA backend discovery in native-assets builds.

  • Fixed Windows CUDA backend discovery when the native asset bundle directory is
    not on the app PATH.
  • llama.cpp backend modules are now loaded from their resolved bundle path in a
    way that lets colocated CUDA redistributables such as cudart64_12.dll,
    cublas64_12.dll, and cublasLt64_12.dll resolve correctly.

Validation highlights:

  • dart pub publish --dry-run
  • GitHub CI on PR #228 and post-merge main
  • Publish to pub.dev
  • Docs Version Cut
  • Docs Pages

v0.8.2

18 Jun 13:08
2507608

  • Updated the default llama.cpp native runtime pin to
    leehack/llamadart-native@b9694, regenerated matching Dart FFI bindings,
    refreshed the llamadart_llama_cpp_flutter Apple SwiftPM checksum, and
    updated the default WebGPU bridge asset pin to
    leehack/llama-web-bridge-assets@v0.1.17 (llama.cpp b9699). The WebGPU
    backend now caps unset large-model browser batches so Gemma 4 mem64 loads do
    not fall back to context-sized compute buffers.

  • Added BackendGpuEnumeration.listGpuDevices({probeBackends}) (exposed via
    LlamaEngine.listGpuDevices) to enumerate GPU-class devices for offload
    selection. Each entry reports the backend, per-backend mainGpu index,
    name, description, device id, type, and free/total memory. With an empty
    probeBackends, only already-registered backends are inspected, so an
    unsupported GPU runtime cannot crash the process during enumeration; pass
    specific backends to opt into loading just those modules first. On
    Web/WebGPU the list is always empty.

  • Added Cohere2 MoE / North Code chat-template detection and parsing so
    <|START_TEXT|> responses and <|START_ACTION|> tool-call arrays are
    handled separately from older Command-R templates.
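
One way the GPU enumeration fields above might feed an offload decision is sketched below. The GpuDevice record and pick_device helper are hypothetical stand-ins that mirror a few of the fields named in the notes (backend, mainGpu index, name, free/total memory); they are not the Dart types returned by listGpuDevices, and the "most free memory wins" policy is only an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class GpuDevice:
    # Hypothetical mirror of a few enumerated fields, for illustration only.
    backend: str
    main_gpu: int
    name: str
    free_bytes: int
    total_bytes: int


def pick_device(devices: Sequence[GpuDevice], required_bytes: int) -> Optional[GpuDevice]:
    """Return the device with the most free memory that fits the model.

    Returns None when nothing fits or the list is empty (for instance on a
    platform that reports no devices), so the caller can stay on CPU.
    """
    candidates = [d for d in devices if d.free_bytes >= required_bytes]
    if not candidates:
        return None
    return max(candidates, key=lambda d: d.free_bytes)
```

Returning None instead of raising keeps the empty-list case an ordinary branch rather than an error path.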

v0.8.1

12 Jun 14:29
e972106

  • Fixed docs references that still pointed at llamadart_litert_lm_flutter
    0.0.1 and the pre-native.1 LiteRT-LM release after the 0.8.0 native pin
    sync moved LiteRT-LM Apple/runtime artifacts to v0.13.1-native.1.
  • Routed native .litertlm image/audio chat parts through LiteRT-LM
    Conversation message JSON so bundles with native media processors can accept
    LlamaImageContent / LlamaAudioContent path and encoded-byte inputs without
    a separate mmproj projector.

v0.8.0

10 Jun 22:40
7d41702

  • Flutter Apple runtime packaging:
    • Split SwiftPM-linked Apple runtime packaging out of the core package into
      llamadart_llama_cpp_flutter for GGUF/llama.cpp and
      llamadart_litert_lm_flutter for .litertlm/LiteRT-LM. These companion
      packages live under packages/ in this repository and publish as separate
      pub.dev packages.
    • Removed Flutter plugin metadata from llamadart so pure Dart/native-assets
      consumers can keep using the core package without taking a Flutter SDK
      constraint.
    • Started the companion packages at 0.0.1; native pin sync bumps only the
      affected companion package patch version. Companion package publishing uses
      package-specific tags after the first manual pub.dev publish, and skips
      companion versions that already exist on pub.dev.
    • Changed unset or empty llamadart_native_runtimes to include all available
      runtime families. For Flutter iOS/macOS app builds, installed companion
      packages decide Apple SPM runtimes; for every other build,
      llamadart_native_runtimes remains the selector.
    • Updated the default llama.cpp native runtime pin to
      leehack/llamadart-native@b9587, regenerated matching Dart FFI bindings,
      and refreshed the llamadart_llama_cpp_flutter Apple SwiftPM checksum.
  • MTP benchmarking diagnostics:
    • Added llama.cpp speculative decoding perf diagnostics for decode timing,
      draft/accepted token counts, draft verification timing, and acceptance
      rate so MTP benchmarks can separate backend decode cost from drafting
      overhead.
    • Extended local macOS and chat app benchmark outputs with the new
      diagnostics and added focused llama.cpp MTP smoke/benchmark tools for
      baseline-vs-MTP comparisons.
  • llama.cpp MTP runtime support:
    • Added SpeculativeDecodingConfig.mtp(draftModelPath: ...) for llama.cpp
      external draft-model MTP sessions, with draft model caching and cleanup
      tied to the target model lifetime.
    • Removed the Android Vulkan MTP allow-list dart define and the
      model-name-based Android Vulkan acceleration shortcut; Vulkan MTP now
      runs only when callers explicitly request Vulkan plus MTP in runtime
      parameters.
  • Structured output:
    • Added responseFormat routing to LlamaEngine.create(...) for
      grammar-capable backends, deprecated the legacy chatTemplate(...)
      jsonSchema shortcut, and made strict response-format requests fail early
      on LiteRT-LM instead of silently degrading to unconstrained generation.
  • LiteRT-LM chat parity:
    • Routed eligible native .litertlm text chat through LiteRT-LM Conversation
      APIs so structured history, system messages, tool declarations, and
      template extra context reach the runtime without a Dart-rendered prompt.
      Unsupported cases still fall back to the existing Dart chat-template path.
  • LiteRT-LM runtime tuning controls:
    • Added opt-in native .litertlm ModelParams for
      liteRtLmActivationDataType, liteRtLmPrefillChunkSize,
      liteRtLmParallelFileSectionLoading, and liteRtLmDispatchLibDir,
      forwarding the pinned LiteRT-LM v0.13.1 engine-settings C APIs while
      keeping defaults unchanged.
    • Extended the LiteRT-LM engine smoke tool with matching environment
      variables so real-model runs can validate load time, prefill throughput,
      decode throughput, and selected runtime settings.
    • Documented support decisions for each candidate native knob and kept
      LiteRT-LM web rejecting these native-only settings explicitly.
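
The MTP benchmarking diagnostics listed earlier in these 0.8.0 notes (draft/accepted token counts, decode timing, draft verification timing, acceptance rate) relate to each other in a simple way, sketched below. The SpecDecodeStats record and its field names are hypothetical, and the sketch assumes decode time and draft verification time are reported as separate, non-overlapping durations, which the notes do not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecDecodeStats:
    # Hypothetical container for the diagnostics named in the notes.
    drafted_tokens: int
    accepted_tokens: int
    decode_ms: float
    draft_verify_ms: float


def acceptance_rate(stats: SpecDecodeStats) -> float:
    """Fraction of drafted tokens that the target model accepted."""
    if stats.drafted_tokens == 0:
        return 0.0
    return stats.accepted_tokens / stats.drafted_tokens


def verify_share(stats: SpecDecodeStats) -> float:
    """Share of measured time spent verifying drafts (assumes disjoint timers)."""
    total = stats.decode_ms + stats.draft_verify_ms
    return stats.draft_verify_ms / total if total > 0 else 0.0
```

A run with a high acceptance rate but a large verification share would point at drafting overhead rather than backend decode cost, which is the distinction the new diagnostics are meant to expose.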