llamadart 0.6.13
A Dart/Flutter plugin for llama.cpp - run LLM inference on any platform using GGUF models
0.6.13 #
- Model source download/cache manager:
- Added ModelSource for local paths, HTTP(S) URLs, and Hugging Face hf://owner/repo/path/to/model.gguf references, including deterministic cache keys and redacted metadata/log identities for signed URLs.
- Added ModelLoadOptions, ModelCachePolicy, resolver targets, and download/cache metadata/progress value models for package-managed model download and cache management.
- Added native/file-backed DefaultModelDownloadManager support for streaming HTTP downloads, .part files, atomic promotion, persisted metadata, authenticated bearer/custom headers, cancellation, retry, Range resume, cache hit/refresh/cache-only/no-cache policies, SHA-256 verification, cache listing, removal, clearing, and age/size pruning.
- Improved Hugging Face source ergonomics: hf:// references now accept ?revision=... for branch/ref names containing slashes, and docs clarify current single-file behavior, private/gated bearer-token usage, separate mmproj asset handling, sharded-GGUF limitations, and redaction guarantees.
- Serialized concurrent stable-cache downloads for the same remote cache entry across manager instances so duplicate callers do not race on shared .part files or metadata, while distinct cache entries can still download in parallel and waiting-caller cancellation does not cancel the active download.
- Hardened versioned cache metadata recovery: completed files can rebuild missing, malformed, or unsupported-schema sidecars without network access, while byte-count and stored/caller SHA-256 mismatches are treated as cache misses and safely re-downloaded.
- Clarified ModelSource.path(...) option semantics: local paths now reject remote/download-only options (non-default cache policies, cache directories, authenticated headers, resume, and retry overrides) while continuing to support cancellation and optional local SHA-256 verification.
- Added LlamaEngine.loadModelSource(...) to route local sources through the existing native local loader, remote sources through the native download cache before local loading, and simple remote sources through URL-capable web backends when available.
- Migrated server/testing helpers away from ad-hoc model downloads so examples dogfood the package-managed cache manager.
- State persistence API:
- Added LlamaEngine.supportsStatePersistence, LlamaEngine.stateSaveFile(...), and LlamaEngine.stateLoadFile(...) so callers can persist and restore llama.cpp KV-cache state for fast raw-prompt resume/fork workflows.
- Added BackendStatePersistence, BackendStatePersistenceSupport, and StateLoadResult for custom backend implementers and diagnostics.
- Documented that state files are opaque llama.cpp artifacts tied to the same model and runtime/build, that native paths use the app filesystem while web paths use the bridge WASMFS virtual filesystem, and that ChatSession message history must be persisted separately.
- Added WebGPU bridge state persistence wiring for bridge assets v0.1.15+, including Dart JS interop, backend forwarding, and browser integration test coverage.
- Compatibility note: no public API breaking changes in 0.6.13; existing loadModel(...) callers are unchanged. Code that probes state persistence support should prefer LlamaEngine.supportsStatePersistence over structural backend type checks so web/router backends can report bridge-version-dependent support accurately.
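To illustrate why carrying the ref in ?revision=... helps when branch/ref names contain slashes, here is a minimal, self-contained sketch of splitting an hf:// reference into its parts. It is not the package's resolver: HfRef and parseHfRef are hypothetical names introduced only for this example, and the package may validate or normalize references differently.

```dart
/// Hypothetical value type for the pieces of an hf:// reference.
class HfRef {
  final String owner;
  final String repo;
  final String filePath;
  final String? revision; // null when no ?revision=... was given

  const HfRef(this.owner, this.repo, this.filePath, this.revision);
}

/// Splits `hf://owner/repo/path/to/file.gguf[?revision=ref]` by hand.
/// Manual splitting is used instead of Uri.parse so the owner segment
/// keeps its original letter case (Uri lowercases the host component).
HfRef parseHfRef(String input) {
  const scheme = 'hf://';
  if (!input.startsWith(scheme)) {
    throw FormatException('Expected an hf:// reference', input);
  }
  final rest = input.substring(scheme.length);
  final queryStart = rest.indexOf('?');
  final pathPart = queryStart < 0 ? rest : rest.substring(0, queryStart);
  final query = queryStart < 0 ? '' : rest.substring(queryStart + 1);

  final segments = pathPart.split('/');
  if (segments.length < 3 || segments.any((s) => s.isEmpty)) {
    throw FormatException('Expected owner/repo/path', input);
  }

  // The revision travels in the query string, so a ref such as
  // "refs/pr/1" is never confused with the file path segments.
  final revision = Uri.splitQueryString(query)['revision'];

  return HfRef(
    segments[0],
    segments[1],
    segments.sublist(2).join('/'),
    revision,
  );
}

void main() {
  final ref = parseHfRef(
    'hf://owner/repo/path/to/model.gguf?revision=refs/pr/1',
  );
  print('${ref.owner}/${ref.repo} @ ${ref.revision}: ${ref.filePath}');
}
```

If the ref were embedded in the path instead, there would be no unambiguous way to tell where a slash-containing ref ends and the file path begins.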
0.6.12 #
- Native runtime sync:
- Updated native hook pinning to leehack/llamadart-native@b9016, picking up the CUDA 12.8 Blackwell-capable native bundles.
- Updated default web bridge asset pinning to leehack/llama-web-bridge-assets@v0.1.14 (llama.cpp b9016) so native and web runtimes track the same upstream revision.
- Picked up the bridge-side Qwen UTF-8 streaming stabilization and multimodal fallback narrowing, while preserving control-token output for parser consumers.
- Picked up the bridge-side BERT embedding thread-pool sizing fix so automatic thread selection does not exceed the compiled WebAssembly pthread pool.
- Load-time tuning knobs:
- Added ModelParams.useMmap (default true) and ModelParams.useMlock (default false), wired to llama_model_params.use_mmap/use_mlock. Lets callers turn off mmap for platforms where memory-mapped weights hurt throughput, or pin weights in RAM to avoid first-token paging spikes.
- Added ModelParams.flashAttention with the FlashAttention.{auto, enabled, disabled} enum, wired to llama_context_params.flash_attn_type. Explicit settings win over the existing automatic Android/Vulkan heuristics; auto preserves prior behavior.
- Added ModelParams.cacheTypeK and ModelParams.cacheTypeV with the KvCacheType.{f16, q8_0, q4_0} enum, wired to llama_context_params.type_k/type_v. Enables KV-cache quantization (Q8_0 ≈ halves KV memory; Q4_0 ≈ quarters it). When the user requests a non-F16 KV type with flashAttention: auto, the service auto-promotes flash attention to enabled, because llama.cpp requires it for KV quantization.
- Added ModelParams.kvUnified (nullable) for explicit override of llama_context_params.kv_unified. null keeps the existing auto-enable-when-multi-sequence behavior.
- Added ModelParams.ropeFrequencyBase and ModelParams.ropeFrequencyScale (both nullable) for context-extension overrides on llama_context_params.rope_freq_base/rope_freq_scale. null keeps the model's trained values.
- Forwarded native-compatible ModelParams load tuning knobs through the WebGPU bridge path, including maxParallelSequences, flash attention, KV-cache type, KV-unified, RoPE, split-mode, and main-GPU options.
- Matched native batch defaults on the WebGPU path so unset batchSize/microBatchSize cascade to n_batch = n_ctx and n_ubatch = n_batch, avoiding first-embedding aborts for BERT-class/non-causal encoder models while preserving explicit caller values and Qwen3.5 web tuning.
- GPU device selection API:
- Added ModelParams.mainGpu and wired it to llama.cpp llama_model_params.main_gpu.
- Added ModelParams.splitMode and wired it to llama.cpp llama_model_params.split_mode, enabling explicit single-GPU selection with ModelSplitMode.none.
- Windows split-bundle loader fix:
- Resolved ggml backend registry/device APIs from the loaded ggml runtime DLL when the generated default FFI asset cannot see those symbols, restoring explicit Vulkan device selection in Windows split bundles.
- Native packaging size fix:
- Filtered backend-owned runtime dependencies during native asset bundling so CUDA runtime DLLs and OpenBLAS runtime libraries are emitted only when their owning backend module is selected.
- Kept unknown non-core runtime libraries bundled for compatibility with future native bundle layouts.
- Compatibility note: no public API breaking changes in 0.6.12.
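The "≈ halves" and "≈ quarters" figures for KV-cache quantization follow from ggml's block layouts: 32 values take 64 bytes as F16, 34 bytes as Q8_0 (a 2-byte scale plus 32 int8 values), and 18 bytes as Q4_0 (a 2-byte scale plus 16 packed bytes). The sketch below works through that arithmetic. The model shape (32 layers, 8 KV heads, head size 128, 8192-token context) is a made-up example, kvCacheBytes is a hypothetical helper rather than package API, and the estimate assumes the same type for K and V and ignores any allocator overhead.

```dart
/// Bytes used by one block of 32 cache values in each ggml type.
const bytesPer32Values = {'f16': 64, 'q8_0': 34, 'q4_0': 18};

/// Approximate K+V cache size in bytes for a (hypothetical) model shape.
int kvCacheBytes({
  required int layers,
  required int kvHeads,
  required int headDim,
  required int contextSize,
  required String type,
}) {
  final valuesPerTensorSet = layers * kvHeads * headDim * contextSize;
  final values = 2 * valuesPerTensorSet; // one set for K, one for V
  return values ~/ 32 * bytesPer32Values[type]!;
}

void main() {
  for (final type in bytesPer32Values.keys) {
    final bytes = kvCacheBytes(
      layers: 32,
      kvHeads: 8,
      headDim: 128,
      contextSize: 8192,
      type: type,
    );
    print('$type: ${bytes ~/ (1024 * 1024)} MiB');
  }
}
```

For this shape the sketch reports 1024 MiB for f16, 544 MiB for q8_0, and 288 MiB for q4_0, that is about 53% and 28% of the F16 footprint.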
0.6.11 #
- Native runtime syncs:
- Updated native hook pinning and regenerated bindings through leehack/llamadart-native@b8955.
- Gemma 4 streaming fix:
- Parsed streamed <|channel>thought ... <channel|> blocks into thinking deltas instead of leaking Gemma 4 thought markers into content output.
- Added engine coverage for Gemma 4 thought-channel chunks split across native stream boundaries.
- Release stability:
- Tracked the chat app lockfile so generated Flutter plugin metadata stays stable in CI and release validation.
- Compatibility note: no public API breaking changes in 0.6.11.
0.6.10 #
- Native runtime syncs:
- Updated native hook pinning and regenerated bindings through leehack/llamadart-native@b8638.
- Multimodal context-safety hardening:
- Converted native multimodal prompt-evaluation overflow paths into Dart exceptions instead of allowing downstream sampling asserts.
- Downscaled staged chat-app image picks to a 384px max edge across Android, iOS, macOS, and Web to reduce multimodal context pressure.
- Added a local-only macOS Qwen3.5 multimodal repro harness plus CI-safe provider coverage for the new overflow guidance.
- Gemma 4 template support and multimodal capability gating:
- Added built-in Gemma 4 template detection, rendering, and parsing support, including thinking and tool-call handling.
- Added runtime projector capability checks so multimodal flows and the chat app gate image/audio input against supportsVision/supportsAudio instead of model-family assumptions.
- Documented current Gemma 4 projector behavior in the docs site and chat app guidance.
- Compatibility note: no public API breaking changes in 0.6.10.
0.6.9 #
- iOS deployment target alignment:
- Documented that iOS builds require a minimum deployment target of 16.4 or newer across the README, docs site, and example docs.
- Updated example/chat_app iOS Podfile and Runner project settings to use deployment target 16.4.
- Android backend safety:
- Honored ggml_backend_score during asset-based backend fallback so unsupported Android CPU variant libraries are skipped before initialization.
- Changed Android auto backend resolution to prefer CPU by default while keeping Vulkan available for explicit opt-in.
- Clarified that changing hooks.user_defines requires flutter clean && flutter pub get before rebuilding.
- Compatibility note: no public API breaking changes in 0.6.9.
0.6.8 #
- Native runtime sync:
- Updated native hook pinning and regenerated bindings to leehack/llamadart-native@b8480.
- Refreshed generated low-level FFI bindings to match the synced upstream headers.
- Compatibility note: no public API breaking changes in 0.6.8.
0.6.7 #
- Native runtime sync and Linux loader hardening:
- Updated native hook pinning and regenerated bindings to leehack/llamadart-native@b8373.
- Hardened Linux bundle loading for packaged apps and accepted versioned libllamadart mappings so colocated native dependencies resolve more reliably at runtime.
- Hermes tool-call parsing fix:
- Fixed Hermes handler parsing when whitespace appears between <tool_call> and the JSON payload.
- Compatibility note: no public API breaking changes in 0.6.7.
0.6.6 #
- Runtime syncs:
- Updated native hook pinning to leehack/llamadart-native@b8216.
- Updated default web bridge asset pinning to leehack/llama-web-bridge-assets@v0.1.10 (llama.cpp b8216).
- Qwen3.5 runtime stabilization (Android + Web):
- Switched bundled Qwen3.5 presets to Unsloth Q4_K_M GGUFs across the example catalog and tooling.
- Added Android-native perf diagnostics chips (p_eval, eval, sample, reuse) backed by llama.cpp context timings with manual timing fallback when built-in counters report zero.
- Restored a targeted Android Vulkan fast path for local Qwen3.5 0.8B/2B/4B models by re-enabling KQV/op-offload/flash-attention where stable.
- Updated Android chat app defaults to prefer CPU for Qwen3.5 0.8B and 2B, and reduced Android 0.8B context to 2048 for lower first-token latency.
- Hardened Android multimodal handling by downscaling staged images in the chat app and forcing Qwen3.5 0.8B projector work onto CPU on Android.
- Fixed WebGPU Qwen prompt/control-token handling and committed companion bridge-side streaming/multimodal fixes required by the local chat app runtime.
- Compatibility note: no public API breaking changes in 0.6.6.
0.6.5 #
- Embedding API (native backend capability):
- Added LlamaEngine.embed(...) and LlamaEngine.embedBatch(...) for direct vector generation.
- Added optional backend capability interface BackendEmbeddings for custom backend implementers.
- Added optional backend batch capability BackendBatchEmbeddings and worker-side batch embedding request/response path to reduce isolate round-trip overhead in embedBatch(...).
- Added ModelParams.maxParallelSequences (n_seq_max) so contexts can reserve multiple sequence slots for true multi-sequence embedding batches.
- Wired native isolate/worker/service embedding flow to llama.cpp embedding outputs with optional L2 normalization.
- Added embedding-focused tests for engine behavior and worker message contracts.
- Examples/docs:
- Added example/basic_app/bin/llamadart_embedding_example.dart.
- Added example/basic_app/bin/llamadart_sqlite_vector_example.dart for local embedding retrieval with SQLite vector search.
- Updated example docs and top-level README with embedding usage snippets.
- Added tool/testing/native_embedding_benchmark.dart to compare sequential embedding calls vs embedBatch(...) throughput (with optional --json-out).
- Added tool/testing/native_embedding_sweep.dart to run max-seq sweeps and dump CSV speedup reports for plotting.
- Web bridge sync:
- Added WebGPU bridge embedding APIs and wired web backend support for LlamaEngine.embed(...)/embedBatch(...).
- Updated default web bridge asset pinning to leehack/llama-web-bridge-assets@v0.1.8.
- Validated the v0.1.8 bridge bundle through local fetch-script checksum verification.
- WebGPU runtime tuning + multimodal stability (chat app/web):
- Reduced bridge log noise and improved runtime profile diagnostics for web sessions.
- Stabilized multimodal backend switching using resolved runtime mode behavior and added an E2E regression gate.
- Tuned streaming/typewriter pacing and token callback overhead to improve incremental render smoothness.
- Added GPU-path multimodal image-size capping to reduce runtime pressure on large image inputs.
- Chat app model catalog + stability:
- Updated example/chat_app recommended Qwen presets to the Qwen3.5 lineup (0.8B, 2B, 4B, 9B) and removed older Qwen2.5/Qwen3 defaults from the in-app library.
- Added multimodal projector (mmproj) wiring for Qwen3.5 model cards and tuned safer multimodal defaults (contextSize: 8192, maxTokens: 1024).
- Fixed Flutter text paint crashes caused by malformed UTF-16 streaming boundaries by aligning incremental reveal to surrogate-pair boundaries and sanitizing text/tool payload rendering paths.
- Added sanitizer unit coverage and refreshed chat-app README architecture/troubleshooting sections for multimodal and UTF-16 guidance.
- Compatibility note: no public API breaking changes in 0.6.5.
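The surrogate-pair alignment mentioned in the chat-app fix can be pictured as follows. Dart strings are sequences of UTF-16 code units, so cutting a partially revealed string between a high and a low surrogate leaves a lone surrogate that text rendering may reject. This is a minimal sketch of the general technique, not the chat app's code; safeRevealEnd is a hypothetical name.

```dart
bool _isHighSurrogate(int unit) => unit >= 0xD800 && unit <= 0xDBFF;
bool _isLowSurrogate(int unit) => unit >= 0xDC00 && unit <= 0xDFFF;

/// Returns the largest index <= [end] at which [text] can be cut without
/// separating a high surrogate from the low surrogate that follows it.
int safeRevealEnd(String text, int end) {
  final clamped = end < 0 ? 0 : (end > text.length ? text.length : end);
  if (clamped > 0 &&
      clamped < text.length &&
      _isHighSurrogate(text.codeUnitAt(clamped - 1)) &&
      _isLowSurrogate(text.codeUnitAt(clamped))) {
    return clamped - 1; // back off so the pair is revealed together later
  }
  return clamped;
}

void main() {
  const text = 'hi\u{1F600}!'; // the emoji occupies two UTF-16 code units
  for (var end = 0; end <= text.length; end++) {
    final cut = safeRevealEnd(text, end);
    print('$end -> $cut: "${text.substring(0, cut)}"');
  }
}
```

A typewriter-style reveal would pass its desired character count through a helper like this before calling substring, so an emoji appears in a single step rather than as half a pair.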
0.6.4 #
- Multimodal projector offload alignment:
- Updated native multimodal projector initialization to follow effective model-load configuration.
- CPU-only model settings (preferredBackend: cpu or gpuLayers: 0) now also disable mmproj GPU offload.
- Package metadata cleanup:
- Removed unused Flutter-only constraints/dependencies from the root pubspec.yaml (environment.flutter, flutter, path_provider, json_rpc_2, integration_test) to keep the core package pure Dart.
- Kept Flutter-specific dependencies scoped to Flutter example apps.
- Backend selection safety and status accuracy:
- Added strict CPU-mode behavior in native backend preparation so preferredBackend: cpu no longer initializes optional GPU backends during startup/model load probing.
- Disabled context-time GPU offload knobs (offload_kqv, op_offload, flash-attention auto path) when effective GPU layers resolve to zero, preventing GPU allocation attempts during context creation in CPU mode.
- Added ModelParams.batchSize (n_batch) and ModelParams.microBatchSize (n_ubatch) so context batch sizing can be tuned independently from contextSize while preserving legacy defaults.
- Split backend reporting into two semantics: selectable backend options (getAvailableBackends) vs active runtime backend (getBackendName).
- Added optional BackendAvailability capability and LlamaEngine.getAvailableBackends() to support safe settings UIs without forcing GPU initialization.
- Added optional BackendRuntimeDiagnostics capability and LlamaEngine.getResolvedGpuLayers() to expose resolved native load-time layer count for runtime diagnostics.
- Updated example/chat_app to populate backend selector options from safe availability discovery while keeping active-backend status bound to effective runtime backend.
- Improved native auto/explicit backend status resolution to avoid false CPU labeling on Apple consolidated runtimes and false GPU labeling when explicit backend falls back.
- Web model cache + large-model UX improvements (chat app):
- Updated web Download flow to prefetch model/mmproj bytes into browser Cache Storage with live progress and cancellation support.
- Added best-effort cache eviction for web model delete actions.
- Added large-model web load fallback to fetch-backed worker runtime path (bridge) to reduce contiguous ArrayBuffer pressure.
- Added dedicated web bridge worker entry wiring and worker fallback diagnostics to improve worker startup reliability.
- Reduced synthetic load-progress dominance so bridge/network progress appears earlier during web model load.
- Added warning-only UI guidance for very large web models that may exceed browser memory limits at load time.
- Web model-load resilience:
- Updated WebGpuLlamaBackend to retry web model loads with reduced context sizes (and CPU fallback as last attempt) when bridge errors indicate browser memory pressure.
- Added bridge config plumbing for optional wasm64 core assets (llama_webgpu_core_mem64) with automatic fallback to wasm32 when unsupported.
- Added explicit runtime diagnostics and error normalization for worker-thread and cross-origin-isolation requirements in large web model load flows.
- Updated default bridge asset pinning in chat app/docs/fetch script to leehack/llama-web-bridge-assets@v0.1.5.
- Updated HF static chat-app deploy workflow to emit COI custom_headers in generated Space README frontmatter.
- Android arm64 CPU variant policy and loader hardening:
- Updated native hook tag pin from b8138 to b8157 to consume Android arm64 CPU-variant runtime bundles.
- Added Android arm64 CPU policy keys in hook config: cpu_profile (full default, compact) and advanced cpu_variants override.
- Added hook tests and Android hook integration coverage to verify pubspec-driven CPU variant packaging behavior.
- Hardened Android runtime backend loading to resolve CPU variant modules even when backend module directory discovery is unavailable.
- Added Android runtime smoke helper (scripts/android_runtime_smoke.sh) and smoke-plan docs for device verification.
- Compatibility note: no public API breaking changes. android-arm64 now defaults to cpu_profile: full, which may increase package size compared with baseline-only CPU packaging.
0.6.3 #
- Native runtime sync (llama.cpp b8138):
- Synced bundled native runtime/assets and regenerated bindings from b8099 to b8138.
- Pulled in Android arm64 ISA compatibility hardening (including STLUR guard changes) to prevent launch-time crashes on older devices.
- Example app performance and UX polish:
- Reduced settings-write overhead during frequent parameter adjustments.
- Improved model manager responsiveness during download progress updates.
- Smoothed chat streaming auto-follow and rendering to reduce unnecessary UI work.
- Web model handling improvements:
- Updated web "Download" behavior to verify remote model/mmproj availability without pre-buffering large GGUF payloads in app memory.
- Clarified that web cache population occurs when a model is first loaded.
- Stability and quality:
- Added safe fallback handling for invalid persisted log-level settings.
- Added regression tests for persisted settings fallback behavior.
- New example app:
- Added example/tui_coding_agent, a nocterm-based terminal coding agent with tool-calling loop, workspace-scoped file/command tools, and runtime model switching.
- Default model source is GLM 4.7 Flash (unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL) with support for custom local paths/URLs/Hugging Face shorthand.
- Added stable text-protocol tool mode as the default (native template grammar tool-calling remains available via --native-tool-calling for experimentation).
0.6.2 #
- Native inference performance improvements:
- Reduced request overhead by caching model metadata and skipping unnecessary prompt token counting in create(...).
- Improved native stream throughput with worker-side token chunk batching and configurable thresholds (streamBatchTokenThreshold, streamBatchByteThreshold).
- Added prompt-prefix reuse for native text generation (reusePromptPrefix, enabled by default) with conservative full-replay fallback to preserve deterministic parity.
- Optimized ChatSession context trimming using bounded turn-offset search to avoid repeated linear recount loops on long histories.
- Benchmarking and parity tooling:
- Added tool/testing/native_inference_benchmark.dart for TTFT, throughput, and latency measurement with tunable generation settings.
- Added tool/testing/native_prompt_reuse_parity.dart and curated prompt sets for deterministic prompt-reuse parity validation.
- Added CI prompt-reuse parity checks to catch native reuse regressions.
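Prompt-prefix reuse generally rests on one idea: find how many leading tokens of the new prompt already match what was evaluated last time, and evaluate only the remainder. The sketch below shows that idea with made-up token IDs. The function names are hypothetical, and the package's actual reuse conditions and full-replay fallback are not described in this changelog and may differ. Re-evaluating the final token when the whole prompt matches is a choice of this sketch so that fresh logits exist for sampling.

```dart
/// Number of leading token IDs shared by [cached] and [prompt].
int commonPrefixLength(List<int> cached, List<int> prompt) {
  final limit = cached.length < prompt.length ? cached.length : prompt.length;
  var i = 0;
  while (i < limit && cached[i] == prompt[i]) {
    i++;
  }
  return i;
}

/// Tokens that still need evaluating after reusing the shared prefix.
/// Always keeps at least the last prompt token.
List<int> tokensToEvaluate(List<int> cached, List<int> prompt) {
  if (prompt.isEmpty) return const [];
  var reuse = commonPrefixLength(cached, prompt);
  if (reuse == prompt.length) reuse = prompt.length - 1;
  return prompt.sublist(reuse);
}

void main() {
  final cached = [1, 15, 27, 99, 4]; // tokens evaluated for the last request
  final prompt = [1, 15, 27, 42, 8, 3]; // tokens of the new request
  print(commonPrefixLength(cached, prompt)); // 3
  print(tokensToEvaluate(cached, prompt)); // [42, 8, 3]
}
```

In a chat workload the shared prefix is usually the system prompt plus earlier turns, which is why skipping it reduces time to first token.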
0.6.1 #
- Publishing compatibility fix:
- Moved hook backend-config support code out of hook/src/ into lib/src/hook/ because pub.dev currently only allows hook/build.dart under hook files.
- Updated hook/test imports accordingly to keep native-assets backend selection behavior unchanged.
- llama.cpp parity expansion (Dart-native template/parser pipeline):
- Reworked template detection/render/parse routing to align with llama.cpp semantics across supported chat formats, including format-specific tool-call parsing and fallback behavior.
- Added PEG parity components in Dart (peg_parser_builder, peg_chat_parser) and integrated parser-carrying render/parse flow for PEG-native/constructed formats.
- Removed brittle fallback coercions that could mutate valid tool names/argument keys, preserving model-emitted tool payloads for dispatch parity.
- Hardened template capability detection with Jinja AST + execution probing, while preventing typed-content false positives caused by raw content stringification.
- [BREAKING] Removed legacy custom template-handler APIs: ChatTemplateMatcher, ChatTemplateRoutingContext, ChatTemplateEngine.registerHandler(...), ChatTemplateEngine.unregisterHandler(...), ChatTemplateEngine.clearCustomHandlers(...), ChatTemplateEngine.registerTemplateOverride(...), ChatTemplateEngine.unregisterTemplateOverride(...), ChatTemplateEngine.clearTemplateOverrides(...), and per-call customHandlerId / parse handlerId routing.
- Removed silent render/parse fallback paths so handler/parser failures are surfaced instead of downgraded to content-only output.
- Added llama.cpp-equivalent per-call template globals/time injection via chatTemplateKwargs and templateNow.
- Parity test coverage and tooling:
- Added vendored llama.cpp template parity integration coverage for detection + render + parse paths.
- Added upstream llama.cpp chat/template suite runners and local E2E harness (run_llama_cpp_chat_tests.sh, run_template_parity_suites.sh).
- Added mirrored unit tests for new internal template components (peg_parser_builder, template_internal_metadata) to satisfy structure guards.
- Test cleanup and maintainability:
- Reduced noisy diagnostics in template integration tests and centralized format sample parse payload fixtures for easier parity maintenance.
- Native integration cleanup (llamadart-native migration):
- Added tool/testing/prepare_llama_cpp_source.sh to fetch/refresh ggml-org/llama.cpp into .dart_tool/llama_cpp (or LLAMA_CPP_SOURCE_DIR) pinned to a resolved ref (LLAMA_CPP_REF, default latest release tag).
- Updated tool/testing/run_llama_cpp_chat_tests.sh to use prepared .dart_tool source instead of third_party/llama_cpp, so local upstream chat-suite runs no longer depend on vendored source.
- Updated template parity tests to resolve fixtures from LLAMA_CPP_TEMPLATES_DIR or .dart_tool/llama_cpp/models/templates instead of third_party/llama_cpp.
- Clarified README backend matrix notes: KleidiAI/ZenDNN are CPU-path optimizations, not selectable runtime backend modules.
- Runtime backend probing for split-module bundles now runs during backend initialization (not only after first model load), so device/backend availability is visible earlier in app flows.
- Native-assets hook output now refreshes emitted native files per build to prevent stale backend module carryover when backend config changes.
- Linux runtime/link validation and backend loader hardening:
- Hardened split-module backend loading to avoid probing backends that are not bundled for the active platform/arch, reducing noisy optional-backend load failures.
- Added failed-backend memoization so missing optional modules are not retried on every model load.
- Tightened Linux cache source selection to the current ABI bundle (linux-arm64 vs linux-x64) when preparing runtime dependencies.
- Added Linux backend/runtime setup guidance in README, including distro-specific package baselines (Ubuntu/Debian, Fedora/RHEL/CentOS, Arch).
- Added reproducible Docker link-check flows for baseline (cpu/vulkan/blas) and optional cuda/hip module dependency resolution.
- Added scripts/check_native_link_deps.sh helper plus dedicated validation images: docker/validation/Dockerfile.cuda-linkcheck and docker/validation/Dockerfile.hip-linkcheck.
- Chat example backend UX cleanup:
- Removed user-facing Auto backend option from settings; only concrete runtime-detected backends are shown.
- Added migration behavior that resolves legacy saved Auto preference to the best detected backend at runtime.
0.5.4 #
- llama.cpp parity hardening:
- ChatTemplateEngine now preserves handler-provided tokens even when grammar is attached via params, avoiding token-loss regressions in tool/thinking formats.
- Native stop-sequence handling now skips preserved tokens so parser-critical markers are not terminated early.
- Generic tool-instruction system injection now follows llama.cpp semantics more closely (replace first system content when supported, otherwise prepend to first message content).
- LFM2 output parsing now extracts reasoning more consistently across tool and non-tool output shapes.
- Chat example loop/lifecycle hardening:
- Improved tool-loop guards (first-turn force-only behavior, duplicate/equivalent call suppression, per-tool budget, and loop-stop messaging).
- Added response fallback that can ground final answers from recent tool results when the model emits stale real-time disclaimers.
- Added assistant debug badges (fmt:*, think:*, content:json, fallback:tool-result) and strengthened detach/exit disposal paths.
- Parity/integration test robustness:
- tool_calling_integration_test now accepts both structured tool_calls deltas and XML-style <tool_call> payloads.
- llama.cpp template-detection integration expectations were updated for current Ministral-family routing outcomes.
- Documentation updates:
- Clarified chat app behavior when models return JSON-shaped assistant content (for example {"response":"..."}) and documented content:json diagnostics.
- Documented example server sampling defaults (penalty=1.0, top_p=0.95, min_p=0.05) and added a CLI README batch parity-matrix usage example.
- Chat app backend/status fixes:
- Backend switching now preserves configured gpuLayers while still allowing load-time CPU enforcement.
- Runtime backend labeling and GPU activity diagnostics now follow effective user selection, preventing false "VULKAN active" status when CPU mode is selected.
- Context size auto mode:
- Restored support for Context Size: Auto by preserving 0 in persisted settings and passing auto behavior through to session context-limit resolution.
- Tool-call parsing fixes (Hermes):
- Introduced staged double-brace recovery: parse as-is first, unwrap one outer {{...}} layer second, and only fall back to full _normalizeDoubleBraces when all braces are consistently doubled.
- Added a consistency gate to _normalizeDoubleBraces that bails out on mixed single/double brace payloads to prevent corruption of valid nested JSON.
- Tool-call parsing fixes (Magistral):
- Broadened whitespace skipping in _extractJsonObject to handle \n, \r, and \t between [ARGS] and the JSON body.
- Example app (basic_app):
- Replaced toList() buffering with await for streaming for real-time token yield.
- Added tools parameter to every follow-up create() call and bounded tool-execution loop with _maxToolRounds = 10.
- Test coverage:
- Added chat app regression tests for backend switching behavior and context-size auto persistence.
- Added regression tests for Hermes wrapped+nested double-brace payloads and Magistral [ARGS] with newline/nested arguments.
- Example rename (server):
- Renamed example/api_server to example/llamadart_server.
- Renamed the example package/bin entrypoint to llamadart_server.
- Updated llama.cpp tool-call parity defaults/docs to target example/llamadart_server.
- GLM 4.5 template parity:
- Added XML tool-call grammar generation for <tool_call> payloads with <arg_key>/<arg_value> pairs.
- Added GLM-specific preserved tokens and <|user|> stop handling for tool-call flows.
- Updated parser extraction to handle GLM XML tool calls from assistant content and reasoning blocks.
- Template/native runtime fixes:
- Typed-content template rendering now activates only when messages actually include media parts.
- Native context reset now clears llama memory in-place instead of reinitializing the context.
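The staged double-brace recovery described for the Hermes parser can be sketched as three attempts in order of increasing invasiveness. This is an illustrative reimplementation under stated assumptions, not the package's code: recoverToolCallJson and its helpers are hypothetical names, and the consistency check here is deliberately simple (it does not account for braces inside JSON string values).

```dart
import 'dart:convert';

/// Tries to decode [text] as a JSON object; returns null on failure.
Map<String, dynamic>? _tryDecodeObject(String text) {
  try {
    final decoded = jsonDecode(text);
    return decoded is Map<String, dynamic> ? decoded : null;
  } on FormatException {
    return null;
  }
}

/// True when every brace in [text] belongs to a doubled pair.
bool _bracesConsistentlyDoubled(String text) {
  final residue = text.replaceAll('{{', '').replaceAll('}}', '');
  return !residue.contains('{') && !residue.contains('}');
}

/// Staged recovery: as-is, then one outer layer unwrapped, then full
/// normalization only if the consistency gate passes.
Map<String, dynamic>? recoverToolCallJson(String raw) {
  final text = raw.trim();

  final asIs = _tryDecodeObject(text);
  if (asIs != null) return asIs;

  if (text.startsWith('{{') && text.endsWith('}}')) {
    final unwrapped = _tryDecodeObject(text.substring(1, text.length - 1));
    if (unwrapped != null) return unwrapped;
  }

  if (!_bracesConsistentlyDoubled(text)) return null; // mixed: bail out
  return _tryDecodeObject(
    text.replaceAll('{{', '{').replaceAll('}}', '}'),
  );
}

void main() {
  // Outer layer doubled only: recovered by the unwrap stage.
  print(recoverToolCallJson('{{"name":"f","arguments":{"x":1}}}'));
  // Every brace doubled: recovered by full normalization.
  print(recoverToolCallJson('{{"name":"f","arguments":{{"x":1}}}}'));
  // Mixed single/double braces: rejected instead of being rewritten.
  print(recoverToolCallJson('{{"a":{"b":1}}'));
}
```

Trying the least destructive interpretation first matters because valid nested JSON legitimately ends in }}, so a blanket replace would corrupt payloads that were already well formed.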
0.5.3 #
- Sampling controls:
- Added minP to GenerationParams with a default value of 0.0 and copyWith support.
- Native backend parity:
- Added optional llama.cpp min_p sampler initialization in LlamaCppService when minP > 0.
- Test coverage:
- Added unit coverage for GenerationParams.minP default and copyWith behavior.
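For readers unfamiliar with min-p sampling, the general technique keeps only candidates whose probability is at least minP times the probability of the most likely token, which is why a value of 0.0 disables the filter. The sketch below shows that rule on a toy distribution. applyMinP is a hypothetical helper for illustration and does not mirror the llama.cpp sampler's internals or the package API.

```dart
/// Keeps candidates whose probability is at least minP times the highest
/// probability, then renormalizes the survivors so they sum to 1.
Map<String, double> applyMinP(Map<String, double> probs, double minP) {
  if (minP <= 0 || probs.isEmpty) return Map.of(probs); // filter disabled
  final maxProb = probs.values.reduce((a, b) => a > b ? a : b);
  final threshold = minP * maxProb;
  final kept = Map.fromEntries(
    probs.entries.where((e) => e.value >= threshold),
  );
  final total = kept.values.fold<double>(0.0, (sum, p) => sum + p);
  return kept.map((token, p) => MapEntry(token, p / total));
}

void main() {
  // Toy next-token distribution (made-up values).
  final probs = {'the': 0.50, 'a': 0.30, 'cat': 0.15, 'zzz': 0.05};
  // With minP = 0.2 the threshold is 0.2 * 0.50 = 0.10, so 'zzz' is dropped.
  print(applyMinP(probs, 0.2).keys.toList());
  // With minP = 0.0 every candidate is kept.
  print(applyMinP(probs, 0.0).keys.toList());
}
```

Because the threshold scales with the top probability, the filter is strict when the model is confident and permissive when the distribution is flat.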
0.5.2 #
- Chat template parity hardening:
- Expanded llama.cpp parity across additional format handlers, including grammar construction, lazy-grammar triggers, preserved tokens, and parser behavior for tool-call payload extraction.
- Added shared ToolCallGrammarUtils helpers for wrapped object/array tool-call grammar generation and root-rule wrapping.
- Crash fix (grammar parsing):
- Fixed malformed GBNF escaping in Hermes/Command-R string rules that could cause runtime llama_grammar_init_impl parse failures during tool-calling generations.
- Test coverage expansion:
- Added and expanded handler-level parity tests (Apertus, LFM2, Nemotron V2, Magistral, Seed-OSS, Xiaomi MiMo, DeepSeek R1/V3, Hermes) and mirrored unit tests for new grammar utilities.
0.5.1 #
- Documentation fixes:
- Updated README internal links to absolute GitHub URLs so they resolve reliably on pub.dev.
- Updated release/migration wording after 0.5.0 publication and refreshed installation/version snippets.
- Corrected iOS simulator architecture notes and contributor prerequisites/build target docs.
- Publishing hygiene:
- Expanded .pubignore to exclude local build outputs, large model/test artifacts, and checked-out third_party sources from package uploads.
0.5.0 #
- [BREAKING] Public API Changes:
- Root exports were tightened; previously exposed internals such as ToolRegistry, LlamaTokenizer, and ChatTemplateProcessor are no longer part of the public package API.
- ChatSession now centers on create(...) streaming LlamaCompletionChunk; legacy chat(...)/chatText(...) style usage must migrate.
- LlamaChatMessage constructor names were standardized (.fromText, .withContent) in place of older named constructors.
- Default maxTokens in GenerationParams increased from 512 to 4096.
- LlamaChatMessage.toJson() no longer includes name on tool role messages.
- ModelParams.logLevel was removed; logging control now lives on LlamaEngine via setDartLogLevel(...) and setNativeLogLevel(...).
- LlamaBackend interface changed for custom backend implementers (notably getVramInfo and updated applyChatTemplate).
- Model reload behavior is stricter: loadModel(...) now requires unloading first.
- Migration details are documented in MIGRATION.md.
- Template/Parser Parity Expansion:
- Added llama.cpp-aligned format detection and handlers for additional templates including FireFunction v2, Functionary v3.2, Functionary v3.1 (Llama 3.1), GPT-OSS, Seed-OSS, Nemotron V2, Apertus, Solar Open, EXAONE MoE, Xiaomi MiMo, and TranslateGemma.
- Improved parser parity for format-specific tool-calling and reasoning extraction, including `<|python_tag|>` parsing for Llama 3 flows.
- Narrowed generic grammar auto-application to generic/content-only routing to avoid interfering with format-specific tool schemas.
- Template Extensibility APIs:
- Added global custom handler registration and template override APIs in `ChatTemplateEngine`.
- Added per-call `customTemplate` and `customHandlerId` routing support and threaded handler identity into parse paths.
- Added cookbook examples and regression tests for registration precedence and fallback behavior.
- Logging Controls:
- Added split logging controls in `LlamaEngine`: `setDartLogLevel` and `setNativeLogLevel`, while keeping `setLogLevel` as a convenience method.
- Fixed native `none` log level suppression so llama.cpp/ggml logs are fully muted when requested.
- Chat App Improvements:
- Added model capability badges and per-model generation presets.
- Added template-aware tool enablement guardrails and separate Dart/native log level settings in the UI.
- Test Suite Overhaul:
- Expanded template parity coverage (detection, handlers, grammar, workarounds, registry precedence, and integration scenarios).
- Added additional unit tests for exceptions, logging, and core model definitions.
0.4.0 #
- Cross-Platform Architecture:
- Refactored `LlamaBackend` for strict Web isolation using "Native-First" conditional exports, ensuring native performance and full web safety.
- Standardized backend instantiation via a unified `LlamaBackend()` factory across all examples and scripts.
- Web & Context Stability:
- Resolved "Max Tokens is 0" on Web by implementing `getLoadedContextInfo()` and robust GGUF metadata fallback in `LlamaEngine`.
- Improved numeric metadata extraction on Web for better compatibility with varied GGUF exporters.
- GBNF Grammar Stability:
- Resolved "Unexpected empty grammar stack" crash by reordering the sampler chain (filtering tokens via GBNF before performing probability-based sampling).
- Test Suite Overhaul:
- Pivoted from mock-based unit tests to real-world integration tests using the actual `llama.cpp` native backend.
- Ensured full verification of model loading, tokenization, text generation, and grammar constraints against physical models.
- Multi-Platform Configuration: Introduced `dart_test.yaml` and `@TestOn` tags to enable seamless execution of all tests across VM and Chrome with a single `dart test` command.
- Robust Log Silencing:
- Implemented FD-level redirection (`dup2` to `/dev/null`) for `LlamaLogLevel.none` on native platforms.
- This provides a crash-free alternative to FFI-based log callbacks, which were unstable during low-level native initialization (e.g., Metal).
- Project Hygiene:
- Achieved 100% clean `dart analyze` across the core library and all example applications.
- Replaced legacy stubs in the chat application with a clean, interface-based `ModelService` architecture.
- Resumable Downloads:
- Implemented robust resumable downloads for large models using HTTP Range requests.
- Added persistent `.meta` files to track download progress across app restarts.
- Enhanced Download UI:
- Refined the `ModelCard` with a visual Pause/Resume toggle.
- Added a Trash icon in the card header for full cancellation and data discard of active or partial downloads.
- Improved progress feedback with clear "Paused" and "Downloading" states.
- Multimodal Support (Vision & Audio): Integrated the experimental `mtmd` module from `llama.cpp` for native platforms.
- Added `loadMultimodalProjector` to `LlamaEngine`.
- Introduced `LlamaChatMessage.withContent` and `LlamaContentPart` (Text, Image, Audio).
- Fix: Resolved missing multimodal symbols in native builds by properly linking the `mtmd` module.
- Moondream 2 & Phi-2 Optimization:
- Implemented a specialized `Question: / Answer:` chat template fallback for Moondream models.
- Added dynamic BOS token handling: Automatically disables BOS injection for models where BOS == EOS (like Moondream) to prevent immediate "End of Generation".
- Chat API Consolidation:
- Moved high-level `chat()` and `chatWithTools()` logic from `LlamaEngine` to `ChatSession`.
- `LlamaEngine` is now a dedicated low-level orchestrator for model loading, tokenization, and raw inference.
- Intelligent Tool Flow:
- Optional Tool Calls: Tools are no longer forced by default. The model now decides when to use a tool vs. responding directly based on context.
- Final Response Generation: After a tool returns a result, the model now generates a natural language response (without grammar constraints) to interpret the result for the user.
- forceToolCall: Added a session-level flag to re-enable strict grammar-constrained tool calls for smaller models (e.g., 0.5B - 1B).
- App Stability & Resources:
- Fixed a crash in the Flutter chat app during close/restart by implementing and using an idempotent `dispose()` in `ChatService`.
- Added Qwen 2.5 3B and 7B models to the download list with clear RAM/VRAM requirements for testing complex instruction following and tool use.
- ChatSession Manager: Introduced a new high-level `ChatSession` class to automatically manage conversation history and system prompts.
- Context Window Management: `ChatSession` now implements an automated sliding window to truncate history when the model's context limit is approached.
- Windows Robustness:
- Improved export management for MSVC to ensure symbol visibility.
- Added Sccache support for Windows builds to significantly improve CI performance.
- Automated Lifecycle:
- Implemented GitHub Actions to automate `llama.cpp` updates, regression testing, and release artifact generation.
- [BREAKING] API Changes:
- `LlamaChatMessage.role` now returns a `LlamaChatRole` enum instead of a `String`. All manual role string comparisons should be updated to use the enum.
- [DEPRECATED] API Changes:
- Default `LlamaChatMessage` constructor (string-based) is now deprecated; use `.fromText()` or `.withContent()` instead.
- `LlamaChatMessage.roleString` is deprecated and will be removed in v1.0.
- Engine Upgrades: Upgraded core `llama.cpp` to tag `b7898`.
- Robust Media Loading: Support for loading images and audio via both file paths and raw byte buffers.
- Bug Fixes: Improved native resource cleanup and fixed potential null-pointer crashes in the multimodal pipeline.
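
The sampler-chain reordering noted under GBNF Grammar Stability in this release can be illustrated with a small standalone sketch. It assumes a simplified view in which the grammar exposes a set of currently allowed tokens. The helpers `applyGrammarMask` and `sampleGreedy` are hypothetical and do not mirror the llama.cpp sampler API; greedy selection stands in for any probability-based sampler so the result is deterministic.

```dart
/// Illustrative only: drops every candidate the grammar does not allow.
Map<String, double> applyGrammarMask(
  Map<String, double> candidates,
  Set<String> allowed,
) {
  return Map.fromEntries(
    candidates.entries.where((e) => allowed.contains(e.key)),
  );
}

/// Picks the highest-probability candidate, or null if none remain.
String? sampleGreedy(Map<String, double> candidates) {
  if (candidates.isEmpty) return null;
  return candidates.entries.reduce((a, b) => a.value >= b.value ? a : b).key;
}

void main() {
  final candidates = {'hello': 0.6, '{': 0.3, '[': 0.1};
  final allowed = {'{', '['}; // e.g. a grammar expecting an opening token

  // Grammar first, then sampling: the chosen token is always grammar-valid.
  final token = sampleGreedy(applyGrammarMask(candidates, allowed));
  print(token); // {

  // Sampling first can choose a token the grammar would have to reject.
  final unconstrained = sampleGreedy(candidates);
  print(allowed.contains(unconstrained)); // false
}
```

Filtering before sampling means the probability-based step only ever sees tokens the grammar can accept, which is one plausible reading of why the reordering avoids the grammar reaching an invalid state.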
0.3.0 #
- [BREAKING] Removal of `LlamaService`: The legacy `LlamaService` facade has been removed. Use `LlamaEngine` with `LlamaBackend()` instead for all platforms.
- LoRA Support: Added full support for Low-Rank Adaptation (LoRA) on all native platforms (iOS, Android, macOS, Linux, Windows).
- Web Improvements: Significantly enhanced the web implementation using `wllama` v2 features, including native chat templating and threading info.
- Logging Refactor: Implemented a unified logging architecture.
- Native Platforms: Simplified to an on/off toggle to ensure stability. `LlamaLogLevel.none` suppresses all output; other levels enable default stderr logging.
- Web: Supports full granular filtering (Debug, Info, Warn, Error).
- Stability Fixes: Resolved frequent "Cannot invoke native callback from a leaf call" crashes during Flutter Hot Restarts by refactoring native resource lifecycle.
- Improved Lifecycle: Removed `NativeFinalizer` dependency to avoid race conditions. Explicitly call `dispose()` to release native resources.
- Robust Loading: Improved model loading on all platforms with better instance cleanup, script injection, and URL-based loading support.
- Dynamic Adapters: Implemented APIs to dynamically add, update scale, or remove LoRA adapters at runtime.
- LoRA Training Pipeline: Added a comprehensive Jupyter Notebook for fine-tuning models and converting adapters to GGUF format.
- API Enhancements: Updated `ModelParams` to include initial LoRA configurations and introduced `supportsUrlLoading` for better platform abstraction.
- CLI Tooling: Updated the `basic_app` example to support testing LoRA adapters via the `--lora` flag.
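
Because this release asks callers to release native resources explicitly, a `try`/`finally` pattern with an idempotent `dispose()` is one reasonable way to structure calling code. The sketch below uses a hypothetical `FakeNativeResource` class as a stand-in; it is not a llamadart type and only illustrates the pattern.

```dart
/// Hypothetical stand-in for an object that owns native memory.
/// It is not a llamadart class; it only illustrates the pattern.
class FakeNativeResource {
  bool _disposed = false;
  int releaseCount = 0;

  bool get isDisposed => _disposed;

  /// Idempotent: repeated calls release the resource only once.
  void dispose() {
    if (_disposed) return;
    _disposed = true;
    releaseCount++; // a real implementation would free native handles here
  }
}

void main() {
  final resource = FakeNativeResource();
  try {
    // ... use the resource ...
  } finally {
    // Runs even if the body throws, so the resource is not leaked.
    resource.dispose();
  }
  resource.dispose(); // safe no-op on an already-disposed resource
  print(resource.releaseCount); // 1
}
```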
0.2.0+b7883 #
- Project Rebrand: Renamed package from `llama_dart` to `llamadart`.
- Pure Native Assets: Migrated to the modern Dart Native Assets mechanism (`hook/build.dart`).
- Zero Setup: Native binaries are now automatically downloaded and bundled at runtime based on the target platform and architecture.
- Version Alignment: Aligned package versioning and binary distribution with `llama.cpp` release tags (starting with `b7883`).
- Logging Control: Implemented comprehensive logging interception for both `llama` and `ggml` backends with configurable log levels.
llamaandggmlbackends with configurable log levels. - Performance Optimization: Added token caching to message processing, significantly reducing latency in long conversations.
- Architecture Overhaul:
- Refactored Flutter Chat Example into a clean, layered architecture (Models, Services, Providers, Widgets).
- Rebuilt CLI Basic Example into a robust conversation tool with interactive and single-response modes.
- Cross-Platform GPU: Verified and improved hardware acceleration on macOS/iOS (Metal) and Android/Linux/Windows (Vulkan).
- New Build System: Consolidated all native source and build infrastructure into a unified `third_party/` directory.
- Windows Support: Added robust MinGW + Vulkan cross-compilation pipeline.
- UI Enhancements: Added fine-grained rebuilds using Selectors and isolated painting with RepaintBoundaries.
0.1.0 #
- WASM Support: Full support for running the Flutter app and LLM inference in WASM on the web.
- Performance Improvements: Optimized memory usage and loading times for web models.
- Enhanced Web Interop: Improved `wllama` integration with better error handling and progress reporting.
- Bug Fixes: Resolved minor UI issues on mobile and web layouts.
0.0.1 #
- Initial release.
- Supported platforms: iOS, macOS, Android, Linux, Windows, Web.
- Features:
- Text generation with `llama.cpp` backend.
- GGUF model support.
- Hardware acceleration (Metal, Vulkan).
- Flutter Chat Example.
- CLI Basic Example.