Open work items for this repo. Cross-cutting tracking lives in
../workspace/crossrepostatus.md;
items here are jllama-specific or are this repo's slice of a
cross-cutting initiative.
The PIT mutation gate reaches 100% only when the audio test fixture is present. Without it the
run is 98%: 4 NO_COVERAGE mutants in value.ContentPart.audioFile(Path) (the null file-name
guard, the .wav/.mp3 extension dispatch, and Files.readAllBytes). The only test that exercises
that method is AudioInputIntegrationTest, which is model-/fixture-gated and self-skips (Assume)
when no audio clip is supplied (net.ladenthin.llama.audio.input — no committed default). So any
environment lacking the clip (e.g. a network-restricted sandbox) reds the gate. Fix: add a hermetic
temp-file unit test for audioFile(Path) — write a few bytes to a @TempDir *.wav / *.mp3 and
assert the format dispatch — mirroring the existing imageFile(Path) temp-file tests (PNG/JPG/GIF/
WEBP), which already make the image path hermetic. See
../workspace/policies/pit-mutation-testing.md §4.
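A minimal sketch of the hermetic temp-file approach described above. It is not the project's test: `audioFormatFor` is a hypothetical stand-in for the extension dispatch inside audioFile(Path), and a plain temp directory plays the role of JUnit's @TempDir so the example stays self-contained.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

public final class AudioFileDispatchSketch {

    /** Hypothetical stand-in for the null-guard + extension dispatch in audioFile(Path). */
    public static String audioFormatFor(Path file) {
        Path name = file.getFileName();            // may be null, e.g. for a root path
        if (name == null) {
            throw new IllegalArgumentException("path has no file name: " + file);
        }
        String lower = name.toString().toLowerCase(Locale.ROOT);
        if (lower.endsWith(".wav")) return "wav";
        if (lower.endsWith(".mp3")) return "mp3";
        throw new IllegalArgumentException("unsupported audio extension: " + name);
    }

    /** Writes a few bytes to a temp file, reads them back, and returns the detected format. */
    public static String writeAndDetect(String fileName) {
        try {
            Path dir = Files.createTempDirectory("audio-sketch");  // role of @TempDir
            Path clip = dir.resolve(fileName);
            try {
                Files.write(clip, new byte[] {1, 2, 3});
                byte[] readBack = Files.readAllBytes(clip);         // exercises the read step
                if (readBack.length != 3) {
                    throw new IllegalStateException("unexpected length " + readBack.length);
                }
                return audioFormatFor(clip);
            } finally {
                Files.deleteIfExists(clip);
                Files.deleteIfExists(dir);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the clip is created by the test itself, nothing depends on a model or a committed fixture, so the mutants in the dispatch and read steps are covered in any environment.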
A multi-area audit (2026-06-20) of the existing codebase surfaced 18 correctness/safety findings,
intentionally split into tiers so each could land as its own small, focused PR. All 18 are now
fixed and merged — Tiers 1–3 in #258, the deferred LlamaLoader extraction race in #260 —
with regression tests added in #261 / #262. The full per-finding rationale lives in those PRs and
their commits; the concise record below is kept for traceability. Nothing in this section is open
except the optional follow-up noted at the end.
LlamaLoader native-lib extraction temp-path race — DONE (atomic write + content-reuse).
extractFile now (1) reuses a byte-identical existing copy instead of rewriting it — so it never
replaces a file another JVM has already loaded (which fails on Windows) — and (2) otherwise extracts
to a per-attempt unique temp file and atomically moves it into place, so a concurrent loader can
never observe a half-written library. jllama is statically linked (BUILD_SHARED_LIBS OFF), so the
extracted file is self-contained — no multi-DLL co-location to coordinate. Verified by the
NativeLibraryLoadSmokeTest (real extract+load on macOS) + a resourceMatchesFile unit test; the
Windows locked-replace path is exercised by CI's Windows jobs.
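The two-step behaviour described above (reuse a byte-identical copy, otherwise write to a unique temp file and move it into place atomically) can be sketched with the JDK alone. This is an illustration under assumptions, not the actual extractFile: the class and method names are hypothetical, and the resource is passed in as a byte array for simplicity.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;

public final class AtomicExtractSketch {

    /** Returns true if a new file was written, false if an identical copy was reused. */
    public static boolean extract(byte[] resource, Path target) {
        try {
            // (1) Reuse: never rewrite a file that already has the right content,
            //     since another process may have it loaded.
            if (Files.isRegularFile(target)
                    && Arrays.equals(resource, Files.readAllBytes(target))) {
                return false;
            }
            // (2) Write to a unique temp file in the SAME directory, so the final
            //     move stays on one file system and can be atomic.
            Path dir = target.toAbsolutePath().getParent();
            Path tmp = Files.createTempFile(dir, target.getFileName().toString(), ".tmp");
            try {
                Files.write(tmp, resource);
                // Readers see either the old file or the complete new one.
                // Whether an existing target is replaced is implementation-specific.
                Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
            } finally {
                Files.deleteIfExists(tmp);   // no-op after a successful move
            }
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Creating the temp file next to the target rather than in the system temp directory is the design point: an atomic move is generally only possible within a single file system.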
Tier 1 — high impact (#258, a4325ff)
- N1: unhandled C++ exceptions crossing the JNI boundary → JVM abort; every entry point (incl. the public LlamaModel.jsonSchemaToGrammar, plus encode/tokenize/embeddings/rerank/infill/applyTemplate) now converts the failure to a LlamaException instead of crashing the process.
- N2: parse_string_array null-deref (null element / OOM) + per-iteration JNI local-ref leak.
- J1: close() / native delete() double-free under concurrent close → synchronized close.
- P1: ServerMetrics cumulative token totals truncated int → negative → Timings.promptN/predictedN widened to long.
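To make the P1 failure mode concrete, the following sketch shows how narrowing a cumulative counter to int turns a large total negative. The class and method names are illustrative only and do not exist in the codebase.

```java
public final class TokenCounterWidthSketch {

    /** Narrowing keeps only the low 32 bits, so large totals wrap to negative values. */
    public static int truncatedTotal(long cumulativeTokens) {
        return (int) cumulativeTokens;
    }

    /** For call sites that must stay int: fail loudly instead of wrapping silently. */
    public static int checkedTotal(long cumulativeTokens) {
        return Math.toIntExact(cumulativeTokens);  // throws ArithmeticException on overflow
    }

    public static void main(String[] args) {
        System.out.println(truncatedTotal(3_000_000_000L)); // prints -1294967296
    }
}
```

Widening the field to long, as P1 did, removes the wrap entirely; `Math.toIntExact` is only a fallback where an int is unavoidable.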
Tier 2 — medium (#258, a4325ff + 3e500aa)
- S1: unbounded request body → OOM DoS → 16 MiB cap + Content-Length pre-check; oversized → HTTP 413.
- N3: streaming-reader use-after-free → reader held as a shared_ptr and copied out under the lock before next().
- J5: Session permanently wedged on an abandoned stream → Session.cancelStream() clears the guard and rolls back the pending user turn.
- J3: LlamaIterator.hasNext made volatile (observed across a cross-thread cancel()).
- N4: log callback made noexcept + non-throwing env lookup (no exception unwinds through llama.cpp C frames from an unattached thread).
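The S1 fix combines a declared-length pre-check with a hard cap while reading. A minimal sketch of that shape, assuming hypothetical names (`BoundedBodySketch`, `BodyTooLargeException`); the real handler would map the exception to HTTP 413.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public final class BoundedBodySketch {

    public static final int MAX_BODY_BYTES = 16 * 1024 * 1024; // 16 MiB, as in S1

    /** Signals an oversized body; a server would translate this to HTTP 413. */
    public static final class BodyTooLargeException extends RuntimeException {
        public BodyTooLargeException(String message) { super(message); }
    }

    /**
     * Reads at most maxBytes. declaredLength is the Content-Length value,
     * or -1 when the header is absent (e.g. chunked transfer).
     */
    public static byte[] readBody(InputStream in, long declaredLength, int maxBytes) {
        if (declaredLength > maxBytes) {               // pre-check: reject before buffering
            throw new BodyTooLargeException("declared " + declaredLength + " bytes");
        }
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            long total = 0;
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
                if (total > maxBytes) {                // cap also covers a missing or false header
                    throw new BodyTooLargeException("body exceeds " + maxBytes + " bytes");
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The pre-check alone is not enough, because a client can omit or understate Content-Length; the running total is what actually bounds memory.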
Tier 3 — hardening (#258, ac3ad6d)
- S3: constant-time bearer-key comparison (MessageDigest.isEqual).
- S2: SSE heartbeat pool sized to the core count (one stalled client can't starve other streams).
- P3: ChatMessage.toolCalls defensively copied + wrapped unmodifiable.
- NaN/Inf: non-finite float/double rejected at JsonParameters.withScalar (they would serialize to the invalid JSON tokens NaN/Infinity).
- OSInfo: armhf-detection exec() routed through a drain-and-close helper (no fd leak / pipe-full hang).
- completeBatch: completeBatch/completeBatchWithStats/chatBatch join every future before propagating the first failure (no abandoned in-flight requests).
- Docs: /props + Ollama discovery routes documented as intentionally-unauthenticated metadata; parseProbabilities documented as last-wins on duplicate token text (use parseLogprobs for lossless data).
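As an illustration of S3, a constant-time bearer-key check can be built on `MessageDigest.isEqual`, which compares byte arrays without short-circuiting on the first mismatch. The surrounding class, method name, and header parsing below are assumptions for the sketch, not the server's actual code.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public final class BearerKeyCheckSketch {

    private static final String PREFIX = "Bearer ";

    /** Returns true only if the Authorization header carries exactly the expected key. */
    public static boolean matches(String expectedKey, String authorizationHeader) {
        if (authorizationHeader == null || !authorizationHeader.startsWith(PREFIX)) {
            return false;
        }
        byte[] presented = authorizationHeader.substring(PREFIX.length())
                .getBytes(StandardCharsets.UTF_8);
        byte[] expected = expectedKey.getBytes(StandardCharsets.UTF_8);
        // Unlike String.equals, this does not return early at the first differing byte,
        // so response timing does not reveal how much of the key was guessed correctly.
        return MessageDigest.isEqual(expected, presented);
    }
}
```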
Still open, optional follow-up (lower priority): full per-process extraction directory isolation plus a cleanup() that recursively removes dead-process dirs. Now that writes are atomic and content-checked this is a tidiness improvement (stops the shared-tmpdir cleanup() racing a live peer's flat file), not a correctness fix, and it still needs the Windows locked-file co-design noted above.
net.ladenthin.llama.server.OpenAiCompatServer is the single OpenAI-compatible server (JDK
com.sun.net.httpserver, no new dependency, fat-jar Main-Class). It exposes the OpenAI routes
POST /v1/chat/completions (streaming SSE + non-streaming), /v1/completions, /v1/embeddings,
/v1/rerank, /infill, GET /v1/models, GET /health and GET /props, plus three alternative
protocol surfaces — Ollama-native (/api/version, /api/tags, /api/show, /api/chat,
/api/generate), Anthropic Messages (POST /v1/messages) and OpenAI Responses (POST /v1/responses).
Every route is also reachable without the /v1 prefix and sits behind a CORS filter. The CLI is parsed
by the testable OpenAiServerCli. (Consolidated from PR #240's JDK + streaming server and #242's
NanoHTTPD server; NanoHTTPD + its dependency deleted.)
IDE/agent backend hardening — DONE (from the deep-research investigation
docs/feature-investigation-ide-agent-backend.md;
primary goal: agentic tool-calling with Qwen):
- Agentic tool-calling verified wire-correct: C++ guard pins tool_calls.function.arguments as a JSON string (not object) at b9739 (llama.cpp #20198), plus the existing finish_reason:"tool_calls" test.
- stream_options.include_usage forwarded (new InferenceParameters.withStreamOptions) so the trailing usage chunk is emitted, and OpenAiSseFormatter.ensureUsageCachedTokens guarantees usage.prompt_tokens_details.cached_tokens (fixes the Copilot custom-endpoint crash, vscode #273482).
- response_format (json_object / json_schema) forwarded for structured outputs.
- POST /infill (FIM autocomplete for llama.vscode/Twinny/Tabby/Continue) → native handleInfill.
- POST /v1/rerank (RAG) → handleRerank reshaped to results/data (OaiRerankSupport).
- CORS preflight + Access-Control-Allow-Origin; bare-path (no /v1) aliases; cache_prompt=true default; --mmproj (vision), --embedding, --reranking CLI flags.
- Alternative protocol surfaces (pure translation over the OpenAI core; tool calls reconstructed by ToolCallDeltaAccumulator):
  - Ollama-native (/api/version, /api/tags, /api/show, /api/chat with NDJSON streaming, /api/generate prompt-completion/FIM; OllamaApiSupport; /api/show advertises tools/insert/vision + context length).
  - Anthropic Messages (POST /v1/messages, SSE events; AnthropicApiSupport + AnthropicStreamTranslator).
  - OpenAI Responses (POST /v1/responses, SSE events; ResponsesApiSupport + ResponsesStreamTranslator).
- GET /props (llama.cpp-native): default_generation_settings.n_ctx + modalities so autocomplete clients (llama.vscode) size their context window (OpenAiSseFormatter.propsJson).
- Gated integration round-trips over a real socket, run in CI's test-java-linux-x86_64 job, self-skipping when the model is absent; structural assertions only:
  - OpenAiCompatServerIntegrationTest (Qwen3-0.6B, chat mode): OpenAI chat (non-stream/stream/tools/models) plus Ollama /api/chat + discovery, Anthropic /v1/messages, OpenAI /v1/responses (non-stream + stream) and /props.
  - OpenAiServerEmbeddingsIntegrationTest (CodeLlama-7B + enableEmbedding): /v1/embeddings (+ bare alias).
  - OpenAiServerRerankIntegrationTest (jina-reranker + enableReranking): /v1/rerank (sorted results/data, top_n cap).
  - OpenAiServerCompletionIntegrationTest (CodeLlama-7B): /v1/completions, /infill, and Ollama /api/generate (plain + FIM via suffix).
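The alternative surfaces rebuild whole tool calls from OpenAI-style streaming deltas, where the id and function name typically arrive on the first delta for an index and the arguments arrive as string fragments. The sketch below illustrates that accumulation idea only; it is not the project's ToolCallDeltaAccumulator, and every name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public final class ToolCallAccumulatorSketch {

    /** One reconstructed tool call; arguments stay a JSON string, never a parsed object. */
    public static final class ToolCall {
        public String id;
        public String name;
        public final StringBuilder arguments = new StringBuilder();
    }

    // TreeMap keeps calls ordered by their stream index.
    private final Map<Integer, ToolCall> byIndex = new TreeMap<>();

    /** Applies one delta; id, name and argumentsFragment may each be null. */
    public void accept(int index, String id, String name, String argumentsFragment) {
        ToolCall call = byIndex.computeIfAbsent(index, i -> new ToolCall());
        if (id != null) call.id = id;
        if (name != null) call.name = name;
        if (argumentsFragment != null) call.arguments.append(argumentsFragment);
    }

    /** Returns the completed calls once the stream has ended. */
    public List<ToolCall> finish() {
        return new ArrayList<>(byIndex.values());
    }
}
```

Usage: feed `accept(0, "call_1", "get_weather", "{\"city\":")` followed by `accept(0, null, null, "\"Oslo\"}")`, then `finish()` yields one call whose arguments string is `{"city":"Oslo"}`.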
Open follow-ups (deferred):
- Streaming raw-completion path: IN PROGRESS (no new native method needed). The earlier premise was wrong: a streaming raw-completion JNI path already exists (requestCompletion / receiveCompletionJson, exposed as LlamaModel.generate(InferenceParameters) → LlamaIterable), so this is Java-only server wiring, not JNI/C++. Progress: (a) streaming POST /v1/completions is DONE (OpenAiRequestMapper toCompletionParameters + OpenAiBackend.streamCompletions driving generate() + an OpenAiSseFormatter.completionChunk text_completion chunk + the streamCompletions SSE handler; HTTP test green). Remaining: (b) token-streaming Ollama /api/generate (translate the text_completion chunks to NDJSON, mirroring the chat→Ollama translator) and (c) Continue's native POST /completion route in the llama.cpp-native streaming shape ({"content":…,"stop":…} per chunk).
- Future output modalities (audio / image): design note, not yet actionable. llama.cpp's server produces text (plus embeddings/rerank); it does not generate images or audio output, so there is no engine behind a TTS/image-gen response today and building that API surface now would be dead code. When/if it becomes real, the integration points are already isolated: a new OpenAiBackend.stream* primitive + an OpenAiSseFormatter.*Chunk formatter per modality, wired into a per-route handler, which is the exact shape the text streamCompletions path now establishes. Two concrete future hooks: (1) llama.cpp's OuteTTS audio path (if it lands in the embedded server) → an /v1/audio/speech-style route emitting audio chunks; (2) routing image/audio generation to an external model behind the same server (the binding would proxy, not generate). Keep LlamaOutput/chunk formatters modality-neutral so neither requires reworking the streaming core.
- Incremental tool-call streaming on the alternative surfaces. Ollama/Anthropic/Responses emit each tool call whole at end-of-stream (reconstructed by ToolCallDeltaAccumulator) rather than streaming argument fragments. Fine for clients that apply tool calls after generation; revisit if a client needs incremental input_json_delta / function_call_arguments.delta fidelity.
- Per-model FIM template registry (Qwen/CodeLlama/DeepSeek v1&V2/StarCoder2/Codestral): only needed if we also expose /v1/completions-with-suffix FIM; /infill (and Ollama /api/generate with a suffix) applies the model's FIM tokens server-side, so this is lower value.
- Multi-model registry. Only one model id is advertised/served today; serving several would need multi-model load + lifecycle management.
- Manual real-client validation. Gated server-side round-trips now exist for every surface (above). What remains is manual validation against the actual editor clients (point Copilot's Ollama provider / a Custom Endpoint, Claude Code, and a Responses client at the running server), since a server-side round-trip confirms the wire shapes but not each client's own parser.
- Gemma 4 tool-calling validation. Confirm the pinned llama.cpp (b9789) includes the Gemma 4 tool-call parser fixes; if not, bump per the upgrade procedure.
- NativeServer: wire upstream server.cpp routes to JNI (in progress; scaffold landed dd264b2). The upstream HTTP transport (tools/server/server-http.cpp + the cpp-httplib backend) is already compiled into libjllama, and a server.NativeServer Java scaffold + NativeServerSmokeTest landed in dd264b2. Remaining: wire the upstream server.cpp route table (the one upstream TU still excluded from the build, since it carries main() + route wiring) to JNI so the native HTTP server (and the embedded WebUI) can be started/stopped from Java. This is the native-transport alternative to the JDK-based OpenAiCompatServer (which is complete and the primary surface); value is shipping the full llama.cpp server + WebUI in-process without a separate llama-server binary. JNI + C++ work.
Design decision (do not revisit without the owner): the MSVC / Visual Studio build is the
default JAR and is kept permanently — never retired. The Ninja Multi-Config build is shipped
alongside it as the ninja-windows classifier JAR, never as a replacement. The loss of the
sccache cache on the MSVC build is accepted; the Ninja build exists so a cache-accelerated,
independently validated second Windows artifact is available for users to compare/adopt.
Why two builds. The cache mechanism is the CMake compiler launcher
(-DCMAKE_C_COMPILER_LAUNCHER=sccache). The Visual Studio generator ignores it entirely
(only Ninja/Makefile generators honor it), so the MSVC jobs can never cache. The Ninja
Multi-Config generator does honor it (upstream llama.cpp b9739 ships windows-cuda this way,
proving Ninja Multi-Config + MSVC works on the same tree). The two builds produce different
jllama.dlls, so they cannot coexist at the same resource path in one JAR — hence the classifier.
What shipped (this branch):
- 4 Windows build jobs, all permanent: build-windows-x86_64, build-windows-x86 (MSVC, default JAR) and build-windows-x86_64-ninja, build-windows-x86-ninja (Ninja + sccache/Depot).
- Both tested end-to-end: all four run the C++ unit tests (ctest); test-java-windows-x86_64 (MSVC) and the new test-java-windows-x86_64-ninja (Ninja) both load the DLL via JNI and run the full model-backed Java suite.
- .github/build.bat: sccache probe guard (mirrors build.sh's sccache_can_wrap_compiler()): USE_CACHE=true + sccache on PATH + a trivial TU compiling through sccache cl.exe ⇒ -DCMAKE_{C,CXX}_COMPILER_LAUNCHER=sccache + sccache --show-stats; else green uncached. Inert for the MSVC jobs (they don't set USE_CACHE).
- pom.xml: windows-ninja profile → <classifier>ninja-windows</classifier> JAR from ${project.build.outputDirectory}_windows_ninja (mirrors the cuda / opencl-android profiles).
- publish.yml: the package, publish-snapshot, publish-release jobs download Windows-{x86_64,x86}-ninja into src/main/resources_windows_ninja/ and activate the windows-ninja profile; the Ninja build + Java-test jobs are in the package needs: graph.
- Docs: README.md classifier table + CLAUDE.md "Windows Ninja artifact" section.
Verification — DONE (PR #248). The Ninja jobs are green and cache-warm: Build and Test Windows … (Ninja … sccache, eval) builds + ctest pass, and Java Tests Windows 2025 x86_64 (Ninja, eval) loads the DLL via JNI and runs the full model-backed suite green (after the b9739
arg-parse patch landed). sccache --show-stats confirms cache hits on the Ninja jobs.
Optional follow-up: smoke-test that the published ninja-windows classifier JAR loads its DLL
on a clean Windows host. Publishing is gated behind publish_to_central, so a broken Windows job
blocks the release before any artifact reaches Central/GitHub Releases.
Reference notes:
- Cache backend is sccache + Depot WebDAV (consistent with the other 8 jobs: one token, shared cross-branch) rather than upstream's per-branch ccache. sccache supports MSVC cl.exe; the Release config emits no debug info, so the /Zi → /Z7 PDB caveat doesn't apply.
- It is "Ninja Multi-Config", not plain Ninja: it keeps multi-config semantics, so cmake --build … --config Release and the config-specific RUNTIME_OUTPUT_DIRECTORY_RELEASE properties behave exactly as under the VS generator; /MT runtime and x64-vs-x86 gating unchanged.
- The arch (x64/x86) comes from ilammy/msvc-dev-cmd@v1, not a -A flag (Ninja takes no -A).
Status: FIXED via local source patch (patches/0001-win32-arg-parse-embed-guard.patch). Surfaced
while bringing PR #248 green (the b9739 build fixes let the Windows Java jobs run to completion and
exposed this). Applied through the generic patches/ mechanism (see CLAUDE.md "Local llama.cpp source
patches"), so it covers every C++ build and re-applies on each clean build.
Note on the fix shape (count-guard → deterministic removal). The first patch used fix option 1
below — the count-guard (override only when the re-derived arg count equals argc). It fixed 21/25
Windows Java tests, but collided on the 4 server-integration setups (OpenAiServerRerank*,
OpenAiServerToolCalling*, MultimodalIntegrationTest, OpenAiCompatServerIntegrationTest) whose
argv length happened to equal java.exe's, so they kept failing with the same parse error. The patch
was changed to fix option 2 (drop the override entirely for our build — a JNI library is never the
process, so the override is pure liability), which is deterministic. As of the b9789 bump the patch
was reshaped into the clean opt-in form intended for upstreaming (fix option 3's core):
common_params_parse now parses exactly the argv it is given, and a new common_params_parse_main()
wrapper carries the GetCommandLineW UTF-8 recovery that the standalone tools' main() opt into.
The patch now carries the full upstream change (37 files): the ~34 common_params_parse(argc, argv, …) call sites across tools/*, examples/* and the tests/* programs flip to
common_params_parse_main(), plus a tests/test-arg-parser.cpp regression case. Embedded callers stay
on common_params_parse. Our subproject build compiles only the arg.{cpp,h} core
(LLAMA_BUILD_TOOLS/TESTS OFF), so the flips + test are validated via a one-off tools+tests build
(the new test's asserts pass; test-arg-parser's only red is the live ggml.ai download check, which
is sandbox-network). The 37-file patch must be re-verified on each llama.cpp bump (the applier fails
loud). Submit it to llama.cpp and drop the local copy once merged.
Symptom. On Windows x86_64 only, every Java test that loads a real model fails in
LlamaModel.loadModel (native) with LlamaException: "Failed to parse model parameters"
(25 errors in Java Tests Windows 2025 x86_64, both the VS and Ninja DLLs). macOS and Linux Java
tests pass. The argv we build is platform-neutral (--model models/<file>.gguf, relative, forward
slashes — TestConstants.MODEL_PATH), so it is not the Windows-Ninja build, not our argv,
and not a path/escaping issue.
Root cause (upstream llama.cpp, new in b9739). jllama.cpp (load_model_impl, ~line 606) builds
a CLI argv from ModelParameters and calls upstream
common_params_parse(argc, argv, params, LLAMA_EXAMPLE_SERVER). In b9739, common/arg.cpp's
common_params_parse gained a Windows-only prologue (arg.cpp:924-931):
bool common_params_parse(int argc, char ** argv, ...) {
#ifdef _WIN32
auto utf8 = make_utf8_argv(); // = CommandLineToArgvW(GetCommandLineW())
if (!utf8.ptrs.empty()) { // always non-empty under a JVM
argc = (int) utf8.buf.size();
argv = utf8.ptrs.data(); // DISCARDS the caller-supplied argv
}
#endif
... common_params_parse_ex(argc, argv, ctx_arg) ...
}
It unconditionally replaces the caller's argv with the host process command line
(GetCommandLineW()). For the standalone llama-server.exe this is correct (fixes UTF-8 CLI args).
For an embedded/JNI caller the process is java.exe, whose command line has no --model, so
common_params_parse_ex fails and common_params_parse returns false → our "Failed to parse model
parameters". common_params_parse_ex is static, so we cannot bypass the block by calling the inner
parser. Our JNI already passes correct UTF-8 argv (GetStringUTFChars), so the re-derivation is
unnecessary for us. This is an upstream bug affecting every embedded Windows consumer of
common_params_parse.
Fix options (history — option 2 chosen). (1) guard the block by arg-count — tried first, it
collided (see the count-guard note above); (2) remove the _WIN32 override for our build — CHOSEN
(deterministic; our JNI always passes correct UTF-8 argv); (3) file an upstream PR and wait. The patch
re-applies on every llama.cpp bump and the applier fails loud if it stops applying — it is part of the
upgrade checklist. Pre-existing on main since #247 (b9682→b9739); independent of the Windows-Ninja
classifier work. Remaining open item: the upstream PR (see "Upstream llama.cpp PR" below) so the
local patch can eventually be dropped.
The PR's only red is SonarCloud's "Security Rating on New Code" gate (every build/test job is green; SonarCloud is not a merge-blocking build job). The findings are GitHub-Actions/Java analyzer issues from the Maven scanner — "C" is the rating grade (A–E), not the C language; there is no CFamily/C-C++ scan configured. Addressed:
- clang-format.yml: pip install without --only-binary :all: can run a package's setup.py; forced wheels-only (84297e0, block scalar so :all: doesn't break YAML). If Sonar still flags it, try the --only-binary=:all: equals form.
- osv-scanner.yml / scorecard.yml: top-level permissions: read-all → contents: read (84297e0); safe because every job in both files already declares its own exact permissions.
- publish.yml: workflow-level permissions: contents: read (Sonar wants it per-job); owner marked it Accept/"Won't fix" on the dashboard rather than spreading perms across ~25 release jobs. Alternative if ever desired: add permissions: contents: read to the ~19 read-only jobs (the 5 publish/report jobs already declare contents: write) and drop the top-level block.
- PairTest.java: 3 Critical Reliability bugs (assertNotNull on the primitive hashCode()) replaced with a determinism check (9f0d377). Reliability rating, not the Security gate.
Still open: the gate was still red as of 9f0d377. SonarCloud's issues API is auth-gated (403 from
CI), so the exact remaining new-code Vulnerability must be read off the dashboard. Resolve the last
finding, accept it on the dashboard, or merge on the green build/test checks.
Separate from the FSFE REUSE check (which is green — reuse lint reports 266/266 files compliant)
and from SonarCloud: the PR's combined commit status shows a "License Compliance" check failing with
"17 issues found" (an error-state commit status posted by a license-scanner GitHub App, not a
workflow in .github/workflows/). It contributes to the mergeable_state: blocked on #248.
- Almost certainly pre-existing, not introduced by this PR: #248 changes no dependencies (the pom.xml edit only adds the windows-ninja build profile), so the 17 are dependency-license policy findings already present on main (e.g. GPL-2.0 carried by the llama.cpp sources).
- Not yet inspected: the scanner's dashboard/host is outside this sandbox's egress allowlist, same as sonarcloud.io. To triage: open the check's details link from the PR (or allowlist the host), read the 17 findings, then accept policy-OK licenses on the dashboard or adjust the policy. Confirm whether it is a required status (if so it blocks merge; if advisory it does not).
patches/0001-win32-arg-parse-embed-guard.patch is a local fix re-applied on every build. To drop
it, PR upstream (against #24779): add a common_params_parse_argv companion (or a
common_params_parse opt-out flag) that trusts the caller's argv — preserving the standalone tools'
UTF-8 fix while letting embedders (JNI, and any FFI binding) pass their own argv. Ship with the
standalone-safe repro (a plain exe that passes a synthetic argv and shows it gets discarded on Windows
because GetCommandLineW() returns the host process line). Once merged and the pin is bumped past it,
delete the patch.
The native aarch64 switch renamed the check Cross-Compile Linux aarch64 (LTS) → Build and Test Linux aarch64. If a required status check pinned the old name, repoint it or it will sit pending
forever.
These are JNI plumbing items for upstream API additions. Policy: add only after a real user request — they are mostly relevant to specific model families or specialized workflows.
- Expose --spec-draft-backend-sampling toggle via ModelParameters.setSpecDraftBackendSampling(boolean). Added in b9437 (env LLAMA_ARG_SPEC_DRAFT_BACKEND_SAMPLING). Backend sampling for the speculative draft is enabled by default upstream but auto-disabled on LLAMA_SPLIT_MODE_TENSOR setups; an explicit Java-side setter lets callers force-disable it for benchmarking or for backends with sampler bugs. Speculative-decoding power users.
- Expose runtime reasoning control via InferenceParameters.setReasoningControl(boolean) + LlamaModel.endReasoning(...). Added in b9444–b9490: new common_params_sampling::reasoning_control flag arms the budget sampler so reasoning can be ended at runtime, and new common_sampler_reasoning_budget_force(common_sampler *) triggers the end-of-thinking token injection on the next sample. Upstream also adds a POST /v1/chat/completions/control server endpoint accepting {"id": "...", "action": "reasoning_end"}. Java mapping would be: (a) InferenceParameters.setReasoningControl(boolean) arms the sampler on the inference run, (b) a new LlamaModel.endReasoning(int slotId) (or per-streaming-task-id) JNI method calls the upstream common_sampler_reasoning_budget_force against the slot's sampler. Useful for interactive UIs that want a "skip thinking and answer now" button. Relevant only for reasoning-trained models (DeepSeek-R1, Qwen3-Thinking, GPT-OSS-Reasoner, etc.).
- Expose llama_context_params::n_outputs_max via ModelParameters.setMaxOutputs(int). Added in b9444–b9490 (default -1 = derived from n_batch). Caps the number of output slots allocated per context; relevant for memory-constrained setups that always run with logits_all=false and want to prevent over-allocation when n_batch is large. Trivial JNI plumbing (one cparams field passthrough); add when a user reports OOM on context creation tied to output slot pre-allocation.
- Expose Multi-Token Prediction toggle via ModelParameters.setMtp(boolean). Existed since the Qwen3.5 MTP work; b9444–b9490 extends it to Step-3.5. CLI flags --mtp / --no-mtp (env LLAMA_ARG_MTP) control whether the draft head runs alongside the main model for accelerated decoding. Java setter would route to common_params_speculative::type = COMMON_SPECULATIVE_TYPE_DRAFT_MTP. Relevant only for MTP-trained models.
- Expose llama_vocab::get_suppress_tokens() via LlamaModel.getSuppressTokens(). Added in b9490–b9495 alongside the new tokenizer.ggml.suppress_tokens GGUF key and the LLM_KV_TOKENIZER_SUPPRESS_TOKENS constant. When a GGUF declares this array, upstream stores it on llama_vocab::impl::suppress_tokens and exposes it via the new llama_vocab::get_suppress_tokens() accessor. The bias is applied automatically inside the model forward graph: the Gemma4 Unified graph (src/models/gemma4.cpp) reads the list and adds a -INFINITY logit bias to those token IDs via a new llm_graph_input_logits_bias input so the model cannot emit them (used to block <image|> / <audio|> placeholders). A Java mirror would be public int[] getSuppressTokens() on LlamaModel: a read-only inspector returning the suppression list for debugging or for callers running their own sampling who want to replicate the same bias. Value is low (the bias is auto-applied, Java callers cannot change it; java-llama.cpp does not expose custom logit-bias hooks at this level); cost is trivial (one JNI passthrough + a getSuppressTokens() Java method).
Raised by @vaiju1981 in PR #251 comment. Feel free to contribute fixes — PRs welcome.
- Unhandled C++ exceptions cross the JNI boundary → JVM abort (UB). Any std::exception (or worse, an exception of unknown type) that escapes a native method and crosses the JNI boundary causes undefined behaviour on most JVMs and typically aborts the process. Each native method in jllama.cpp should wrap its body in try { … } catch (const std::exception& e) { env->ThrowNew(llamaExceptionClass, e.what()); return <zero>; } catch (...) { env->ThrowNew(…, "unknown C++ exception"); return <zero>; } so that errors surface as LlamaException on the Java side instead of crashing the JVM.
- parse_string_array: null deref + JNI local-reference leak. The helper that reads a JSON string array from JNI can dereference a null pointer when an array element is absent, and leaks JNI local references when an early exit skips the matching DeleteLocalRef. Fix: guard every GetObjectArrayElement result and pair each reference acquisition with a DeleteLocalRef before the next iteration or return.
- close() / native delete() double-free under concurrent close. If two threads race to call LlamaModel.close(), both can reach the native delete path and free the same jllama_context pointer twice → heap corruption. Fix: use AtomicBoolean closed + a synchronized guard (or compareAndSet) on the Java side so close() is idempotent and the native pointer is nulled before the second caller can reach it.
- ServerMetrics.getCumulativeTimings() truncates cumulative token totals to int. The cumulative token counters are stored as long in the JSON but cast to int when constructing ServerMetrics, silently truncating values above Integer.MAX_VALUE (~2.1 billion tokens). Fix: widen the field and constructor parameter to long.
- Unbounded request-body read → OOM DoS. The HTTP handler reads the entire request body into a String / byte[] before parsing it, with no size cap. A client that streams a multi-gigabyte body can exhaust heap memory and crash the JVM. Fix: add a configurable maxRequestBodyBytes limit (e.g. default 4 MB) and reject oversized requests with HTTP 413 Content Too Large before buffering them.
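The compareAndSet variant of the idempotent close suggested above can be sketched as follows. The class is a hypothetical stand-in for a model handle: `releaseNative` only counts calls in place of the real native delete.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public final class IdempotentCloseSketch implements AutoCloseable {

    private final AtomicBoolean closed = new AtomicBoolean(false);
    private final AtomicInteger nativeFrees = new AtomicInteger(); // stand-in bookkeeping

    @Override
    public void close() {
        // Exactly one caller wins the false -> true transition,
        // so the release below can run at most once, however many threads race.
        if (closed.compareAndSet(false, true)) {
            releaseNative();
        }
    }

    /** Stand-in for the native delete(); here it only counts invocations. */
    private void releaseNative() {
        nativeFrees.incrementAndGet();
    }

    public boolean isClosed() {
        return closed.get();
    }

    public int freeCount() {
        return nativeFrees.get();
    }
}
```

This guards only against a double free. It does not stop another thread from using the handle while close() is running, which is one reason a synchronized close (the shape that actually landed) can be the safer choice.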
- Feature backlog from similar projects. See docs/feature-investigation-similar-projects.md for the consolidated investigation across the 5 pure-Java sibling runtimes (llama3.java, gemma4.java, gptoss.java, qwen35.java, nemotron3.java) plus the dormant alternative JNI binding llamacpp4j. The doc captures 18 candidate items grouped into cross-cutting themes (UTF-8 streaming boundary safety, thinking-channel router, operator timing line, jbang single-file example, README system-properties table, etc.) and per-repo unique findings (Harmony channel decoder, Qwen empty-<think> injection, llama_state_* save/load, llama_adapter_lora_* hot-apply, etc.), each with effort sizing (XS / S / M / L) and a prioritised backlog.
  - Recommended first batch (items 1, 3, 4, 5): UTF-8 boundary-safe streaming decoder + per-run timing line + one jbang-runnable example + a README system-properties table; ~1-2 days total, no JNI changes.
  - DONE so far:
    - README system-properties table (e36f631, with two cleanups in 3ae6c81 + 28dc9e6).
    - Per-run timing line (TimingsLogger class + wire-in to CompletionResponseParser and ChatResponseParser; format mirrors what llama.cpp CLI prints: prompt: N tok in X ms (Y tok/s) | gen: … | cache: N | draft: …; dedicated SLF4J logger net.ladenthin.llama.timings so users can suppress it independently; 7 unit tests pin format + pipeline behaviour).
  - Remaining first-batch items: UTF-8 boundary-safe streaming decoder + jbang example.
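For the remaining UTF-8 boundary-safe streaming decoder item, one possible shape uses the JDK's CharsetDecoder and carries an incomplete trailing byte sequence over to the next chunk. This is a sketch under assumptions, not a design decision from the investigation doc; the class name and the choice to replace malformed input are illustrative.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public final class Utf8StreamDecoderSketch {

    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);

    // Bytes of a multi-byte character whose remainder has not arrived yet.
    private byte[] pending = new byte[0];

    /** Decodes one chunk and returns only the characters that are complete so far. */
    public String feed(byte[] chunk) {
        ByteBuffer in = ByteBuffer.allocate(pending.length + chunk.length);
        in.put(pending);
        in.put(chunk);
        in.flip();
        // UTF-8 never yields more chars than input bytes, so this capacity suffices.
        CharBuffer out = CharBuffer.allocate(in.remaining());
        // endOfInput=false: an incomplete trailing sequence is left unread in `in`
        // instead of being reported as malformed.
        decoder.decode(in, out, false);
        pending = new byte[in.remaining()];
        in.get(pending);
        out.flip();
        return out.toString();
    }
}
```

Usage: feeding the two bytes of "é" (0xC3, 0xA9) in separate chunks returns an empty string for the first call and "é" for the second, rather than two replacement characters.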
- Publish a proper Android AAR alongside the existing JAR-with-resources packaging. Today java-llama.cpp already cross-compiles the Android arm64 native lib in two flavours (CPU-only, bundled into the main JAR; OpenCL/Adreno under classifier opencl-android-aarch64), but both ship as plain Maven JARs that bury libjllama.so under net/ladenthin/llama/Linux-Android/aarch64/. Android/Gradle consumers expect an .aar with an AndroidManifest.xml, the native lib under jni/arm64-v8a/, and Maven coordinates like net.ladenthin:llama-android:<version>@aar. This is the format the LLaMAndroid integration referenced elsewhere in this file has to work around manually. Investigate using com.android.library via Gradle in a sibling module, or hand-rolling the AAR layout from the Maven build. Coordinate ABI coverage with any future armv7-a / x86_64 work so the AAR can declare multiple jniLibs/<abi>/ entries when those land.
- Provide a Kotlin-friendly façade + Android sample app. The pure-Java LlamaIterable / LlamaModel API works on Android today (LLaMAndroid wraps it in a Kotlin flow {} block), but a small first-party Kotlin module (coroutine Flow<LlamaOutput> adapters, suspend variants of the blocking calls, idiomatic use {} resource handling) would lower the integration cost meaningfully and serve as the canonical reference for downstream consumers. Pair it with a minimal sample app (single Activity, model picker, streaming text view) under e.g. examples/android-sample/ so the AAR has an exercised end-to-end path in CI. Treat LLaMAndroid as the prior-art baseline; reuse patterns that already work there.
- Evaluate GraalVM Native Image as an alternative distribution target. Reference: GraalVM Native Image. The pure-Java sibling projects in the README's "Similar Projects" list (mukel's llama3.java / gemma4.java / gptoss.java / qwen35.java / nemotron3.java) demonstrate that single-jar, no-JNI Java inference is viable for individual model architectures. Native Image opens an orthogonal direction for THIS project: AOT-compile the Java layer + JNI bridge to a self-contained binary that bundles the libjllama.so (or per-OS equivalent) and starts in milliseconds without a JVM, which would make jllama usable in CLI tools, serverless functions, and short-lived processes where JVM startup is the dominant cost.

  What to investigate before committing:
  - JNI-loading shape. Native Image supports JNI but requires --enable-native-access=ALL-UNNAMED + reflection/JNI configuration files (reflect-config.json, jni-config.json, resource-config.json) describing every class/method/field reachable across the JNI boundary. The 17 native methods in jllama.cpp plus the JNI-side FindClass / GetFieldID / GetMethodID calls at JNI_OnLoad need to be mapped. The GraalVM tracing agent (-agentlib:native-image-agent=config-output-dir=...) can auto-generate the config during a representative test run, but the LlamaLoader JAR-extraction path needs at least one resource-config rule for net/ladenthin/llama/{OS}/{ARCH}/lib*.so.
  - Native-library packaging. The current LlamaLoader extracts the OS-specific .so / .dll / .dylib from the JAR to a tmp dir at first use. Native Image needs the same file at AOT-execution time, so either (a) ship the native lib alongside the produced binary as a sidecar file and adjust LlamaLoader to find it in the same directory, or (b) embed the native lib as a resource and keep the existing extract-to-tmpdir flow (which Native Image supports via resource-config.json).
  - CUDA / Metal / OpenCL backend selection. Today the choice between CPU-only / cuda13-linux-x86-64 / opencl-android-aarch64 JARs is at Maven-classifier time. Native Image would need either one binary per backend (multiplying the release matrix) or a runtime selector inside LlamaLoader that picks among bundled backend libs. The latter is a bigger refactor.
  - Startup-time benchmark to justify the work. Measure cold-start of a current java-llama.cpp LlamaModel(new ModelParameters().setModel("...").setNPredict(1)) invocation: how much is JVM startup + class load vs JNI load + model parse + tokenize + 1 token? If JVM startup is < 10 % of cold-start, Native Image yields little. If JVM startup is > 50 %, it's a clear win for CLI / serverless use cases.
  - Maintenance cost. Native Image adds a second build matrix (per OS × per backend × per JDK) and a new failure surface (Native Image config drift when a llama.cpp version bump adds new JNI-reachable types). Should ship only with a CI job that exercises the Native Image build on at least one OS, otherwise the config files will rot silently.

  Out of scope until evidence supports it: actually implementing any of the above. This entry exists so that when someone asks "can I ship java-llama.cpp as a single 30 MB binary?" the answer points to a concrete investigation plan rather than restarting from zero.
- jqwik pin policy: see ../workspace/policies/jqwik-prompt-injection.md. jqwik.version ≤ 1.9.3 is mandatory.
- @VisibleForTesting audit. No usages currently. Walk the production tree for package-private/protected methods or fields that exist purely so tests can reach them, and either annotate (com.google.common.annotations.VisibleForTesting) or move into the test source tree.
- Null-safety refinement. JSpecify + NullAway are now enforced at compile time in strict JSpecify mode with the extra options CheckOptionalEmptiness, AcknowledgeRestrictiveAnnotations, AcknowledgeAndroidRecent, AssertsEnabled (see pom.xml); @NullMarked on the three packages via package-info.java; JDK module exports in .mvn/jvm.config. The legacy org.jetbrains.annotations dep has been removed; all nullability annotations are JSpecify. Public-API methods that may legitimately have no value use Optional<T> rather than @Nullable T (ChatResponse.getFirstMessage, ChatMessage.getParts, ChatRequest.buildToolsJson). Open follow-up: review remaining unannotated public API surfaces for places where @Nullable would be more precise than the implicit non-null default.
- SpotBugs effort=Max + threshold=Low: DONE (already enabled in pom.xml), with fb-contrib + findsecbugs, bound to verify. The legacy "flip the pom / ~65 findings" note is stale: only a handful of unexcluded findings remain at any time, and spotbugs:check is kept green. Most recent pass fixed the 6 introduced by the audit Tier-1–3 fixes: withScalar uses a single instanceof Number (no ITC_INHERITANCE_TYPE_CHECKING); ChatMessage.getToolCalls returns a fresh unmodifiable view (no EI_EXPOSE_REP); the LlamaModel batch methods' deliberate re-throw and the ChatMessage public constructor's List param carry narrow <Match> rationale suppressions. Note: spotbugs:check is bound to the verify phase, which the model-backed CI test jobs (mvn test / mvn package) do not reach; run mvn verify (or a dedicated job) to gate it in CI.
- Drop the project-wide OPM_OVERLY_PERMISSIVE_METHOD suppression in spotbugs-exclude.xml once the package-architecture refactor lands (see ../workspace/crossrepostatus.md under "Affects BAF + jllama (multi-package repos)"). The single-root package today makes every "method called only by same-package callers → could be package-private" finding correct-but-unstable; once layers split, cross-layer calls will need public. Snapshot at suppression (07109cc): 25 sites. The same rule is suppressed in BAF (52c8c95) for identical reasons.
- Additional ArchUnit rules to consider: the full layeredArchitecture() rule and a per-module banned-import rule (jacksonBannedFromContractsAndLoader: Jackson kept out of args / callback / exception / loader) are now DONE. Still open: more per-module banned-imports if useful, public-API-surface constraints (no public mutable static state, etc.). Partial progress: 7b6667d covers the "no public field that is not final" sub-rule.
- Cross-repo code-quality TODOs: see ../workspace/policies/code-quality-todos.md for the canonical @VisibleForTesting design-fit review, package hierarchy review, and class/method naming review. This repo has no @VisibleForTesting usages today; package and naming reviews remain open.
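
The startup-time criterion in the GraalVM Native Image item above (JVM startup below 10 % of cold start yields little, above 50 % is a clear win) can be sketched as a small decision helper. This is only an illustration under stated assumptions: StartupShare, Verdict and the INCONCLUSIVE middle band are hypothetical names that do not exist in jllama, and the plan itself does not say how to treat the 10 to 50 % range.

```java
import java.lang.management.ManagementFactory;

/** Hypothetical helper for the startup-share check; not part of jllama. */
final class StartupShare {
    enum Verdict { LITTLE_BENEFIT, INCONCLUSIVE, CLEAR_WIN }

    /** Fraction of the cold start that elapsed before application code ran. */
    static double jvmShare(long jvmStartupMillis, long totalColdStartMillis) {
        if (totalColdStartMillis <= 0 || jvmStartupMillis < 0
                || jvmStartupMillis > totalColdStartMillis) {
            throw new IllegalArgumentException("need 0 <= jvmStartup <= total and total > 0");
        }
        return (double) jvmStartupMillis / totalColdStartMillis;
    }

    /** Applies the thresholds from the investigation plan (< 10 %, > 50 %). */
    static Verdict classify(double share) {
        if (share < 0.10) {
            return Verdict.LITTLE_BENEFIT;
        }
        if (share > 0.50) {
            return Verdict.CLEAR_WIN;
        }
        return Verdict.INCONCLUSIVE; // assumption: the plan leaves this band undecided
    }

    /** Wall-clock milliseconds elapsed since the JVM process started. */
    static long millisSinceJvmStart() {
        return System.currentTimeMillis()
                - ManagementFactory.getRuntimeMXBean().getStartTime();
    }

    public static void main(String[] args) {
        long jvmStartup = millisSinceJvmStart(); // JVM boot + class load up to main
        // Placeholder: load the model and generate one token here.
        long total = Math.max(1L, millisSinceJvmStart());
        double share = jvmShare(Math.min(jvmStartup, total), total);
        System.out.println(classify(share));
    }
}
```

With the placeholder left empty, almost all elapsed time is JVM startup, so the printed verdict is only meaningful once a real model load and one-token generation sit between the two measurements.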
- llama.cpp b9682 → b9739 (#247, merged) + build fixes: server-schema.cpp added to the jllama_test sources (b9739 link fix, 38be6db); test_server.cpp ParamsFromJsonCmpl expectations updated to b9739 schema behavior (aaba886).
- Windows Ninja artifact: ninja-windows classifier JAR built with Ninja Multi-Config + sccache, shipped alongside the permanent MSVC default; both build + Java-test jobs green (e113ed3, 48f0863). (See the open section above for the design rationale; verification is done.)
- Linux aarch64 → native ubuntu-24.04-arm build (ed9ecbb). The dockcross linux-arm64-lts image (GCC 8.5 / glibc 2.17) could no longer compile b9739's C++17 CTAD-in-new; now builds natively with GCC 14 (mirroring upstream), runs ctest on real ARM (446 tests green), and warms sccache (99.66% hits). Trade-off: glibc floor 2.17 → ~2.39 (same envelope as upstream's ARM binaries); documented in the README classifier table. build.sh sccache auto-fetch generalized to aarch64.
- Generic patches/ mechanism: drop *.patch / *.diff in repo-root patches/, applied to the FetchContent'd llama.cpp source by cmake/apply-llama-patches.cmake via the llama.cpp PATCH_COMMAND (cross-platform, idempotent, fail-loud). Covers every C++ build from one place. First patch fixes the Windows JNI arg-parse regression (1d875b1 → deterministic form f651b53). REUSE annotated via patches/** glob (0cffac1).
- CUDA sccache verified: the manylinux_2_28 (CUDA) job caches all gcc C/C++ TUs (247/248 hits, 99.60%); the nvcc .cu kernels remain uncached (sccache limitation), and CUDA_FAST_BUILD keeps PR/validation runs single-arch. (Doc/observation; no code change.)
The flat net.ladenthin.llama root package was split (via git mv, history
preserved) into layered packages so boundaries align with the layers, enforced
by a new layeredArchitecture() ArchUnit rule (Api → Loader → Marshalling →
Foundation):
- Foundation: value (18 DTOs: ChatMessage, ContentPart, Pair, LlamaOutput, …), callback (CancellationToken, LoadProgressCallback, ToolHandler), exception (LlamaException, ModelUnavailableException), args (existing leaf).
- Marshalling: json (response parsers + TimingsLogger, its only consumer), parameters (Inference/Model/Json/Cli parameters + ParameterJsonSerializer + ChatRequest).
- Loader (internal, NOT exported): loader (LlamaLoader, OSInfo, ProcessRunner, NativeLibraryPermissionSetter, Java8CompatibilityHelper, OfflineModelGuard, LlamaSystemProperties).
- Api (root): LlamaModel, Session, LlamaIterable, LlamaIterator.
Cycle-breaking moves: TimingsLogger root→json, ParameterJsonSerializer
json→parameters, ChatRequest root→parameters (it carries an
InferenceParameters customizer). Test classes mirrored into their subjects'
packages; cross-layer members promoted to public. Cross-package Javadoc
{@link} references fully-qualified (palantir's removeUnusedImports strips
javadoc-only imports). module-info exports the new public-API packages and
keeps loader internal. All 11 ArchUnit rules green; javadoc:jar clean.
Breaking change: public-API FQNs changed (e.g. net.ladenthin.llama.ChatMessage
→ net.ladenthin.llama.value.ChatMessage) — ship under a major version bump.
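
The layer ordering described above can be illustrated in plain Java, without ArchUnit. The sketch assumes that a layer may depend on itself and on any layer to its right in Api → Loader → Marshalling → Foundation; the real layeredArchitecture() rule may be stricter or looser than that. LayerRule, layerOf and mayDependOn are hypothetical names, and only the package-to-layer mapping is taken from the list above.

```java
/** Hypothetical plain-Java illustration of the layer ordering; not the ArchUnit rule. */
final class LayerRule {
    /** Layers in dependency order, outermost first. */
    enum Layer { API, LOADER, MARSHALLING, FOUNDATION }

    private static final String ROOT = "net.ladenthin.llama";

    /** Maps a package name to its layer, following the split described above. */
    static Layer layerOf(String packageName) {
        if (packageName.equals(ROOT)) {
            return Layer.API;
        }
        if (!packageName.startsWith(ROOT + ".")) {
            throw new IllegalArgumentException("outside " + ROOT + ": " + packageName);
        }
        switch (packageName.substring(ROOT.length() + 1)) {
            case "loader":
                return Layer.LOADER;
            case "json":
            case "parameters":
                return Layer.MARSHALLING;
            case "value":
            case "callback":
            case "exception":
            case "args":
                return Layer.FOUNDATION;
            default:
                throw new IllegalArgumentException("unmapped package: " + packageName);
        }
    }

    /** Assumption: a layer may use itself and any layer further to the right. */
    static boolean mayDependOn(Layer from, Layer to) {
        return from.compareTo(to) <= 0;
    }

    public static void main(String[] args) {
        // Api using a Foundation DTO package.
        System.out.println(mayDependOn(layerOf(ROOT), layerOf(ROOT + ".value")));
        // Foundation reaching up into Marshalling.
        System.out.println(mayDependOn(layerOf(ROOT + ".value"), layerOf(ROOT + ".json")));
    }
}
```

Under this reading the cycle-breaking moves follow directly: a class that needs a type from a layer to its left has to move into that layer or a layer above it.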
- Reactive LlamaPublisher removed in favour of consumer-side adapters. The hand-rolled LlamaPublisher + LlamaModel.streamPublisher / streamChatPublisher (shipped in PR #188 as §2.3 of the Kotlin SDK feature comparison) had zero non-test callers. LlamaIterable is already Iterable<LlamaOutput> & AutoCloseable, and every mainstream reactive library wraps it in a few lines via its own resource-management primitive (Flux.using, Flowable.using, Kotlin use {}). The real-world Android consumer LLaMAndroid already uses LlamaIterable inside a Kotlin flow {} block, bypassing the publisher entirely. README "Reactive integration" section documents the Reactor / RxJava 3 / Kotlin Flow / Akka patterns; correctness is pinned end-to-end by a new ReactorIntegrationTest using test-scope reactor-core (zero runtime deps added; org.reactivestreams runtime dep dropped). Cleared 6 fb-contrib Max+Low findings on LlamaPublisher$LlamaSubscription as a side effect.
- Error Prone bug-pattern promotions to ERROR: 855f447 (12 patterns promoted; -Xlint:all enabled).
- javac -Werror + -Xlint:all,-serial,-options,-classfile,-processing: 3e2efbb. ~20 EP warnings addressed first (EqualsGetClass on Pair via instanceof; MissingOverride on PoolingType / RopeScalingType; JdkObsolete LinkedList → ArrayList in LlamaLoader; StringSplitter inline-suppressed; 3× StringCaseLocaleUsage Locale.ROOT in OSInfo; EmptyCatch in OSInfo.isAlpineLinux; FutureReturnValueIgnored in LlamaModel.completeAsync; Finalize on LlamaModel.finalize; MixedMutabilityReturnType in 4 parser methods; EnumOrdinal in InferenceParameters.setMiroStat; EscapedEntity in InferenceParameters javadoc; 4× TypeParameterUnusedInFormals; AnnotateFormatMethod on Java8CompatibilityHelper.formatted; SafeVarargs + varargs on Java8CompatibilityHelper.listOf).
- -parameters javac arg: 4350cf2.
- --release N: 4350cf2 (<release>8</release>).
- Mutation-testing threshold enforcement (PIT): 62f8a00 + bb93a8f (docs) + 3bfa51f (README badge). Runs every CI build with <mutationThreshold>100</mutationThreshold>. Scope expanded 2026-06-07 from the original single Pair target (which was stale after the restructure: llama.Pair → value.Pair matched nothing) to value.* + exception.* + args.* + json.TimingsLogger = 27 classes / 163 mutations, all killed. Still open (optional): json.ChatResponseParser / CompletionResponseParser private-helper survivors (RerankResponseParser is excluded: equivalent empty-list mutant).
- Checker Framework as a second static-nullness pass: c63870b. The original @PolyNull on JsonParameters.toJsonString was simplified to plain @Nullable (the only @PolyNull site in production; eliminated in a later cleanup). Native-method constructor calls in LlamaModel carry @SuppressWarnings("method.invocation") (Checker's @UnderInitialization cannot see that the native callee does not dereference this); Pair.equals and Usage.equals declare @Nullable Object; LlamaSystemProperties getters return @Nullable String; getPackage() and resource-stream null derefs are guarded.
- JPMS module-info.java with module-level @NullMarked: 0fd066a + 9528e79. The module net.ladenthin.llama exports the three hand-written public packages (net.ladenthin.llama, .args, .json). Two-execution maven-compiler-plugin pattern; module-level @NullMarked lives on the module descriptor.
- Banned-API enforcement: Maven Enforcer (8baae0c), ArchUnit System.exit / new Random / Thread.sleep (329d764), sun.* / com.sun.* / jdk.internal.* (e6069da).
- ArchUnit public-fields-final: 7b6667d.
- LogCaptor smoke test: LoggingSmokeTest (3cedc6e).
- Offline / air-gapped model loading: ModelFlag.OFFLINE + ModelParameters.setOffline(boolean) + hasFlag helper + public ModelUnavailableException (extends now-public LlamaException) + deterministic pre-check OfflineModelGuard. Unit tests in LlamaModelOfflineTest. No JNI rebuild required. (Originally shipped as SKIP_DOWNLOAD / setSkipDownload over a parse-failure heuristic; reworked when llama.cpp b9803 removed common_params::skip_download and common_skip_download_exception. --skip-download was never a registered upstream arg, so it never actually skipped a download; --offline is the real upstream flag with the intended load-from-cache semantics.)
- LlamaSystemProperties registry cleanup: getLibName() deleted (6bb63e1 upstream forensic trace); OSInfo.getArchName() now routes through LlamaSystemProperties.getOsinfoArchitecture() (3ae6c81).
- Abstract the Java and test writing guidelines to a workspace-level shared layer. Workspace version chain at ../workspace/guides/src/CODE_WRITING_GUIDE-8.md and ../workspace/guides/test/TEST_WRITING_GUIDE-8.md; canonical TDD skill at ../workspace/.claude/skills/java-tdd-guide/SKILL.md.
- Standardised CLAUDE.md template: ../workspace/templates/CLAUDE.md.template.
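
The consumer-side adapter idea from the LlamaPublisher entry above can be shown with the JDK alone. The sketch below is a java.util.stream analogue of the Flux.using / Flowable.using pattern, not the Reactor or RxJava code documented in the README. CloseableIterable is a hypothetical stand-in for a source that is both Iterable and AutoCloseable, and it assumes close() throws no checked exception, which may not match the real LlamaIterable.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

/** Hypothetical JDK-only adapter sketch; not part of jllama. */
final class IterableAdapters {
    /** Stand-in for a source that is Iterable and AutoCloseable. */
    interface CloseableIterable<T> extends Iterable<T>, AutoCloseable {
        @Override
        void close(); // assumption: no checked exception on close
    }

    /** Wraps the source in a sequential Stream whose close() releases the source. */
    static <T> Stream<T> toStream(CloseableIterable<T> source) {
        return StreamSupport.stream(source.spliterator(), false).onClose(source::close);
    }

    public static void main(String[] args) {
        CloseableIterable<String> source = new CloseableIterable<String>() {
            @Override
            public Iterator<String> iterator() {
                return Arrays.asList("Hel", "lo").iterator();
            }

            @Override
            public void close() {
                System.out.println("source closed");
            }
        };
        // try-with-resources closes the stream, which in turn closes the source.
        try (Stream<String> tokens = toStream(source)) {
            System.out.println(tokens.collect(Collectors.joining()));
        }
    }
}
```

The design point is the same one the entry makes: the library exposes only the iterable, and resource cleanup is delegated to whatever scoping primitive the consumer's framework already provides.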