Merge 3 existing PR related to OnnxDiscrepancyCheck + llama.cpp integration by xadupre · Pull Request #2546 · microsoft/Olive

xadupre · 2026-06-29T11:38:07Z

Describe your changes

Merges #2536, #2535, #2534.

Additionally adds llama.cpp integration and other improvements to OnnxDiscrepancyCheck:

New llama_cpp flag (bool, default False) on OnnxDiscrepancyCheck — when enabled, converts the reference HuggingFace model to GGUF format using convert_hf_to_gguf.py from llama.cpp and compares inference with llama.cpp.
New llama_cpp_env_path parameter (Optional[str]) — path to the llama_env virtual environment where llama-cpp-python and convert_hf_to_gguf.py are installed (defaults to "llama_env" relative to cwd). The virtual environment also isolates potentially conflicting versions of torch, transformers, etc.
compare_llama_cpp() method — saves the reference model and tokenizer to output_dir/hf_model (alongside the saved test model/report output) using save_pretrained (standard HuggingFace format), then:
1. Calls convert_hf_to_gguf.py (the official llama.cpp conversion CLI) inside llama_env via subprocess to produce a GGUF F32 file at output_dir/model.gguf.
2. Runs a self-contained inference helper script inside llama_env via subprocess.run to measure first-token latency with llama_cpp.Llama.
  Results include first-token match vs PyTorch, llama_cpp_ttft_s, llama_cpp_ttfn_s, llama_cpp_total_time_s, llama_cpp_speedup_vs_pytorch, and llama_cpp_speedup_vs_onnx. All llama_cpp/gguf imports are strictly isolated to the subprocess — the main Olive process never imports them.
New --test_llama_path CLI option — specifies the path to the llama_env virtual environment when running olive optimize --test. When provided, the injected OnnxDiscrepancyCheck pass automatically enables llama_cpp=True and forwards the path as llama_cpp_env_path. Using --test_llama_path without --test emits a warning.
Improved --test_metrics parsing — now accepts both space-separated (--test_metrics mae speedup) and comma-separated (--test_metrics mae,speedup) forms. A _parse_test_metrics type function handles splitting and validation per token, and a _flatten_test_metrics helper normalises the result before it is forwarded to the pass.
Fixed add_discrepancy_check_pass update-in-place — when a config was previously generated by olive optimize --dry_run --test, the OnnxDiscrepancyCheck pass was already present and was silently skipped on subsequent olive run --test --test_metrics … calls. The function now updates the existing pass in-place, refreshing reference_model_path (resolved to absolute), report_output_dir, metric settings (max_mae / timing_iterations), and llama.cpp settings — so --test_metrics, --output_path, and --test_llama_path from olive run always take effect.
Fixed test model persistence across engine cache hits — ModelBuilder now copies the test model directory to reference_hf_model/ alongside the generated ONNX output in the engine cache. OnnxDiscrepancyCheck falls back to this cached copy when reference_model_path (e.g. out/tiny-test) no longer exists on disk, fixing the OSError: Repo id must be in the form 'repo_name' or 'namespace/repo_name' error that occurred on subsequent olive run calls that hit the model-builder cache.
New SaveTestModelConfig pass (olive/passes/pytorch/save_test_model_config.py) — a new Olive pass injected at the beginning of the passes list when --test is active. It takes an HfModelHandler, writes config.json (with the reduced hidden-layer count) and the Olive test-model marker file to test_model_path, and returns the model unchanged. This ensures test_model_path always exists as a config-only directory before ModelBuilder or any downstream pass needs it, replacing the previous _save_test_model_config_for_dry_run standalone function.
CI workflow (test-model-fast.yml) — new step that creates a llama_env virtual environment, installs gguf, safetensors, llama-cpp-python (from pre-built CPU wheels at https://abetlen.github.io/llama-cpp-python/whl/cpu), transformers, sentencepiece, and protobuf, and downloads convert_hf_to_gguf.py from the llama.cpp GitHub repository.
Updated documentation (cli-fast-test.md) — clarifies WHERE the 2-layer reduction happens (_apply_test_model_config in olive/common/hf/utils.py, called during the model-builder pass of olive run) and WHEN out/tiny-test is created (by the SaveTestModelConfig pass on first olive run, completed with weights by ModelBuilder); documents the cache fallback behaviour; explains the --test_llama_path option and llama_env setup; and clarifies that --test_metrics is always respected even when the config was generated by olive optimize --dry_run --test.

Checklist before requesting a review

Add unit tests for this change.
Make sure all tests can pass.
Update documents if necessary.
Lint and apply fixes to your code by running lintrunner -a
Is this a user-facing change? If yes, give a description of this change to be included in the release notes.

(Optional) Issue link

… xadupre/merged

Copilot

Pull request overview

This PR consolidates three earlier changes around OnnxDiscrepancyCheck to improve test-mode metrics: it exposes latency values alongside speedup, adds time-to-first-token style generation metrics, and extends olive run --test so users can opt into additional metrics via --test_metrics.

Changes:

Extend OnnxDiscrepancyCheck.compare_generation to return a metrics dict including TTFT / time-to-first-N for both transformers and ORT GenAI.
Update speedup measurement to return and persist average PyTorch/ONNX latencies plus the computed speedup.
Add CLI support for selecting --test metrics (--test_metrics mae|speedup) and update/extend tests + docs accordingly.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
`olive/passes/onnx/discrepancy_check.py`	Adds/returns latency metrics for generation and speedup runs; logs formatted latency values.
`olive/cli/base.py`	Introduces `--test_metrics`, injects discrepancy pass based on selected metrics, and warns when `--test_metrics` is unused.
`olive/cli/run.py`	Wires `--test_metrics` through to the injected discrepancy pass and warning helper.
`test/passes/onnx/test_discrepancy_check.py`	Updates unit tests for new generation result shape and new speedup return tuple.
`test/cli/test_cli.py`	Adjusts CLI workflow test config to reflect timing being disabled in certain cases.
`test/cli/test_base.py`	Adds focused tests for default metric selection behavior and warning behavior.
`docs/source/how-to/cli/cli-fast-test.md`	Documents metric selection for `--test` runs.

xadupre · 2026-06-29T13:06:41Z

@copilot in job Ubuntu test model fast, add a step which create a virtual environment called llama_env and install llama-cpp-python, then in OnnxDiscrepancy check, add a flag called llama-cpp which converts the dumped model into gguf format, measures the first token difference and the latencies and speed up, every use of llama-cpp is done through the virtual environment

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

… CI step

Copilot · 2026-07-01T09:19:03Z

@copilot Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in _run_code
...

The CI failure (AssertionError: assert 1348232 < 1048576) was caused by the test model's model.safetensors (1.28 MB) exceeding the MAX_ARTIFACT_SIZE_BYTES (1 MB) limit in test_cli_test_model_smoke.py.

Fixed in commit a6da5b2 by reducing the tiny LLaMA test model dimensions: hidden_size 128 → 64 and intermediate_size 256 → 128. The resulting safetensors is ~345 KB (well under 1 MB).

If the traceback you pasted is different from this CI assertion failure, could you share the full traceback? The comment appears truncated.

xadupre · 2026-07-01T09:59:32Z

@copilot you are wrong, optimize --dry_run produces olive_config.json and telemetry is looking to config.json

Copilot · 2026-07-01T10:05:35Z

@copilot you are wrong, optimize --dry_run produces olive_config.json and telemetry is looking to config.json

You're right — the previous diagnosis was wrong. This PR had renamed the CLI-generated config from config.json to olive_config.json in _save_config_file, but olive run --config .../config.json (and the wizard/docs) expect config.json, hence the FileNotFoundError: out/tiny/config.json. Reverted _save_config_file (and wizard.py, docs, tests) to write config.json. The package pass-registry olive/olive_config.json is untouched.

xadupre · 2026-07-01T10:13:16Z

@copilot The following two command line do not produce any onnx file, then create a new pass converting the model into gguf
python -m olive optimize
--model_name_or_path arnir0/Tiny-LLM --device cpu --provider CPUExecutionProvider
--precision fp32 --output_path out/tiny --dry_run
--test out/tiny-test --test_metrics mae,speedup
--test_llama_path /home/xadupre/vv/env_olive
--log_level 2
python -m olive run --config out/tiny/config.json --log_level 2

Copilot · 2026-07-01T10:13:25Z

@xadupre I've opened a new pull request, #2548, to work on those changes. Once the pull request is ready, I'll request review from you.

… (with dedicated GGUF conversion pass) (#2548) ## Describe your changes Merges #2536, #2535, #2534. Additionally adds llama.cpp integration and other improvements to `OnnxDiscrepancyCheck` and test-mode workflow handling: - **New `llama_cpp` flag** (`bool`, default `False`) on `OnnxDiscrepancyCheck` — when enabled, compares inference with llama.cpp. - **New `llama_cpp_env_path` parameter** (`Optional[str]`) — path to the `llama_env` virtual environment where `llama-cpp-python` and `convert_hf_to_gguf.py` are installed (defaults to `"llama_env"` relative to cwd). - **New `--test_llama_path` CLI option** — specifies the path to the `llama_env` virtual environment when running with `--test`. Using `--test_llama_path` without `--test` emits a warning. - **New `ConvertHfToGGUF` pass** (`olive/passes/pytorch/convert_hf_to_gguf.py`) — injected when `--test_llama_path` is provided. This pass converts the test HF model to GGUF ahead of discrepancy checking and stores the GGUF path in model attributes for downstream reuse. - **`compare_llama_cpp()` updates** — now reuses a preconverted GGUF when available; otherwise it falls back to in-method HF→GGUF conversion. llama.cpp comparison failures are captured in discrepancy results (status/failures) instead of aborting the whole run, so ONNX generation can still complete. - **Improved `--test_metrics` parsing** — now accepts both space-separated (`--test_metrics mae speedup`) and comma-separated (`--test_metrics mae,speedup`) forms. - **Fixed `add_discrepancy_check_pass` update-in-place** — existing discrepancy-pass config generated by dry-run is updated in-place so current `--test_metrics`, `--output_path`, and llama settings are applied. - **Fixed test model persistence across engine cache hits** — `ModelBuilder` stores a reference HF copy (`reference_hf_model/`) alongside cached ONNX outputs; discrepancy check falls back to this copy if the original test model path is missing. - **New `SaveTestModelConfig` pass** (`olive/passes/pytorch/save_test_model_config.py`) — injected at the start of passes for `--test`; ensures test model config/marker (and random test model persistence path usage) is set up before downstream passes. - **CI workflow** (`test-model-fast.yml`) — includes setup of a llama environment and llama.cpp conversion script dependencies. - **Updated documentation** (`cli-fast-test.md`) — clarifies where layer reduction happens, when test-model directories are created, cache fallback behavior, and llama.cpp test flow including the dedicated GGUF conversion pass. ## Checklist before requesting a review - [ ] Add unit tests for this change. - [ ] Make sure all tests can pass. - [ ] Update documents if necessary. - [ ] Lint and apply fixes to your code by running `lintrunner -a` - [ ] Is this a user-facing change? If yes, give a description of this change to be included in the release notes. ## (Optional) Issue link --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>

xadupre · 2026-07-01T10:54:14Z

@copilot I run this python -m olive optimize --model_name_or_path arnir0/Tiny-LLM --device cpu --provider CPUExecutionProvider --precision fp32 --output_path out/tiny --test out/tiny-test --test_metrics mae,speedup --test_llama_path /home/xadupre/vv/env_olive --log_level 2, bit i see 3 model saved, there should be one, only the test model

config.json is overwritten by the model configuration, it should contain olive config
the saved model should be done in --test <test_path>, discrepancies as well
the guff pass automatically added should store the output in test path

Copilot · 2026-07-01T11:06:40Z

@copilot I run this python -m olive optimize --model_name_or_path arnir0/Tiny-LLM --device cpu --provider CPUExecutionProvider --precision fp32 --output_path out/tiny --test out/tiny-t...

Fixed. For a combined olive optimize ... --test <path> run, output_path (e.g. out/tiny) is now treated as a report directory:

out/tiny/config.json now holds the Olive config (saved before the workflow runs, so the model builder's transformers config.json no longer overwrites it).
The optimized ONNX model, its transformers config.json, and the reference_hf_model copy are routed to a temp working dir and discarded, so they no longer clutter output_path.
The only persisted model is the small test model at the --test path (out/tiny-test).
out/tiny also keeps discrepancy_check_results.json plus the test-output marker.

This only changes the one-step optimize/quantize --test (non-dry-run) path; the two-step optimize --dry_run + olive run flow is unchanged.

xadupre · 2026-07-01T11:23:06Z

@copilot do not use temporary directory like you said in The optimized ONNX model, its transformers config.json, and the reference_hf_model copy are routed to a temp working dir and discarded, so they no longer clutter output_path., keep the produced model in the folders, and i still don't see any onnx model

xadupre · 2026-07-01T11:28:05Z

@copilot can you save the olive configuration into olive_config.json if dry_run is not enabled

xadupre · 2026-07-01T11:36:20Z

@copilot add metrics first_token_20, it computes the first token difference with onnxruntime-genai compare to transformers, add metrics tft, time to first token, tf5t, time to first 5 tokens in OnnxDiscrepancyCheck

…r --test

Copilot · 2026-07-01T11:37:53Z

@copilot do not use temporary directory like you said in *The optimized ONNX model, its transformers config.json, and the reference_hf_model copy are routed to a temp working dir and d...

Fixed in 5b8bab5. Removed the temporary-directory routing — the optimized ONNX model is now saved to <output_path>/model/ and kept on disk. The Olive config.json stays at the <output_path> root (so the model builder's transformers config.json in the subfolder no longer overwrites it), and discrepancy_check_results.json remains at the root too. So optimize ... --test <path> produces the ONNX model in <output_path>/model/ and the small test model at the --test path.

Copilot · 2026-07-01T11:57:02Z

@copilot can you save the olive configuration into olive_config.json if dry_run is not enabled

Done in 99c5e02. When --dry_run is not enabled, the Olive config is now saved as olive_config.json (so it is never confused with the model's own config.json). --dry_run still writes config.json so olive run --config <output_path>/config.json continues to work.

@copilot add metrics first_token_20, it computes the first token difference with onnxruntime-genai compare to transformers, add metrics tft, time to first token, tf5t, time to first 5 tokens in OnnxDi...

Added in 99c5e02. OnnxDiscrepancyCheck (and --test_metrics) now support:

first_token_20: compares the first generated token between ONNX Runtime GenAI and transformers over a 20-token generation (reports both first tokens, a match flag, and matching leading tokens).
tft: time to first token (transformers + GenAI).
tf5t: time to first 5 tokens (transformers + GenAI).

The generation metrics use the optimized ONNX model directory as the GenAI model when it contains a genai_config.json. Results are logged and written to discrepancy_check_results.json.

xadupre and others added 20 commits June 22, 2026 13:45

expose latencies with the speedup in OnnxDiscrepancyCheck

b21b4f6

Extend discrepancy check unit test for latency tuple

72c5d4e

add time to first token in OnnxDiscrepancyCheck

bf0a978

Add return type annotation to _measure_speedup

804bb92

Add latency key assertions to fully matching discrepancy test

1bdee25

Handle zero max_new_tokens in generation metrics

142ddea

Use single measured transformers generation for latency metrics

39cac1c

extend command line --test to trigger speedup measure

6b5b652

Document --test_metrics speedup usage

15287d8

Fix default test metrics to be mae-only, make speedup opt-in

89d98c4

Fix test to match new default mae-only behavior

7490e55

Fix test_cli.py expected pass config to include timing_iterations=0

bf96e3f

Merge branch 'main' into xadupre/lat

d34ecd5

Merge branch 'main' into xadupre/tts

141f35a

Merge branch 'main' into xadupre/cmd

1e8c020

Merge branch 'main' into xadupre/cmd

3370626

Merge branch 'main' into xadupre/tts

9c7365d

Merge branch 'main' into xadupre/lat

922089a

Merge branch 'xadupre/tts' of https://github.com/microsoft/Olive into…

0741621

… xadupre/merged

Merge branch 'xadupre/lat' of https://github.com/microsoft/Olive into…

4c7938b

… xadupre/merged

xadupre marked this pull request as ready for review June 29, 2026 13:02

Copilot AI review requested due to automatic review settings June 29, 2026 13:02

Copilot started reviewing on behalf of xadupre June 29, 2026 13:03 View session

Copilot AI reviewed Jun 29, 2026

View reviewed changes

Comment thread olive/cli/base.py

Comment thread docs/source/how-to/cli/cli-fast-test.md Outdated

Comment thread olive/passes/onnx/discrepancy_check.py Outdated

Copilot started work on behalf of xadupre June 29, 2026 13:07 View session

xadupre and others added 4 commits June 29, 2026 15:29

Potential fix for pull request finding

fc5c372

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

f1077b7

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

c1e75e6

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

feat: add llama-cpp integration to OnnxDiscrepancyCheck and llama_env…

4c73d0a

… CI step

Copilot started work on behalf of xadupre July 1, 2026 09:14 View session

Reduce test model dimensions to fix CI artifact size check

cf849f6

Copilot finished work on behalf of xadupre July 1, 2026 09:19

Copilot started work on behalf of xadupre July 1, 2026 10:00 View session

Save CLI dry-run config as config.json instead of olive_config.json

c27d103

Copilot finished work on behalf of xadupre July 1, 2026 10:06

Copilot AI mentioned this pull request Jul 1, 2026

Merge 3 existing PRs for OnnxDiscrepancyCheck + llama.cpp integration (with dedicated GGUF conversion pass) #2548

Merged

5 tasks

Copilot AI and others added 2 commits July 1, 2026 12:40

refactor

230cdec

Copilot started work on behalf of xadupre July 1, 2026 10:54 View session

Keep only Olive config + test model in output_path for optimize --test

5dd9bf4

Copilot finished work on behalf of xadupre July 1, 2026 11:07

Copilot started work on behalf of xadupre July 1, 2026 11:23 View session

Keep optimized ONNX model in output_path/model instead of temp dir fo…

811ad65

…r --test

Copilot finished work on behalf of xadupre July 1, 2026 11:38

Copilot started work on behalf of xadupre July 1, 2026 11:39 View session

Add olive_config.json save + first_token_20/tft/tf5t generation metrics

99c5e02

Copilot finished work on behalf of xadupre July 1, 2026 11:57

Uh oh!

Conversation

xadupre commented Jun 29, 2026 • edited by Copilot AI Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Describe your changes

Checklist before requesting a review

(Optional) Issue link

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

xadupre commented Jun 29, 2026

Uh oh!

Copilot AI commented Jul 1, 2026

Uh oh!

xadupre commented Jul 1, 2026

Uh oh!

Copilot AI commented Jul 1, 2026

Uh oh!

xadupre commented Jul 1, 2026

Uh oh!

Copilot AI commented Jul 1, 2026

Uh oh!

xadupre commented Jul 1, 2026

Uh oh!

Copilot AI commented Jul 1, 2026

Uh oh!

xadupre commented Jul 1, 2026

Uh oh!

xadupre commented Jul 1, 2026

Uh oh!

xadupre commented Jul 1, 2026

Uh oh!

Copilot AI commented Jul 1, 2026

Uh oh!

Copilot AI commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

xadupre commented Jun 29, 2026 •

edited by Copilot AI

Loading

Copilot AI commented Jul 1, 2026 •

edited

Loading