Skip to content

Merge 3 existing PR related to OnnxDiscrepancyCheck + llama.cpp integration#2546

Open
xadupre wants to merge 62 commits into
mainfrom
xadupre/merged
Open

Merge 3 existing PR related to OnnxDiscrepancyCheck + llama.cpp integration#2546
xadupre wants to merge 62 commits into
mainfrom
xadupre/merged

Conversation

@xadupre

@xadupre xadupre commented Jun 29, 2026

Copy link
Copy Markdown
Member

Describe your changes

Merges #2536, #2535, #2534.

Additionally adds llama.cpp integration and other improvements to OnnxDiscrepancyCheck:

  • New llama_cpp flag (bool, default False) on OnnxDiscrepancyCheck — when enabled, converts the reference HuggingFace model to GGUF format using convert_hf_to_gguf.py from llama.cpp and compares inference with llama.cpp.
  • New llama_cpp_env_path parameter (Optional[str]) — path to the llama_env virtual environment where llama-cpp-python and convert_hf_to_gguf.py are installed (defaults to "llama_env" relative to cwd). The virtual environment also isolates potentially conflicting versions of torch, transformers, etc.
  • compare_llama_cpp() method — saves the reference model and tokenizer to output_dir/hf_model (alongside the saved test model/report output) using save_pretrained (standard HuggingFace format), then:
    1. Calls convert_hf_to_gguf.py (the official llama.cpp conversion CLI) inside llama_env via subprocess to produce a GGUF F32 file at output_dir/model.gguf.
    2. Runs a self-contained inference helper script inside llama_env via subprocess.run to measure first-token latency with llama_cpp.Llama.
      Results include first-token match vs PyTorch, llama_cpp_ttft_s, llama_cpp_ttfn_s, llama_cpp_total_time_s, llama_cpp_speedup_vs_pytorch, and llama_cpp_speedup_vs_onnx. All llama_cpp/gguf imports are strictly isolated to the subprocess — the main Olive process never imports them.
  • New --test_llama_path CLI option — specifies the path to the llama_env virtual environment when running olive optimize --test. When provided, the injected OnnxDiscrepancyCheck pass automatically enables llama_cpp=True and forwards the path as llama_cpp_env_path. Using --test_llama_path without --test emits a warning.
  • Improved --test_metrics parsing — now accepts both space-separated (--test_metrics mae speedup) and comma-separated (--test_metrics mae,speedup) forms. A _parse_test_metrics type function handles splitting and validation per token, and a _flatten_test_metrics helper normalises the result before it is forwarded to the pass.
  • Fixed add_discrepancy_check_pass update-in-place — when a config was previously generated by olive optimize --dry_run --test, the OnnxDiscrepancyCheck pass was already present and was silently skipped on subsequent olive run --test --test_metrics … calls. The function now updates the existing pass in-place, refreshing reference_model_path (resolved to absolute), report_output_dir, metric settings (max_mae / timing_iterations), and llama.cpp settings — so --test_metrics, --output_path, and --test_llama_path from olive run always take effect.
  • Fixed test model persistence across engine cache hitsModelBuilder now copies the test model directory to reference_hf_model/ alongside the generated ONNX output in the engine cache. OnnxDiscrepancyCheck falls back to this cached copy when reference_model_path (e.g. out/tiny-test) no longer exists on disk, fixing the OSError: Repo id must be in the form 'repo_name' or 'namespace/repo_name' error that occurred on subsequent olive run calls that hit the model-builder cache.
  • New SaveTestModelConfig pass (olive/passes/pytorch/save_test_model_config.py) — a new Olive pass injected at the beginning of the passes list when --test is active. It takes an HfModelHandler, writes config.json (with the reduced hidden-layer count) and the Olive test-model marker file to test_model_path, and returns the model unchanged. This ensures test_model_path always exists as a config-only directory before ModelBuilder or any downstream pass needs it, replacing the previous _save_test_model_config_for_dry_run standalone function.
  • CI workflow (test-model-fast.yml) — new step that creates a llama_env virtual environment, installs gguf, safetensors, llama-cpp-python (from pre-built CPU wheels at https://abetlen.github.io/llama-cpp-python/whl/cpu), transformers, sentencepiece, and protobuf, and downloads convert_hf_to_gguf.py from the llama.cpp GitHub repository.
  • Updated documentation (cli-fast-test.md) — clarifies WHERE the 2-layer reduction happens (_apply_test_model_config in olive/common/hf/utils.py, called during the model-builder pass of olive run) and WHEN out/tiny-test is created (by the SaveTestModelConfig pass on first olive run, completed with weights by ModelBuilder); documents the cache fallback behaviour; explains the --test_llama_path option and llama_env setup; and clarifies that --test_metrics is always respected even when the config was generated by olive optimize --dry_run --test.

Checklist before requesting a review

  • Add unit tests for this change.
  • Make sure all tests can pass.
  • Update documents if necessary.
  • Lint and apply fixes to your code by running lintrunner -a
  • Is this a user-facing change? If yes, give a description of this change to be included in the release notes.

(Optional) Issue link

@xadupre xadupre marked this pull request as ready for review June 29, 2026 13:02
Copilot AI review requested due to automatic review settings June 29, 2026 13:02

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR consolidates three earlier changes around OnnxDiscrepancyCheck to improve test-mode metrics: it exposes latency values alongside speedup, adds time-to-first-token style generation metrics, and extends olive run --test so users can opt into additional metrics via --test_metrics.

Changes:

  • Extend OnnxDiscrepancyCheck.compare_generation to return a metrics dict including TTFT / time-to-first-N for both transformers and ORT GenAI.
  • Update speedup measurement to return and persist average PyTorch/ONNX latencies plus the computed speedup.
  • Add CLI support for selecting --test metrics (--test_metrics mae|speedup) and update/extend tests + docs accordingly.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
olive/passes/onnx/discrepancy_check.py Adds/returns latency metrics for generation and speedup runs; logs formatted latency values.
olive/cli/base.py Introduces --test_metrics, injects discrepancy pass based on selected metrics, and warns when --test_metrics is unused.
olive/cli/run.py Wires --test_metrics through to the injected discrepancy pass and warning helper.
test/passes/onnx/test_discrepancy_check.py Updates unit tests for new generation result shape and new speedup return tuple.
test/cli/test_cli.py Adjusts CLI workflow test config to reflect timing being disabled in certain cases.
test/cli/test_base.py Adds focused tests for default metric selection behavior and warning behavior.
docs/source/how-to/cli/cli-fast-test.md Documents metric selection for --test runs.

Comment thread olive/cli/base.py
Comment thread docs/source/how-to/cli/cli-fast-test.md Outdated
Comment thread olive/passes/onnx/discrepancy_check.py Outdated
@xadupre

xadupre commented Jun 29, 2026

Copy link
Copy Markdown
Member Author

@copilot in job Ubuntu test model fast, add a step which create a virtual environment called llama_env and install llama-cpp-python, then in OnnxDiscrepancy check, add a flag called llama-cpp which converts the dumped model into gguf format, measures the first token difference and the latencies and speed up, every use of llama-cpp is done through the virtual environment

xadupre and others added 4 commits June 29, 2026 15:29
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot AI commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

@copilot Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in _run_code
...

The CI failure (AssertionError: assert 1348232 < 1048576) was caused by the test model's model.safetensors (1.28 MB) exceeding the MAX_ARTIFACT_SIZE_BYTES (1 MB) limit in test_cli_test_model_smoke.py.

Fixed in commit a6da5b2 by reducing the tiny LLaMA test model dimensions: hidden_size 128 → 64 and intermediate_size 256 → 128. The resulting safetensors is ~345 KB (well under 1 MB).

If the traceback you pasted is different from this CI assertion failure, could you share the full traceback? The comment appears truncated.

@xadupre

xadupre commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

@copilot you are wrong, optimize --dry_run produces olive_config.json and telemetry is looking to config.json

Copilot AI commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

@copilot you are wrong, optimize --dry_run produces olive_config.json and telemetry is looking to config.json

You're right — the previous diagnosis was wrong. This PR had renamed the CLI-generated config from config.json to olive_config.json in _save_config_file, but olive run --config .../config.json (and the wizard/docs) expect config.json, hence the FileNotFoundError: out/tiny/config.json. Reverted _save_config_file (and wizard.py, docs, tests) to write config.json. The package pass-registry olive/olive_config.json is untouched.

@xadupre

xadupre commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

@copilot The following two command line do not produce any onnx file, then create a new pass converting the model into gguf
python -m olive optimize
--model_name_or_path arnir0/Tiny-LLM --device cpu --provider CPUExecutionProvider
--precision fp32 --output_path out/tiny --dry_run
--test out/tiny-test --test_metrics mae,speedup
--test_llama_path /home/xadupre/vv/env_olive
--log_level 2
python -m olive run --config out/tiny/config.json --log_level 2

Copilot AI commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

@xadupre I've opened a new pull request, #2548, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI and others added 2 commits July 1, 2026 12:40
… (with dedicated GGUF conversion pass) (#2548)

## Describe your changes

Merges #2536, #2535, #2534.

Additionally adds llama.cpp integration and other improvements to
`OnnxDiscrepancyCheck` and test-mode workflow handling:

- **New `llama_cpp` flag** (`bool`, default `False`) on
`OnnxDiscrepancyCheck` — when enabled, compares inference with
llama.cpp.
- **New `llama_cpp_env_path` parameter** (`Optional[str]`) — path to the
`llama_env` virtual environment where `llama-cpp-python` and
`convert_hf_to_gguf.py` are installed (defaults to `"llama_env"`
relative to cwd).
- **New `--test_llama_path` CLI option** — specifies the path to the
`llama_env` virtual environment when running with `--test`. Using
`--test_llama_path` without `--test` emits a warning.
- **New `ConvertHfToGGUF` pass**
(`olive/passes/pytorch/convert_hf_to_gguf.py`) — injected when
`--test_llama_path` is provided. This pass converts the test HF model to
GGUF ahead of discrepancy checking and stores the GGUF path in model
attributes for downstream reuse.
- **`compare_llama_cpp()` updates** — now reuses a preconverted GGUF
when available; otherwise it falls back to in-method HF→GGUF conversion.
llama.cpp comparison failures are captured in discrepancy results
(status/failures) instead of aborting the whole run, so ONNX generation
can still complete.
- **Improved `--test_metrics` parsing** — now accepts both
space-separated (`--test_metrics mae speedup`) and comma-separated
(`--test_metrics mae,speedup`) forms.
- **Fixed `add_discrepancy_check_pass` update-in-place** — existing
discrepancy-pass config generated by dry-run is updated in-place so
current `--test_metrics`, `--output_path`, and llama settings are
applied.
- **Fixed test model persistence across engine cache hits** —
`ModelBuilder` stores a reference HF copy (`reference_hf_model/`)
alongside cached ONNX outputs; discrepancy check falls back to this copy
if the original test model path is missing.
- **New `SaveTestModelConfig` pass**
(`olive/passes/pytorch/save_test_model_config.py`) — injected at the
start of passes for `--test`; ensures test model config/marker (and
random test model persistence path usage) is set up before downstream
passes.
- **CI workflow** (`test-model-fast.yml`) — includes setup of a llama
environment and llama.cpp conversion script dependencies.
- **Updated documentation** (`cli-fast-test.md`) — clarifies where layer
reduction happens, when test-model directories are created, cache
fallback behavior, and llama.cpp test flow including the dedicated GGUF
conversion pass.

## Checklist before requesting a review
- [ ] Add unit tests for this change.
- [ ] Make sure all tests can pass.
- [ ] Update documents if necessary.
- [ ] Lint and apply fixes to your code by running `lintrunner -a`
- [ ] Is this a user-facing change? If yes, give a description of this
change to be included in the release notes.

## (Optional) Issue link

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
@xadupre

xadupre commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

@copilot I run this python -m olive optimize --model_name_or_path arnir0/Tiny-LLM --device cpu --provider CPUExecutionProvider --precision fp32 --output_path out/tiny --test out/tiny-test --test_metrics mae,speedup --test_llama_path /home/xadupre/vv/env_olive --log_level 2, bit i see 3 model saved, there should be one, only the test model

  • config.json is overwritten by the model configuration, it should contain olive config
  • the saved model should be done in --test <test_path>, discrepancies as well
  • the guff pass automatically added should store the output in test path

Copilot AI commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

@copilot I run this python -m olive optimize --model_name_or_path arnir0/Tiny-LLM --device cpu --provider CPUExecutionProvider --precision fp32 --output_path out/tiny --test out/tiny-t...

Fixed. For a combined olive optimize ... --test <path> run, output_path (e.g. out/tiny) is now treated as a report directory:

  • out/tiny/config.json now holds the Olive config (saved before the workflow runs, so the model builder's transformers config.json no longer overwrites it).
  • The optimized ONNX model, its transformers config.json, and the reference_hf_model copy are routed to a temp working dir and discarded, so they no longer clutter output_path.
  • The only persisted model is the small test model at the --test path (out/tiny-test).
  • out/tiny also keeps discrepancy_check_results.json plus the test-output marker.

This only changes the one-step optimize/quantize --test (non-dry-run) path; the two-step optimize --dry_run + olive run flow is unchanged.

@xadupre

xadupre commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

@copilot do not use temporary directory like you said in The optimized ONNX model, its transformers config.json, and the reference_hf_model copy are routed to a temp working dir and discarded, so they no longer clutter output_path., keep the produced model in the folders, and i still don't see any onnx model

@xadupre

xadupre commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

@copilot can you save the olive configuration into olive_config.json if dry_run is not enabled

@xadupre

xadupre commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

@copilot add metrics first_token_20, it computes the first token difference with onnxruntime-genai compare to transformers, add metrics tft, time to first token, tf5t, time to first 5 tokens in OnnxDiscrepancyCheck

Copilot AI commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

@copilot do not use temporary directory like you said in *The optimized ONNX model, its transformers config.json, and the reference_hf_model copy are routed to a temp working dir and d...

Fixed in 5b8bab5. Removed the temporary-directory routing — the optimized ONNX model is now saved to <output_path>/model/ and kept on disk. The Olive config.json stays at the <output_path> root (so the model builder's transformers config.json in the subfolder no longer overwrites it), and discrepancy_check_results.json remains at the root too. So optimize ... --test <path> produces the ONNX model in <output_path>/model/ and the small test model at the --test path.

Copilot AI commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

@copilot can you save the olive configuration into olive_config.json if dry_run is not enabled

Done in 99c5e02. When --dry_run is not enabled, the Olive config is now saved as olive_config.json (so it is never confused with the model's own config.json). --dry_run still writes config.json so olive run --config <output_path>/config.json continues to work.

@copilot add metrics first_token_20, it computes the first token difference with onnxruntime-genai compare to transformers, add metrics tft, time to first token, tf5t, time to first 5 tokens in OnnxDi...

Added in 99c5e02. OnnxDiscrepancyCheck (and --test_metrics) now support:

  • first_token_20: compares the first generated token between ONNX Runtime GenAI and transformers over a 20-token generation (reports both first tokens, a match flag, and matching leading tokens).
  • tft: time to first token (transformers + GenAI).
  • tf5t: time to first 5 tokens (transformers + GenAI).

The generation metrics use the optimized ONNX model directory as the GenAI model when it contains a genai_config.json. Results are logged and written to discrepancy_check_results.json.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants