Open
68 commits
b21b4f6
expose latencies with the speedup in OnnxDiscrepancyCheck
xadupre Jun 22, 2026
72c5d4e
Extend discrepancy check unit test for latency tuple
Copilot Jun 22, 2026
bf0a978
add time to first token in OnnxDiscrepancyCheck
xadupre Jun 22, 2026
804bb92
Add return type annotation to _measure_speedup
Copilot Jun 22, 2026
1bdee25
Add latency key assertions to fully matching discrepancy test
Copilot Jun 22, 2026
142ddea
Handle zero max_new_tokens in generation metrics
Copilot Jun 22, 2026
39cac1c
Use single measured transformers generation for latency metrics
Copilot Jun 22, 2026
6b5b652
extend command line --test to trigger speedup measure
xadupre Jun 22, 2026
15287d8
Document --test_metrics speedup usage
Copilot Jun 22, 2026
89d98c4
Fix default test metrics to be mae-only, make speedup opt-in
Copilot Jun 22, 2026
7490e55
Fix test to match new default mae-only behavior
Copilot Jun 22, 2026
bf96e3f
Fix test_cli.py expected pass config to include timing_iterations=0
Copilot Jun 22, 2026
d34ecd5
Merge branch 'main' into xadupre/lat
xadupre Jun 25, 2026
141f35a
Merge branch 'main' into xadupre/tts
xadupre Jun 25, 2026
1e8c020
Merge branch 'main' into xadupre/cmd
xadupre Jun 25, 2026
3370626
Merge branch 'main' into xadupre/cmd
xadupre Jun 26, 2026
9c7365d
Merge branch 'main' into xadupre/tts
xadupre Jun 26, 2026
922089a
Merge branch 'main' into xadupre/lat
xadupre Jun 26, 2026
0741621
Merge branch 'xadupre/tts' of https://github.com/microsoft/Olive into…
xadupre Jun 29, 2026
4c7938b
Merge branch 'xadupre/lat' of https://github.com/microsoft/Olive into…
xadupre Jun 29, 2026
fc5c372
Potential fix for pull request finding
xadupre Jun 29, 2026
f1077b7
Potential fix for pull request finding
xadupre Jun 29, 2026
c1e75e6
Potential fix for pull request finding
xadupre Jun 29, 2026
4c73d0a
feat: add llama-cpp integration to OnnxDiscrepancyCheck and llama_env…
Copilot Jun 29, 2026
6dffbab
refactor: use save_pretrained for llama-cpp GGUF conversion and insta…
Copilot Jun 29, 2026
aeef3df
Use convert_hf_to_gguf.py CLI for GGUF conversion instead of custom f…
Copilot Jun 29, 2026
f8192d3
Remove duplicate pytest import inside test function
Copilot Jun 29, 2026
78f77ed
Add --test_llama_path CLI option for specifying llama_env virtual env…
Copilot Jun 29, 2026
d1a40a6
Support comma-separated values for --test_metrics (e.g. mae,speedup)
Copilot Jun 29, 2026
0bc6e0a
fix path
xadupre Jun 29, 2026
3ba9f1c
Store GGUF and HF model files in output_dir instead of temp directory
Copilot Jun 29, 2026
568699b
add missing depenencies
xadupre Jun 29, 2026
b9fa512
Fix convert_hf_to_gguf.py: clone conversion/ directory alongside script
Copilot Jun 29, 2026
55a6515
add missing arguments'
xadupre Jun 29, 2026
b084fc3
Fix CI step: use LLAMA_ENV var and git -C to avoid cd changing cwd
Copilot Jun 29, 2026
c08375f
add missing argument
xadupre Jun 29, 2026
35c191d
lint
xadupre Jun 29, 2026
8a44f3a
Fix 404 error: resolve relative test_model_path to absolute before HF…
Copilot Jun 29, 2026
d65d753
Add num_hidden_layers parameter to OnnxDiscrepancyCheck (default 2)
Copilot Jun 30, 2026
a6dcb8e
update documentation
xadupre Jun 30, 2026
8f4eff2
Fix add_discrepancy_check_pass to update existing pass; add llama_cpp…
Copilot Jun 30, 2026
ca2c094
Fix test model cache persistence + add num_hidden_layers to OnnxDiscr…
Copilot Jun 30, 2026
e76ef03
remove num_hidden_layers
xadupre Jun 30, 2026
d195d9f
Pre-create test model config dir during --dry_run --test
Copilot Jun 30, 2026
bce0baf
Add SaveTestModelConfig pass to create test model config directory
Copilot Jun 30, 2026
0483c52
SaveTestModelConfig pass now saves random model weights in addition t…
Copilot Jun 30, 2026
df9602e
Fix test_metrics not saved in dry_run: always write timing_iterations…
Copilot Jul 1, 2026
34a08e5
Fix missing ref_model_path in test calls and write timing_iterations …
Copilot Jul 1, 2026
5bf4087
Add test_metrics parameter to OnnxDiscrepancyCheck and store it in ge…
Copilot Jul 1, 2026
b97c17e
Fix crash when formatting MAE threshold with test_metrics
Copilot Jul 1, 2026
da0d77d
Add attn_impl parameter to OnnxDiscrepancyCheck for configurable atte…
Copilot Jul 1, 2026
cc15cbf
Rename Olive CLI config from config.json to olive_config.json to prev…
Copilot Jul 1, 2026
5e1512f
Add logging to OnnxDiscrepancyCheck and switch attn_implementation de…
Copilot Jul 1, 2026
6c512b9
todo
xadupre Jul 1, 2026
4e34563
fix
xadupre Jul 1, 2026
cf849f6
Reduce test model dimensions to fix CI artifact size check
Copilot Jul 1, 2026
c27d103
Save CLI dry-run config as config.json instead of olive_config.json
Copilot Jul 1, 2026
1e8a0d1
Merge 3 existing PRs for OnnxDiscrepancyCheck + llama.cpp integration…
Copilot Jul 1, 2026
230cdec
refactor
xadupre Jul 1, 2026
5dd9bf4
Keep only Olive config + test model in output_path for optimize --test
Copilot Jul 1, 2026
811ad65
Keep optimized ONNX model in output_path/model instead of temp dir fo…
Copilot Jul 1, 2026
99c5e02
Add olive_config.json save + first_token_20/tft/tf5t generation metrics
Copilot Jul 1, 2026
50f7a4f
Fix fast test: set GPTQ group_size=32 for tiny test model (hidden_siz…
Copilot Jul 1, 2026
31ecf23
documentation
xadupre Jul 1, 2026
5068b0f
Merge branch 'xadupre/merged' of https://github.com/microsoft/Olive i…
xadupre Jul 1, 2026
37a4e48
Move attn_impl to SaveTestModelConfig; OnnxDiscrepancyCheck uses save…
Copilot Jul 1, 2026
38eea95
Merge remote-tracking branch 'origin/xadupre/merged' into xadupre/merged
Copilot Jul 1, 2026
d44373f
Fix int32 JSON serialization error in OnnxDiscrepancyCheck results
Copilot Jul 1, 2026
12 changes: 12 additions & 0 deletions .github/workflows/test-model-fast.yml
@@ -30,6 +30,18 @@ jobs:
          python -m pip install -r requirements.txt
          python -m pip install -r test/requirements-test-cpu.txt

      - name: Create llama_env and install llama-cpp-python
        run: |
          LLAMA_ENV="$(pwd)/llama_env"
          python -m venv "$LLAMA_ENV"
          "$LLAMA_ENV/bin/pip" install --upgrade pip
          "$LLAMA_ENV/bin/pip" install gguf safetensors llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
          "$LLAMA_ENV/bin/pip" install transformers sentencepiece protobuf tabulate gguf
          git clone --depth=1 --filter=blob:none --sparse https://github.com/ggerganov/llama.cpp.git /tmp/llama_cpp_repo
          git -C /tmp/llama_cpp_repo sparse-checkout set convert_hf_to_gguf.py conversion --skip-checks
          cp /tmp/llama_cpp_repo/convert_hf_to_gguf.py "$LLAMA_ENV/"
          cp -r /tmp/llama_cpp_repo/conversion "$LLAMA_ENV/"

      - name: pip freeze
        run: |
          python -m pip freeze
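If you reproduce this CI step locally, it helps to check the resulting environment before handing it to Olive. The sketch below uses `check_llama_env`, a hypothetical helper that is not an Olive or workflow command. It assumes the same `llama_env` layout the step creates and only verifies the files the step puts there.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Olive or the workflow): checks that a
# llama_env directory contains what the CI step above creates.
check_llama_env() {
  local env_dir="$1"
  # `python -m venv` provides bin/python inside the environment.
  [ -x "$env_dir/bin/python" ] || { echo "no virtualenv at $env_dir" >&2; return 1; }
  # Files copied out of the sparse llama.cpp checkout.
  [ -f "$env_dir/convert_hf_to_gguf.py" ] || { echo "convert_hf_to_gguf.py missing" >&2; return 1; }
  [ -d "$env_dir/conversion" ] || { echo "conversion/ missing" >&2; return 1; }
  echo "llama_env looks complete: $env_dir"
}

# Usage: check_llama_env "$(pwd)/llama_env"
```

To confirm the package itself installed, `"$LLAMA_ENV/bin/python" -c "import llama_cpp"` imports the module that llama-cpp-python provides. The commit list also mentions a `--test_llama_path` CLI option for pointing Olive at such a virtual environment.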
67 changes: 14 additions & 53 deletions docs/source/how-to/cli/cli-fast-test.md
@@ -2,12 +2,10 @@

If you are converting a large language model, it is often useful to validate the Olive command, environment, and conversion recipe on a much smaller model before spending time on the full checkpoint.

The `--test` option does that for Hugging Face models. Olive keeps the same model architecture, reduces it to a random 2-layer test model, saves it to the folder you provide, and reuses that folder on later runs.
The `--test` option does that for Hugging Face models. Olive keeps the same model architecture, reduces it to a random **2-layer** test model, saves it to the folder you provide, and reuses that folder on later runs.

This example uses [`Qwen/Qwen3-0.6B`](https://huggingface.co/Qwen/Qwen3-0.6B), but the same pattern works for other supported Hugging Face LLMs.

## Step 1: generate the workflow config

Start by generating the config that Olive will run for the Qwen conversion.

```bash
@@ -17,61 +15,24 @@ olive optimize \
--provider CPUExecutionProvider \
--precision int4 \
--output_path out/qwen \
--dry_run
--test out/qwen-test-model
```

This creates `out/qwen/config.json` without launching the full conversion yet.

## Step 2: run a fast smoke test with `olive run --test`

Use the generated config with `olive run` and pass `--test` so Olive swaps in a reduced random Qwen model.

```bash
olive run \
--config out/qwen/config.json \
--test out/qwen-test-model \
--output_path out/qwen-test-run
```

What this does:

- `--test out/qwen-test-model` creates a reduced random Qwen model and saves it in `out/qwen-test-model`
- later runs reuse the same saved test model instead of recreating it
- `--output_path out/qwen-test-run` gives the smoke test its own output folder, so the generated ONNX artifacts are easy to find
- Olive marks that output folder as a test-only run and refuses to reuse a non-test conversion folder for `--test`

After the smoke test finishes, look under `out/qwen-test-run` for the exported ONNX model and related files.

This is a quick way to confirm that:

- Olive can load the source model
- the selected optimization recipe is valid for your setup
- the conversion path completes before you run the full model

If you omit the folder and just pass `--test`, `olive run` will save the reduced model under `<output_path>/test_model`.
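The fallback rule in that sentence can be sketched as a small shell function. `resolve_test_model_dir` is a hypothetical name that is not part of Olive. It only mirrors the documented path choice, not Olive's actual code.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Olive) mirroring the documented rule for
# where `olive run --test` stores the reduced model.
#   $1: folder passed to --test (empty when --test is given without a value)
#   $2: value of --output_path
resolve_test_model_dir() {
  local test_arg="$1"
  local output_path="$2"
  if [ -n "$test_arg" ]; then
    # An explicit folder is used as-is and reused on later runs.
    printf '%s\n' "$test_arg"
  else
    # A bare --test falls back to <output_path>/test_model.
    printf '%s\n' "${output_path%/}/test_model"
  fi
}

resolve_test_model_dir "out/qwen-test-model" "out/qwen-test-run"  # prints out/qwen-test-model
resolve_test_model_dir "" "out/qwen-test-run"                     # prints out/qwen-test-run/test_model
```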

## Step 3: run the full conversion

Once the smoke test succeeds, rerun the conversion on the full Qwen checkpoint by removing `--test`.

```bash
olive run \
--config out/qwen/config.json \
--output_path out/qwen-full
```
Because this example runs without `--dry_run`, it produces:

At this point you know the Olive command and the conversion recipe already worked on the lightweight test model, so you can focus on the full-model run instead of debugging both at once.
- `out/qwen/olive_config.json` — the Olive configuration used for the run (named `olive_config.json` so it is never confused with the model's own `config.json`).
- `out/qwen/model/` — the optimized ONNX model.
- `out/qwen/discrepancy_check_results.json` — the discrepancy report.
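After a run, you can confirm that these three artifacts exist before inspecting them. The following is a minimal sketch. `check_test_outputs` is a hypothetical helper, not an Olive command, and it checks only the paths listed above.

```bash
#!/usr/bin/env bash
# Hypothetical post-run check (not an Olive command): reports which of the
# documented artifacts are missing from an `olive optimize --test` output folder.
check_test_outputs() {
  local out="$1"
  local missing=0
  local f
  # olive_config.json and the discrepancy report are plain files.
  for f in olive_config.json discrepancy_check_results.json; do
    if [ ! -f "$out/$f" ]; then
      echo "missing: $out/$f"
      missing=1
    fi
  done
  # The optimized ONNX model is written to a directory.
  if [ ! -d "$out/model" ]; then
    echo "missing: $out/model/"
    missing=1
  fi
  return "$missing"
}

# Usage: check_test_outputs out/qwen
```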

## Why keep the test model folder?
The `--test` option also adds an `OnnxDiscrepancyCheck` pass (unless the workflow already has one) that compares the generated ONNX model against the 2-layer reference model.

The saved test model is useful beyond the first smoke test:
By default the check reports only `mae`. Request additional metrics with `--test_metrics` (space- or comma-separated):

- you can rerun the reduced conversion quickly while iterating on options
- you can reuse the same HF test model later when comparing the Hugging Face model against the exported ONNX model
- you avoid recreating a new random test checkpoint every time
- `speedup`: inference speedup of the ONNX model over PyTorch, reported together with both latencies
- `first_token_20`: compares the first generated token (over a 20-token generation) between ONNX Runtime GenAI and transformers
- `tft`: time to the first generated token (reported for both ONNX Runtime GenAI and transformers)
- `tf5t`: time to the first 5 generated tokens (reported for both ONNX Runtime GenAI and transformers)
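Because `--test_metrics` accepts both separators, `mae,speedup tft` and `mae speedup tft` request the same set of metrics. The sketch below is a hypothetical illustration of that splitting. `normalize_test_metrics` and the `KNOWN_METRICS` list are assumptions drawn from the names in this doc, and Olive's own parser may behave differently.

```bash
#!/usr/bin/env bash
# Hypothetical illustration: split a space- or comma-separated --test_metrics
# value into metric names and reject anything not listed in this doc.
KNOWN_METRICS="mae speedup first_token_20 tft tf5t"

normalize_test_metrics() {
  local name
  # Turn commas into spaces, then let word splitting separate the names.
  for name in $(printf '%s ' "$@" | tr ',' ' '); do
    case " $KNOWN_METRICS " in
      *" $name "*) printf '%s\n' "$name" ;;
      *) echo "unknown metric: $name" >&2; return 1 ;;
    esac
  done
}

normalize_test_metrics mae,speedup tft   # prints mae, speedup, tft on separate lines
```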

## Related docs
For example, `--test_metrics mae,speedup,first_token_20,tft,tf5t`. The generation metrics (`first_token_20`, `tft`, `tf5t`) use the optimized ONNX model directory as the ONNX Runtime GenAI model when it contains a `genai_config.json` (as produced by the model builder).
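Whether the generation metrics have an ONNX Runtime GenAI model to work with depends on a single file. A quick way to check that condition by hand is sketched below. `supports_generation_metrics` is a hypothetical helper, and `out/qwen/model` assumes the output layout from this example.

```bash
#!/usr/bin/env bash
# Hypothetical check (not an Olive command): the generation metrics use the
# optimized model directory as a GenAI model only if genai_config.json is there.
supports_generation_metrics() {
  [ -f "$1/genai_config.json" ]
}

if supports_generation_metrics out/qwen/model; then
  echo "out/qwen/model has genai_config.json"
else
  echo "out/qwen/model has no genai_config.json"
fi
```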

- [How to use the `olive optimize` command to optimize a Pytorch model](cli-optimize)
- [How to write a new workflow from scratch](../configure-workflows/build-workflow)
- [CLI reference](../../reference/cli)
> **Note:** `--test_metrics` is always respected even when the config was generated by `olive optimize --test`, because Olive updates the existing `OnnxDiscrepancyCheck` settings each time `olive run --test` is invoked.