microsoft · xadupre · Jun 22, 2026
diff --git a/.github/workflows/test-model-fast.yml b/.github/workflows/test-model-fast.yml
@@ -30,6 +30,18 @@ jobs:
           python -m pip install -r requirements.txt
           python -m pip install -r test/requirements-test-cpu.txt
 
+      - name: Create llama_env and install llama-cpp-python
+        run: |
+          LLAMA_ENV="$(pwd)/llama_env"
+          python -m venv "$LLAMA_ENV"
+          "$LLAMA_ENV/bin/pip" install --upgrade pip
+          "$LLAMA_ENV/bin/pip" install gguf safetensors llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
+          "$LLAMA_ENV/bin/pip" install transformers sentencepiece protobuf tabulate
+          git clone --depth=1 --filter=blob:none --sparse https://github.com/ggerganov/llama.cpp.git /tmp/llama_cpp_repo
+          git -C /tmp/llama_cpp_repo sparse-checkout set convert_hf_to_gguf.py conversion --skip-checks
+          cp /tmp/llama_cpp_repo/convert_hf_to_gguf.py "$LLAMA_ENV/"
+          cp -r /tmp/llama_cpp_repo/conversion "$LLAMA_ENV/"
+
       - name: pip freeze
         run: |
           python -m pip freeze

diff --git a/docs/source/how-to/cli/cli-fast-test.md b/docs/source/how-to/cli/cli-fast-test.md
@@ -2,12 +2,10 @@
 
 If you are converting a large language model, it is often useful to validate the Olive command, environment, and conversion recipe on a much smaller model before spending time on the full checkpoint.
 
-The `--test` option does that for Hugging Face models. Olive keeps the same model architecture, reduces it to a random 2-layer test model, saves it to the folder you provide, and reuses that folder on later runs.
+The `--test` option does that for Hugging Face models. Olive keeps the same model architecture, reduces it to a random **2-layer** test model, saves it to the folder you provide, and reuses that folder on later runs.
 
 This example uses [`Qwen/Qwen3-0.6B`](https://huggingface.co/Qwen/Qwen3-0.6B), but the same pattern works for other supported Hugging Face LLMs.
 
-## Step 1: generate the workflow config
-
 Start by generating the config that Olive will run for the Qwen conversion.
 
 ```bash
@@ -17,61 +15,24 @@ olive optimize \
     --provider CPUExecutionProvider \
     --precision int4 \
     --output_path out/qwen \
-    --dry_run
+    --test out/qwen-test-model
 ```
 
-This creates `out/qwen/config.json` without launching the full conversion yet.
-
-## Step 2: run a fast smoke test with `olive run --test`
-
-Use the generated config with `olive run` and pass `--test` so Olive swaps in a reduced random Qwen model.
-
-```bash
-olive run \
-    --config out/qwen/config.json \
-    --test out/qwen-test-model \
-    --output_path out/qwen-test-run
-```
-
-What this does:
-
-- `--test out/qwen-test-model` creates a reduced random Qwen model and saves it in `out/qwen-test-model`
-- later runs reuse the same saved test model instead of recreating it
-- `--output_path out/qwen-test-run` gives the smoke test its own output folder, so the generated ONNX artifacts are easy to find
-- Olive marks that output folder as a test-only run and refuses to reuse a non-test conversion folder for `--test`
-
-After the smoke test finishes, look under `out/qwen-test-run` for the exported ONNX model and related files.
-
-This is a quick way to confirm that:
-
-- Olive can load the source model
-- the selected optimization recipe is valid for your setup
-- the conversion path completes before you run the full model
-
-If you omit the folder and just pass `--test`, `olive run` will save the reduced model under `<output_path>/test_model`.
-
-## Step 3: run the full conversion
-
-Once the smoke test succeeds, rerun the conversion on the full Qwen checkpoint by removing `--test`.
-
-```bash
-olive run \
-    --config out/qwen/config.json \
-    --output_path out/qwen-full
-```
+Because this example runs without `--dry_run`, it produces:
 
-At this point you know the Olive command and the conversion recipe already worked on the lightweight test model, so you can focus on the full-model run instead of debugging both at once.
+- `out/qwen/olive_config.json` — the Olive configuration used for the run (named `olive_config.json` so it is never confused with the model's own `config.json`).
+- `out/qwen/model/` — the optimized ONNX model.
+- `out/qwen/discrepancy_check_results.json` — the discrepancy report.
 
-## Why keep the test model folder?
+The `--test` option also inserts an `OnnxDiscrepancyCheck` pass (if one is not already present) that compares the generated ONNX model against the 2-layer reference model.
 
-The saved test model is useful beyond the first smoke test:
+Additional metrics can be requested via `--test_metrics` (space- or comma-separated):
 
-- you can rerun the reduced conversion quickly while iterating on options
-- you can reuse the same HF test model later when comparing the Hugging Face model against the exported ONNX model
-- you avoid recreating a new random test checkpoint every time
+- `speedup`: ONNX-vs-PyTorch inference latency
+- `first_token_20`: compares the first generated token (over a 20-token generation) between ONNX Runtime GenAI and transformers
+- `tft`: time to the first generated token (reported for both ONNX Runtime GenAI and transformers)
+- `tf5t`: time to the first 5 generated tokens (reported for both ONNX Runtime GenAI and transformers)
 
-## Related docs
+For example, `--test_metrics mae,speedup,first_token_20,tft,tf5t`. The generation metrics (`first_token_20`, `tft`, `tf5t`) use the optimized ONNX model directory as the ONNX Runtime GenAI model when it contains a `genai_config.json` (as produced by the model builder).
 
-- [How to use the `olive optimize` command to optimize a Pytorch model](cli-optimize)
-- [How to write a new workflow from scratch](../configure-workflows/build-workflow)
-- [CLI reference](../../reference/cli)
+> **Note:** `--test_metrics` is always respected even when the config was generated by `olive optimize --test`, because Olive updates the existing `OnnxDiscrepancyCheck` settings each time `olive run --test` is invoked.
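
Reviewer note: the updated page lists three artifacts that a `--test` run writes under the output path, plus an optional `genai_config.json` that the generation metrics depend on. A quick post-run check along these lines might make the smoke test easier to verify. This is only a sketch: `check_test_outputs` is a hypothetical helper that is not part of Olive, and the only paths it checks are the ones named in the documentation above.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Olive): verify the artifacts that the
# documentation says a `--test` run writes under the output path.
check_test_outputs() {
    local out_dir="$1"
    local missing=0

    # Olive configuration used for the run.
    if [ ! -f "$out_dir/olive_config.json" ]; then
        echo "missing: $out_dir/olive_config.json"
        missing=1
    fi

    # Folder holding the optimized ONNX model.
    if [ ! -d "$out_dir/model" ]; then
        echo "missing: $out_dir/model/"
        missing=1
    fi

    # Discrepancy report.
    if [ ! -f "$out_dir/discrepancy_check_results.json" ]; then
        echo "missing: $out_dir/discrepancy_check_results.json"
        missing=1
    fi

    # The generation metrics (first_token_20, tft, tf5t) use the model folder
    # as the ONNX Runtime GenAI model when it contains genai_config.json.
    if [ -f "$out_dir/model/genai_config.json" ]; then
        echo "found: $out_dir/model/genai_config.json"
    else
        echo "note: no genai_config.json in $out_dir/model/"
    fi

    # Exit status is 0 only when all three documented artifacts exist.
    return "$missing"
}
```

With the example command from the page, the call would be `check_test_outputs out/qwen`. The function returns a non-zero status when any of the three documented artifacts is missing, so it can gate a later CI step.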