Hnimrama/inferencemax uplift restore by hnimra-amd · Pull Request #229 · ROCm/cvs

hnimra-amd · 2026-06-18T05:01:03Z

Summary

Restores the InferenceMax / DTNI uplift from Hnimrama/inferencemax uplift #225 after it was reverted by Revert "Hnimrama/inferencemax uplift" #228.
Adds a follow-up fix so the vLLM benchmark client runs with the same Python that imports vllm (ROCm images where python3 is not the vLLM interpreter), and hardens docker exec / bash -c quoting.

Test plan

cvs run inferencemax_single (or your usual InferenceMax path) on MI300x + current vLLM container.

…inery Rename cvs/lib/dtni to cvs/lib/utils ("utils" says what it is: pure functions any lib can call; "dtni" was a leftover project codename). verdict.py moves unchanged. config_loader.py is trimmed to the framework-agnostic half: the paths/model/image/container schema, the 3-pass placeholder substitution, the enforce_thresholds gate on a new BaseVariantConfig, and a substitute_config() helper (file read + substitution + sibling-threshold discovery). The inference-only schema moves to a sibling module in a later commit.

The client.* metric parser (to_client_metrics + CLIENT_METRICS surface) is inference-specific and should not sit in the shared utils dir. Move it under a new cvs/lib/inference/utils package. Content unchanged.

The inference half of the old config_loader: GoodputSlo, SeqCombo, Sweep, Params, Roles, and VariantConfig(BaseVariantConfig) with cell_key and the threshold-coverage check. load_variant() delegates the file read and placeholder substitution to the generic substitute_config(). Replaces the sequence_combinations x concurrency_levels cartesian with a named-combo + explicit runs[] selector: each run is a {combo, concurrency} pair, so the config enumerates exactly the cells to run (no NxM explosion). A model_validator rejects duplicate combo names and runs referencing an unknown combo at load time.
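A minimal sketch of the named-combo + runs[] selector described in this commit, using plain dicts instead of the PR's Pydantic models. The field names (`name`, `isl`, `osl`, `combo`, `concurrency`) and the helpers `check_selector` / `expand_runs` are assumptions for illustration, not the actual schema or API.

```python
def check_selector(combos, runs):
    """Return every selector error at once; an empty list means valid."""
    errors = []
    seen = set()
    for combo in combos:
        name = combo["name"]
        if name in seen:
            errors.append(f"duplicate combo name: {name}")
        seen.add(name)
    for i, run in enumerate(runs):
        if run["combo"] not in seen:
            errors.append(f"runs[{i}] references unknown combo: {run['combo']}")
    return errors


def expand_runs(combos, runs):
    """Expand runs[] into exactly the cells to execute (no NxM cartesian)."""
    errors = check_selector(combos, runs)
    if errors:
        raise ValueError("; ".join(errors))
    by_name = {c["name"]: c for c in combos}
    cells = []
    for run in runs:
        combo = by_name[run["combo"]]
        cells.append(
            {"isl": combo["isl"], "osl": combo["osl"], "concurrency": run["concurrency"]}
        )
    return cells
```

With two combos and one run, only one cell is produced, which is the point of the selector: the config lists the cells rather than multiplying two axes.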

…rsing build_server_cmd/start_server assemble a `vllm serve` arg list in Python (mirroring run_client) instead of cloning and running an external .sh, so a run needs no hand-staged script. --max-model-len is derived per cell from isl/osl/random_range_ratio so any sweep change stays self-consistent. parse_results reads the stock extensionless `results` JSON artifact that `vllm bench serve` writes to --result-dir and delegates namespacing + derived-metric math to vllm_parsing.to_client_metrics, replacing the brittle console-log regex table. Missing/empty/unparseable artifacts hard-fail the cell rather than recording a silently-green empty row. Adds the optional --goodput SLO gate (per-cell, omitted when no SLO) and threads goodput_slo through run().
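The hard-fail behaviour for missing, empty, or unparseable artifacts could look roughly like the sketch below. It reads a local file for simplicity, whereas the PR reads the artifact through the orchestrator; `ArtifactError` and `load_results_artifact` are hypothetical names, not the PR's.

```python
import json
from pathlib import Path


class ArtifactError(RuntimeError):
    """Raised when the results artifact cannot be trusted (hypothetical type)."""


def load_results_artifact(path):
    """Load the JSON results artifact or raise; never return an empty row."""
    artifact = Path(path)
    if not artifact.is_file():
        raise ArtifactError(f"results artifact missing: {artifact}")
    text = artifact.read_text()
    if not text.strip():
        raise ArtifactError(f"results artifact is empty: {artifact}")
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ArtifactError(f"results artifact is not valid JSON: {artifact}") from exc
    if not isinstance(data, dict) or not data:
        raise ArtifactError(f"results artifact has no metrics: {artifact}")
    return data
```

Raising in all four cases means a failed client shows up as a failed cell instead of a row with no values.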

pytest_generate_tests now drives parametrization from the named-combo + runs[] selector instead of the cartesian. test_vllm_inference only runs the benchmark and stashes results; the verdict moves into a new test_metric (one pytest test = one HTML row per metric per cell), with inline Value/Unit columns added via pytest_html hooks in conftest. test_setup_sshd gates its 2224 probe on len(orch.hosts) > 1, mirroring the single-node orchestrator guard: single-node runs skip the in-container sshd (it exists only for inter-node MPI) and must not probe for it. Import paths follow the dtni -> utils / inference.utils moves.

…luster file Rename the config/threshold pair to {model}_{precision}_{config|threshold} .json and convert the sweep to the named-combo + runs[] selector. Pin the image to rocm/vllm-dev:nightly (the previously pinned :nightly-sshd tag does not exist on Docker Hub, and single-node runs skip in-container sshd). Delete cvs/input/cluster_file/mi300x_vllm_single.json: a cluster file only needs node IP + user/key/orchestrator; the variant config supplies the container block, so the bespoke per-suite cluster file is redundant.

…uards Cover to_client_metrics purity + derived metrics, the named-combo/runs sweep selector (expansion, unknown-combo and duplicate-name rejection), the run_client goodput/metric-percentiles flags, table-cell rendering, and the verdict None-guards. Adds JSON fixtures for the stock results artifact.

Add a human reference guide (plans/building-a-cvs-test-suite.md) that walks the six-layer suite architecture using vllm_single as the worked example: the generic <-> framework config seam, the named-combo + runs[] sweep selector, the self-contained Python-built server cmd, lifecycle-as-tests, and a checklist for authoring a new inference or training suite. Add per-package AGENTS.md docs naming the public entry points, the seam, and the non-obvious gotchas: - cvs/lib/utils: substitute_config / BaseVariantConfig / evaluate_all, the 3-pass placeholder order, sibling-glob threshold discovery, parent-first validator ordering. - cvs/lib/inference/utils: load_variant / to_client_metrics / CLIENT_METRICS, the cell_key single-source-of-truth, the coverage check that prevents a silent green, and the validators mirrored in pytest_generate_tests.

… rules Document the dtni -> utils rename and the shared (cvs/lib/utils) vs domain-specific (cvs/lib/inference/utils) split: what lives where, the rule for placing a new helper, and the directory map. Note training is not yet ported and this guide is the blueprint for that port. Remove the manual --- horizontal rules between sections: heading levels already render their own bottom border, so the extra rules produced a double-underline. Minor prose/format cleanups.

The threshold file carried placeholder CONC=64/128/256 entries while the sweep's only run is concurrency 16, so cell_key() matched no threshold. The mismatch was masked by enforce_thresholds=false (warned, not raised) and would have failed load the instant enforcement was flipped on. Re-key to the single CONC=16 cell the runs selector actually enumerates.

MODEL/ISL/OSL/MAX_MODEL_LEN/RANDOM_RANGE_RATIO/TP/CONC/PORT were exported into /tmp/server_env_script.sh but read by nothing after the .sh->Python server command refactor -- both _server_argv and run_client pass these as explicit flags. Keep only the env the vllm process actually consumes (HF token, HF cache pin, AITER flags). Also drops the second _derive_max_model_len call that fed the dead MAX_MODEL_LEN export.

_server_argv hard-coded --kv-cache-dtype fp8, baking a per-model property into the shared orchestrator -- a non-fp8-KV model dropped into the suite would be served wrong with no config recourse (extra_serve_args can only add, so an override would pass the flag twice). Declare it in the W1 config's roles.server.extra_serve_args instead; the driver stays model-agnostic and 'new model = new config' holds.

…nd collection pytest_generate_tests hand-reimplemented the duplicate-name and unknown-run.combo checks that Sweep._check_runs_reference_known_combos already enforces, with divergent semantics (first-failure raise vs all-at-once). Extract validate_sweep_selector() as the single home and call it from both the typed validator (load time) and the collection-time raw-JSON path so the rule can't drift.

out_dir was fixed per job, so a multi-cell sweep would overwrite each cell's `results` and client.log, and parse_results could cat a prior cell's stale artifact if the current cell's client failed to write one. Key it by cell. Latent today (the shipped sweep has one cell).

run_client accepted goodput_slo as either a dict (.get) or an object (getattr) via a per-key hasattr branch, but the only production caller passes a raw dict -- the object path existed solely for a unit test, and the dual path meant the typed GoodputSlo's validation never reached the command builder. Consume the dict only and drop the object-form test.

goodput_slo_unused was never read (goodput is threaded through _make_job).

The committed vllm_single example config carried a personal hf-token path and a concrete image tag (duplicated in image.tag and container.image). Replace all three with <changeme> so the file is a template a new user must fill in -- it still loads (collection works) and only fails at run time when an unedited value is read, which is the intended signal.

The per-model server knobs were a flat [flag, value, flag, value] list (roles.server.extra_serve_args), which reads poorly. Replace with a roles.server.serve_args {flag: value} map (flag without the leading --): a scalar renders --flag value, True a bare --flag, a list the flag once per element -- so it stays readable while still covering vllm bare/repeatable flags. _server_argv flattens the map via a new _flatten_serve_args helper; the derived flags (tp/max-model-len/port) stay code-built. Also repoints a stale unit test that asserted MAX_MODEL_LEN in the env script (it moved to the --max-model-len flag in an earlier commit) to assert against the server argv instead.
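The three rendering rules for the serve_args map (scalar, True, list) can be sketched as follows. This is not the PR's `_flatten_serve_args`; the function name is hypothetical, and skipping `False`/`None` values is an assumption the commit message does not state.

```python
def flatten_serve_args(serve_args):
    """Flatten a {flag: value} map (flag without the leading --) into argv."""
    argv = []
    for flag, value in serve_args.items():
        option = f"--{flag}"
        if value is True:
            # True renders a bare flag.
            argv.append(option)
        elif value is False or value is None:
            # Assumption: a disabled flag is simply omitted.
            continue
        elif isinstance(value, (list, tuple)):
            # A list renders the flag once per element.
            for item in value:
                argv.extend([option, str(item)])
        else:
            # A scalar renders "--flag value".
            argv.extend([option, str(value)])
    return argv
```

For example, `{"kv-cache-dtype": "fp8", "bare-flag": True, "repeat-flag": ["a", "b"]}` (the last two flag names are placeholders) flattens to `["--kv-cache-dtype", "fp8", "--bare-flag", "--repeat-flag", "a", "--repeat-flag", "b"]`.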

The only ceiling kind was max_ms, whose message hard-codes ms. A count metric like client.failed needs an upper bound without the unit lie; add a plain max with the same comparison and an honest message.
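As a rough illustration of the two ceiling kinds sharing one comparison, the sketch below differs only in the unit printed in the message. The spec shape (`{"kind": ..., "value": ...}`), the inclusive `<=` comparison, the message format, and the name `check_ceiling` are all assumptions; the real verdict.py may differ.

```python
def check_ceiling(metric, value, spec):
    """Evaluate an upper-bound spec; returns (passed, message)."""
    kind = spec["kind"]
    limit = spec["value"]
    if kind not in ("max_ms", "max"):
        raise ValueError(f"unsupported ceiling kind: {kind}")
    # max_ms states milliseconds; plain max makes no claim about the unit.
    unit = " ms" if kind == "max_ms" else ""
    passed = value <= limit
    relation = "<=" if passed else ">"
    return passed, f"{metric}: {value}{unit} {relation} {limit}{unit}"
```

A count such as client.failed then reports `client.failed: 0 <= 0` rather than a misleading `0 ms`.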

Previously only cell-presence was validated: a cell could exist while a given metric had no spec, and test_metric (spec is None -> return) would silently report a green record-only row even under enforce_thresholds=true. A new perf metric was thus unvalidated by default. Declare GATED_METRICS beside CLIENT_METRICS -- the perf+health subset that must assert (throughput, mean+p99 latency, success_rate/failed) -- and extend _check_thresholds_cover_sweep to require a spec for every gated metric in every present cell, reusing the same enforce-vs-warn path. A new metric is record-only until added to the set; once gated, the loader forces a spec in every cell before the suite can run green. Inputs, totals, and derived diagnostics stay record-only by design.
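The gated-coverage rule (every gated metric needs a spec in every present cell, with the same enforce-vs-warn switch) might be sketched like this. The threshold layout, the cell key `CONC=16`, the metric name `client.success_rate`, and the function name are illustrative assumptions rather than the PR's `_check_thresholds_cover_sweep`.

```python
import warnings


def check_gated_coverage(thresholds, gated_metrics, enforce):
    """Report (cell, metric) pairs where a gated metric has no spec.

    thresholds: {cell_key: {metric_name: spec}} for the cells that are present.
    Raises when enforce is true, warns otherwise.
    """
    missing = [
        (cell, metric)
        for cell in sorted(thresholds)
        for metric in sorted(gated_metrics)
        if metric not in thresholds[cell]
    ]
    if missing:
        detail = ", ".join(f"{cell}:{metric}" for cell, metric in missing)
        message = f"gated metrics without a threshold spec: {detail}"
        if enforce:
            raise ValueError(message)
        warnings.warn(message)
    return missing
```

Metrics outside the gated set are never checked here, which matches the record-only default for new metrics.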

The image was declared twice: top-level image.tag (live -- conftest copied it onto the container block) and container.image (dead -- overwritten by that copy). The duplicate forced a top-level image block whose remote field was unused and whose tag silently shadowed container.image. Make container.image the single source: drop the top-level ImageSpec block from the generic BaseVariantConfig, drop the conftest overwrite so the merged container.image is used as-is, and remove the now-schemaless image block from the example config.

Expand GATED_METRICS from the mean+p99 subset to every emitted latency quantile (mean/median/p90/p95/p99) for ttft, tpot, itl, and e2el -- itl omits p90 as CLIENT_METRICS has no producer for it. Throughput and success_rate/failed health are unchanged. Inputs, totals, secondary throughputs, and derived diagnostics stay record-only. The example threshold file gains a placeholder spec for each newly gated metric (23 total) so the loader gated-coverage check passes.

- utils/AGENTS.md: drop top-level image (now container.image); add max verdict kind - inference/utils/AGENTS.md: document GATED_METRICS contract + dual-axis coverage check - building-a-cvs-test-suite.md: container.image, serve_args map, max kind, GATED_METRICS - dtni-dev-guide.md: SUPERSEDED banner pointing to building guide + AGENTS.md

Reverts commit 4a8425f, restoring the changes from PR #225 on dev/dtni.

Probe python3.13..python3 for import vllm; export BENCH_PY and BENCH_SCRIPT. Use shlex.quote for docker exec bash -c. Align InferenceMax client completion with Serving Benchmark Result or End-to-end Latency.
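A sketch of the two ideas in this commit: quoting a script once for `bash -c` inside `docker exec`, and a shell loop that picks the first interpreter able to import vllm. The helper names and the intermediate interpreter candidates between python3.13 and python3 are assumptions; the commit message only gives the range endpoints.

```python
import shlex


def docker_exec_cmd(container, script):
    """Build a docker exec command whose script survives one shell parse."""
    return f"docker exec {shlex.quote(container)} bash -c {shlex.quote(script)}"


def probe_vllm_python_script(
    candidates=("python3.13", "python3.12", "python3.11", "python3.10", "python3"),
):
    """Shell snippet that prints the first interpreter that can import vllm."""
    names = " ".join(shlex.quote(c) for c in candidates)
    return (
        f"for py in {names}; do "
        'if command -v "$py" >/dev/null 2>&1 && "$py" -c "import vllm" >/dev/null 2>&1; '
        'then echo "$py"; exit 0; fi; done; exit 1'
    )
```

Because the whole script is passed through `shlex.quote`, the embedded double quotes and `$py` expansions reach the container's bash unchanged instead of being interpreted by the host shell.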

Search site-packages and ancestor paths, verify the file is readable, and document vllm[bench] when wheels omit benchmarks/.

Use CVS_GPU_MEMORY_UTIL in sample config and serve script to avoid vLLM unknown-env warnings. Extend default readiness poll budget to 60 and grep full server logs so Uvicorn ready is not missed after long model loads.

Wheels often omit vllm/benchmarks; resolve the driver via eval exports, run python -m vllm.entrypoints.cli.main bench serve when needed, and fail fast on missing-script log patterns in InferenceMax and base polling.

vLLM random workloads scale (ISL+OSL)*(1+r); clamp ratio when it would exceed MML, pass --temperature 0 for greedy parity, and forward --metric-percentiles in InferenceMax and vllm_single clients.
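Taking the commit message's worst-case length (ISL+OSL)*(1+r) at face value, the clamp can be sketched as below. The function name is hypothetical, and raising when even r=0 does not fit is an assumption about the desired behaviour.

```python
def clamp_range_ratio(isl, osl, ratio, max_model_len):
    """Return the largest ratio <= `ratio` with (isl+osl)*(1+r) <= max_model_len."""
    base = isl + osl
    if base <= 0:
        raise ValueError("isl + osl must be positive")
    if ratio < 0:
        raise ValueError("ratio must be non-negative")
    if base > max_model_len:
        # Assumption: no ratio can help when the base length already overflows.
        raise ValueError("isl + osl exceeds max_model_len even at ratio 0")
    ceiling = max_model_len / base - 1.0
    return min(ratio, ceiling)
```

For isl=osl=1024 and a 2560-token limit, a requested ratio of 0.5 is clamped to 0.25, since 2048 * 1.25 = 2560.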

Read client_poll_count and client_poll_wait_time from benchmark_params (defaults 50/60), document them and fix the inferencemax.rst table, and surface the keys in sample MI300X/MI355X configs.
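A bounded poll of this shape gives the client an upper wait of roughly poll count times wait time. The sketch assumes the defaults map to 50 polls with 60 seconds between them; `poll_until` and its injectable `sleep` parameter are hypothetical, not the PR's implementation.

```python
import time


def poll_until(check, poll_count=50, poll_wait_s=60, sleep=time.sleep):
    """Call check() up to poll_count times; return the attempt that succeeded.

    Raises TimeoutError when the budget is exhausted, so a hung client
    cannot block the suite indefinitely.
    """
    for attempt in range(poll_count):
        if check():
            return attempt + 1
        if attempt < poll_count - 1:
            sleep(poll_wait_s)
    raise TimeoutError(
        f"condition not met after {poll_count} polls ({poll_wait_s}s apart)"
    )
```

Passing `sleep` in makes the loop testable without real waiting.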

…st polling Gate benchmark success on Failed requests only after the summary is present; tail more client log lines for InferenceMax. Variant and benchmark_params accept bench_max_failed_requests (default 0 remains strict for CI).

Move InferenceMax loading onto substitute_config and a typed InferenceMaxVariantConfig with legacy adapters for InferenceMaxJob until the driver is ported.

…ase 2) Flatten MI300X and MI355X variant configs to paths/model/container/roles/params/sweep and client.* threshold specs with enforce_thresholds false until recalibrated.

…Phase 2) Use variant_config and legacy adapter fixtures, parametrization from sweep.runs, and unit tests for load_variant and threshold adapters.

…ion 1 (Phase 2) Point loader and threshold docs at inferencemax_config_loader.load_variant and the client.* sweep cell format.

Point run-cvs-tests and dtni-dev-guide at cvs.lib.utils and inference/utils loaders.

Standalone driver uses Python-built vllm serve, vllm bench serve, and artifact parsing. Drop legacy InferenceBaseJob path and factory construction.

…_args (Phase 3) MI300X and MI355X variants drop host-script and bench_serving params in favor of Python serve args.

… (Phase 3) Add model_fetch, test_metric, and new InferenceMaxJob lifecycle. Update conftest and unit tests for typed config.

…ase 3) Document Python serve, client.* metrics, and expanded lifecycle test stages.

Host script staging was dropped when InferenceMaxJob moved to Python-built vllm serve.

InferenceMax and vllm_single build vllm serve in Python; this package remains for InferenceBaseJob paths.

…(Phase 5) Replace legacy config/benchmark_params table with typed blocks and client.* thresholds. Document inferencemax_config_loader in AGENTS.md.

Verify stock results artifact maps to client.* metrics via FakeOrch.

amd-droy · 2026-06-23T19:51:42Z

+    if not orch.verify_containers_running(name):
+        lifecycle.failed = True
+        pytest.fail(f"container {name} not running after setup_containers()")
+    time.sleep(30)


Do you need this 30-second sleep after the test case?

amd-droy · 2026-06-23T19:54:11Z

+
+
+def _du_bytes(orch, path):
+    out = orch.exec(f"bash -c {shlex.quote(f'du -sb {shlex.quote(path)} 2>/dev/null | cut -f1')}")


The 2>/dev/null silently swallows errors, and the function returns 0 on any failure. Since test_model_fetch uses a return value of 0 to detect "no model present," a network failure, a permission error, or a missing du binary would be indistinguishable from a genuinely missing model.

amd-droy · 2026-06-23T19:56:57Z

+        lifecycle.record(request.node.nodeid, "server_ready", time.monotonic() - t)
+        t_client = time.monotonic()
+        job.run_client()
+        job.wait_client_complete()


Better to have a timeout here. If the inference client hangs, this will block indefinitely with no recovery path.

amd-droy · 2026-06-23T19:59:49Z

+        return fp.read().strip()
+
+
+@pytest.fixture(scope="session")


Should the scope be session or module?

anujmittal-amd

Can we add your changes under the new folder structure? InferenceMax has been deprecated; you should create a folder under inference named atom, or reuse the vllm/sglang structure and add atom in that location. Let's discuss this.
Same comment for single/distributed: do we need a separate folder for single/multi?
The dtni folder should be renamed based on the vLLM structure.

Once you are ready for the final merge, please add your end-to-end test results to the PR, along with the original InferenceMax test run, to confirm there are no regressions.

Thanks

hnimra-amd · 2026-06-24T00:57:28Z

> Can we add your changes under the new folder structure? InferenceMax has been deprecated; you should create a folder under inference named atom, or reuse the vllm/sglang structure and add atom in that location. Let's discuss this. Same comment for single/distributed: do we need a separate folder for single/multi? The dtni folder should be renamed based on the vLLM structure.
>
> Once you are ready for the final merge, please add your end-to-end test results to the PR, along with the original InferenceMax test run, to confirm there are no regressions.
>
> Thanks

@anujmittal-amd I addressed the structural feedback on the IX-atom branch. The PR for it is #238, and it builds on the current PR.

Drop the post-launch sleep, fail model fetch on du errors instead of treating them as an empty cache, scope inf_res_dict to module like vllm_single, and document the bounded client poll timeout.
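One way to separate "model absent" from "du failed", in the spirit of this fix, is sketched below. The `run` callable (command string in, `(exit_code, stdout, stderr)` out) stands in for the orchestrator and is an assumption, as is the explicit `test -e` existence check; the PR's actual `_du_bytes` may be structured differently.

```python
import shlex


def du_bytes(run, path):
    """Return the size of `path` in bytes, 0 only when it truly does not exist.

    run: hypothetical callable taking a shell command and returning
    (exit_code, stdout, stderr).
    """
    quoted = shlex.quote(path)
    exists_code, _, _ = run(f"test -e {quoted}")
    if exists_code != 0:
        return 0  # genuinely absent: nothing has been fetched yet
    code, out, err = run(f"du -sb {quoted}")
    if code != 0:
        # Do not hide the failure behind 2>/dev/null; surface it.
        raise RuntimeError(f"du failed for {path}: {err.strip()}")
    size_field = out.split("\t", 1)[0].strip()
    if not size_field.isdigit():
        raise RuntimeError(f"unexpected du output for {path}: {out!r}")
    return int(size_field)
```

A permission error or a missing du binary now raises instead of looking like an empty cache.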

…epseek r1 model

* Restore InferenceMax uplift reverted by #228 Reverts commit 4a8425f, restoring the changes from PR #225 on dev/dtni.
* fix(inference): run vLLM bench client with vLLM interpreter Probe python3.13..python3 for import vllm; export BENCH_PY and BENCH_SCRIPT. Use shlex.quote for docker exec bash -c. Align InferenceMax client completion with Serving Benchmark Result or End-to-end Latency.
* fix(dtni): broaden vLLM benchmark script discovery Search site-packages and ancestor paths, verify the file is readable, and document vllm[bench] when wheels omit benchmarks/.
* fix(inference): harden InferenceMax server startup and GPU mem env Use CVS_GPU_MEMORY_UTIL in sample config and serve script to avoid vLLM unknown-env warnings. Extend default readiness poll budget to 60 and grep full server logs so Uvicorn ready is not missed after long model loads.
* fix(dtni): fall back to vllm bench serve when benchmark script is absent Wheels often omit vllm/benchmarks; resolve the driver via eval exports, run python -m vllm.entrypoints.cli.main bench serve when needed, and fail fast on missing-script log patterns in InferenceMax and base polling.
* fix(dtni): clamp bench random-range to max_model_length vLLM random workloads scale (ISL+OSL)*(1+r); clamp ratio when it would exceed MML, pass --temperature 0 for greedy parity, and forward --metric-percentiles in InferenceMax and vllm_single clients.
* fix(inference): extend InferenceMax bench client poll budget Read client_poll_count and client_poll_wait_time from benchmark_params (defaults 50/60), document them and fix the inferencemax.rst table, and surface the keys in sample MI300X/MI355X configs.
* feat(inference): add bench_max_failed_requests cap and completion-first polling Gate benchmark success on Failed requests only after the summary is present; tail more client log lines for InferenceMax. Variant and benchmark_params accept bench_max_failed_requests (default 0 remains strict for CI).
* feat(inference): add typed InferenceMax config loader (Phase 1) Move InferenceMax loading onto substitute_config and a typed InferenceMaxVariantConfig with legacy adapters for InferenceMaxJob until the driver is ported.
* feat(inference): migrate InferenceMax configs to schema_version 1 (Phase 2) Flatten MI300X and MI355X variant configs to paths/model/container/roles/params/sweep and client.* threshold specs with enforce_thresholds false until recalibrated.
* test(inference): wire inferencemax_single to typed config and sweep (Phase 2) Use variant_config and legacy adapter fixtures, parametrization from sweep.runs, and unit tests for load_variant and threshold adapters.
* docs(inference): update InferenceMax config reference for schema_version 1 (Phase 2) Point loader and threshold docs at inferencemax_config_loader.load_variant and the client.* sweep cell format.
* docs: fix stale dtni.config_loader references (Phase 1 tail) Point run-cvs-tests and dtni-dev-guide at cvs.lib.utils and inference/utils loaders.
* feat(inference): rewrite InferenceMaxJob like VllmJob (Phase 3) Standalone driver uses Python-built vllm serve, vllm bench serve, and artifact parsing. Drop legacy InferenceBaseJob path and factory construction.
* feat(inference): move InferenceMax server flags to roles.server.serve_args (Phase 3) MI300X and MI355X variants drop host-script and bench_serving params in favor of Python serve args.
* test(inference): align inferencemax_single suite with VllmJob pattern (Phase 3) Add model_fetch, test_metric, and new InferenceMaxJob lifecycle. Update conftest and unit tests for typed config.
* docs(inference): update InferenceMax reference for Phase 3 driver (Phase 3) Document Python serve, client.* metrics, and expanded lifecycle test stages.
* chore(inference): remove unused inferencemax_host_scripts (Phase 5) Host script staging was dropped when InferenceMaxJob moved to Python-built vllm serve.
* docs: clarify vllm_benchmark_scripts are legacy-only (Phase 5) InferenceMax and vllm_single build vllm serve in Python; this package remains for InferenceBaseJob paths.
* docs(inference): rewrite InferenceMax reference for schema_version 1 (Phase 5) Replace legacy config/benchmark_params table with typed blocks and client.* thresholds. Document inferencemax_config_loader in AGENTS.md.
* test(inference): add InferenceMaxJob parse_results unit test (Phase 5) Verify stock results artifact maps to client.* metrics via FakeOrch.
* refactor(inference): rename inferencemax_single to inferencex_atom_single Adopt InferenceX ATOM as the framework identity while the suite is still internal. Renames the driver, config loader, pytest suite, variant configs, and documentation to inferencex_atom_single.
* docs(plan): add InferenceX ATOM automation plan (MI300X + MI355X) Align the implementation plan with DTNI Validation Tracker workloads W1-W18, MI300X calibration seeds, and gsm8k accuracy gates. MI300X and MI355X variant dirs ship in parallel; MI355X thresholds stay record-only until lab calibration. Milestone 1 targets ATOM backend plus W1 on both arches.
* docs(plan): add MI355X W1 calibration seeds from ATOM CI Document section 4.3 thresholds from ROCm/ATOM run 27912164002 and align M1 scope for dual-arch W1 calibration.
* feat(inference): Phase 0 ATOM driver and W1 DeepSeek R1 variants Swap InferenceXAtomJob to atom.entrypoints.openai_server and atom.benchmarks.benchmark_serving when params.driver=atom. Add MI300X/MI355X W1 config+threshold dirs and cluster examples for deepseek-ai/DeepSeek-R1-0528.
* feat(inference): complete Phase 0 W1 DeepSeek R1 recipe pins and MTP3 variants Add ix_recipes.json registry, ix_recipe_id/run_card in config loader, MTP3 variant dirs for MI300X/MI355X, copy-config dtni root, and run-card logging in tests.
* feat(inference): add MI300X W1 DeepSeek R1 smoke variant Single-cell smoke config (C=128, 128 prompts) for shorter first lab validation before full atom_perf calibration.
* FIX: underscore-prefixed keys (including _comment) are now stripped from the config dict before Pydantic validation, same as for thresholds.
* Move InferenceX ATOM W1 configs to config_file layout and calibrate MI300X perf gates. Relocate DeepSeek R1 variants from input/dtni to the standard inference config tree, enable enforce_thresholds with lab-calibrated thresholds, and clear stale results.json before each benchmark run.
* docs: replace section symbol with plain Section references. Use readable Section/Sections wording in plans and InferenceX ATOM variant config comments instead of the section sign character.
* feat(inference): calibrate MI355X W1 thresholds from ATOM CI seeds. Apply the same 10% margin as MI300X lab gates to perf and MTP3 variant thresholds from ROCm/ATOM run 27912164002, document copy-config flow in README, and add config-loader unit tests.
* fix(inference): address PR #229 review on inferencex_atom_single suite. Drop the post-launch sleep, fail model fetch on du errors instead of treating them as an empty cache, scope inf_res_dict to module like vllm_single, and document the bounded client poll timeout.
* fix(inference): unbreak W1 Phase A threshold checks for ATOM artifacts. Pin DeepSeek W1 container names and derive failed/success_rate when ATOM omits failed; skip threshold enforcement for metrics the benchmark did not emit.
* docs(plan): MI355X lab pending without blocking MI300X spine. Add Section 1.2 hardware policy, revise M1/Phase A exit criteria, and update milestone diagrams so MI355X confirmation is optional until nodes are available.
* docs(plan): refresh IX-atom plan with accuracy, metrics, and CVS backlog. Align branch state and phases with ATOM driver reality, add accuracy test catalog, metric tiers, and platform enhancements; update W1 README lab notes.
* docs(plan): add Section 12 coverage for variants, parity frameworks, and metrics. Document perf variant modes, workload-specific accuracy tests, inferencex_atom_vllm/sglang parity suites, supplemental and MTP metrics, and CI compare keys.
* align: restore config_loader threshold discovery after dev/dtni rebase Re-apply merge resolutions from PR #233 alignment (dual threshold layout, vllm_single imports) that were lost when replaying commits onto e47df5a.
* refactor(inference): deprecate InferenceMax legacy factory paths Route inferencemax_repo and framework=inferencemax to inferencex_atom with a warning. InferenceMax host jobs remain a placeholder that points callers at inferencex_atom_single.
* refactor(inference): extract shared threshold sweep validation Add validate_thresholds_cover_sweep() for reuse by vllm_single and inferencex_atom loaders. Optional gated_metrics parameter allows framework-specific SLO sets.
* refactor(inferencex): flatten configs and adopt vllm_single naming Rename W1 and GPT-OSS variant stems to {gpu}_inferencex-atom-single_{model}_{precision}[_{mode}] and remove per-variant subfolders. Update README and docs with new copy-config paths and W1 threshold gates for per-GPU throughput and tail latencies.
* feat(inferencex): gate W1 per-GPU and tail latency metrics on ATOM path Add inferencex_atom_parsing with IX-specific GATED_METRICS (per_gpu_throughput, output_tput_per_gpu) without changing vllm_single. Wire InferenceXAtomJob and the suite to atom parsing; default metric_percentiles to 95,99 for p95 TPOT and p99 TTFT.
* docs(inferencex): document ATOM-specific parsing vs vllm_single Clarify that W1 GATED_METRICS and output_tput_per_gpu live in inferencex_atom_parsing so vllm_single stays untouched until parity.
* docs(inferencex): simplify variant README for lab users Drop vllm_single comparisons and internal threshold tables; keep naming pattern, variant list, and copy/run commands.
* chore(utils): silence cluster placeholder resolution logs
* test(config): align config_loader tests with sibling threshold discovery
* feat(inferencex): add W1 metric tiers for tiered threshold gates
* feat(inferencex): reuse server across sweep cells and config-driven waits
* feat(inferencex): replace per-metric tests with tiered test_cell_metrics
* chore(inferencex): enable server reuse on perf configs and shorter smoke waits
* fix(inferencex): align cluster container names with variant configs
* fix(inferencex): tighten W1 perf health gates when enforcing thresholds
* test(inferencex): add GATED_METRICS parity and health gate coverage tests
* test(inferencex): add GATED_METRICS parity and health gate coverage tests
* docs(inferencex): update plan and README for flat layout and tiered gates
* docs(inferencex): update plan and README for flat layout and tiered gates
* chore(inferencex): trim expand_sweep docstring
* docs(inferencex): clarify lab layout, launcher host, and results paths Document per-variant ~/input subdirs to avoid ambiguous threshold discovery, remote launcher vs GPU node prerequisites, and ~/cvs_results output paths.
* docs(plan): prioritize multi-node as M5 after framework parity Elevate scaling to P1 milestone M5 immediately after M4 parity when hardware and suite recipes support nnodes>1; defer MTP+P2 widen to M6.
* fix(inferencex): align W1 tpot tier with ATOM bench output Gate p99_tpot_ms instead of absent p95_tpot_ms, skip missing tier metrics in actuals, and recalibrate MI300X perf thresholds from the 2026-06-25 lab run.
* chore(inferencex): use portable W1 perf thresholds on MI300X Replace per-node calibrated gates with conservative throughput floors and loose latency caps so healthy runs pass across lab nodes without recalibration.
* test(inferencex): cover parse_results errors and client log failure paths
* fix(inferencex): detect ATOM server early failures during wait_ready
* refactor(inferencex): extract sweep reuse helpers and safer collection defaults
* feat(inferencex): add explicit threshold_json paths to variant configs
* chore(inferencex): polish conftest docs and simplify CLIENT_METRICS build
* refactor(inferencex): inline atom_args and remove ix_recipe indirection
* docs(plan): sync IX atom plan with inline atom_args config layout

hnimra-amd requested review from amd-droy, anujmittal-amd and atnair-amd June 18, 2026 06:55

atnair-amd and others added 27 commits June 18, 2026 21:07

refactor(inference): move vllm_parsing into cvs/lib/inference/utils

c0e7413

The client.* metric parser (to_client_metrics + CLIENT_METRICS surface) is inference-specific and should not sit in the shared utils dir. Move it under a new cvs/lib/inference/utils package. Content unchanged.

docs: demote headings so GitHub stops underlining sections

00c9cdc

removing old plan

94d2081

test(inference): drop unused _fake_variant parameter

6433f86

goodput_slo_unused was never read (goodput is threaded through _make_job).

feat(verdict): add unit-agnostic max threshold kind

f217c9d

The only ceiling kind was max_ms, whose message hard-codes ms. A count metric like client.failed needs an upper bound without the unit lie; add a plain max with the same comparison and an honest message.

Restore InferenceMax uplift reverted by #228

b553fcc

Reverts commit 4a8425f, restoring the changes from PR #225 on dev/dtni.

fix(inference): run vLLM bench client with vLLM interpreter

1f1ffdc

Probe python3.13..python3 for import vllm; export BENCH_PY and BENCH_SCRIPT. Use shlex.quote for docker exec bash -c. Align InferenceMax client completion with Serving Benchmark Result or End-to-end Latency.

hnimra-amd added 6 commits June 23, 2026 10:55

fix(dtni): broaden vLLM benchmark script discovery

81cadd3

Search site-packages and ancestor paths, verify the file is readable, and document vllm[bench] when wheels omit benchmarks/.

fix(inference): harden InferenceMax server startup and GPU mem env

abe838c

Use CVS_GPU_MEMORY_UTIL in sample config and serve script to avoid vLLM unknown-env warnings. Extend default readiness poll budget to 60 and grep full server logs so Uvicorn ready is not missed after long model loads.

fix(dtni): clamp bench random-range to max_model_length

2ac3957

vLLM random workloads scale (ISL+OSL)*(1+r); clamp ratio when it would exceed MML, pass --temperature 0 for greedy parity, and forward --metric-percentiles in InferenceMax and vllm_single clients.

fix(inference): extend InferenceMax bench client poll budget

7203562

Read client_poll_count and client_poll_wait_time from benchmark_params (defaults 50/60), document them and fix the inferencemax.rst table, and surface the keys in sample MI300X/MI355X configs.

hnimra-amd force-pushed the hnimrama/inferencemax-uplift-restore branch from 7e9bc30 to 730b408 June 23, 2026 18:36

hnimra-amd added 13 commits June 23, 2026 11:45

feat(inference): add typed InferenceMax config loader (Phase 1)

65b5ec4

Move InferenceMax loading onto substitute_config and a typed InferenceMaxVariantConfig with legacy adapters for InferenceMaxJob until the driver is ported.

feat(inference): migrate InferenceMax configs to schema_version 1 (Phase 2)

718e066

Flatten MI300X and MI355X variant configs to paths/model/container/roles/params/sweep and client.* threshold specs with enforce_thresholds false until recalibrated.

test(inference): wire inferencemax_single to typed config and sweep (Phase 2)

6062312

Use variant_config and legacy adapter fixtures, parametrization from sweep.runs, and unit tests for load_variant and threshold adapters.

docs(inference): update InferenceMax config reference for schema_version 1 (Phase 2)

26aba03

Point loader and threshold docs at inferencemax_config_loader.load_variant and the client.* sweep cell format.

docs: fix stale dtni.config_loader references (Phase 1 tail)

44bb053

Point run-cvs-tests and dtni-dev-guide at cvs.lib.utils and inference/utils loaders.

feat(inference): rewrite InferenceMaxJob like VllmJob (Phase 3)

246250b

Standalone driver uses Python-built vllm serve, vllm bench serve, and artifact parsing. Drop legacy InferenceBaseJob path and factory construction.

feat(inference): move InferenceMax server flags to roles.server.serve_args (Phase 3)

79969ba

MI300X and MI355X variants drop host-script and bench_serving params in favor of Python serve args.

test(inference): align inferencemax_single suite with VllmJob pattern (Phase 3)

8bf942d

Add model_fetch, test_metric, and new InferenceMaxJob lifecycle. Update conftest and unit tests for typed config.

docs(inference): update InferenceMax reference for Phase 3 driver (Phase 3)

17b902f

Document Python serve, client.* metrics, and expanded lifecycle test stages.

chore(inference): remove unused inferencemax_host_scripts (Phase 5)

3838d98

Host script staging was dropped when InferenceMaxJob moved to Python-built vllm serve.

docs: clarify vllm_benchmark_scripts are legacy-only (Phase 5)

cdeef6a

InferenceMax and vllm_single build vllm serve in Python; this package remains for InferenceBaseJob paths.

docs(inference): rewrite InferenceMax reference for schema_version 1 (Phase 5)

ba93dfb

Replace legacy config/benchmark_params table with typed blocks and client.* thresholds. Document inferencemax_config_loader in AGENTS.md.

test(inference): add InferenceMaxJob parse_results unit test (Phase 5)

0e39316

Verify stock results artifact maps to client.* metrics via FakeOrch.

amd-droy reviewed Jun 23, 2026

anujmittal-amd requested changes Jun 23, 2026

Hnimrama/inferencemax uplift restore #229

hnimra-amd wants to merge 46 commits into dev/dtni from hnimrama/inferencemax-uplift-restore
