Skip to content

fix(#543): coerce simple_bench CSV numeric columns; drop bad rows instead of crashing#561

Open
FileSystemGuy wants to merge 1 commit into
mainfrom
fix/issue-543-simple-bench-csv-coercion
Open

fix(#543): coerce simple_bench CSV numeric columns; drop bad rows instead of crashing#561
FileSystemGuy wants to merge 1 commit into
mainfrom
fix/issue-543-simple-bench-csv-coercion

Conversation

@FileSystemGuy

Copy link
Copy Markdown
Contributor

Fixes #543.

Summary

vdbbench.simple_bench.calculate_statistics() was crashing after the timed VectorDB benchmark had already completed:

```
TypeError: unsupported operand type(s) for +: 'float' and 'str'
at (all_data["timestamp"] + all_data["batch_time_seconds"]).max()
```

The reporter diagnosed the root cause: a small number of rows in two per-process CSVs (milvus_benchmark_p2.csv, milvus_benchmark_p8.csv) carried a batch_time_seconds value of the string True instead of a float duration. Once any value in a numeric column is non-numeric, pandas demotes the whole column to object dtype, and the arithmetic on (timestamp + batch_time_seconds).max() fails. The 60s benchmark completes, ground-truth runs, all 16 worker processes write their CSVs — then the wrapper marks the whole run as failed during the post-run statistics computation.

This PR hardens the reader so that result-data corruption in a handful of rows out of millions no longer destroys the whole run. Writer-side root-causing (why a row occasionally has shifted/missing fields) is left as a follow-up — the reader-side defense is enough to recover the benchmark result so long as the malformed rows are a small fraction.

Fix

In vdb_benchmark/vdbbench/simple_bench.py::calculate_statistics:

  1. Tag every loaded row with its source CSV name so the diagnostic identifies affected workers.
  2. Coerce timestamp, batch_size, batch_time_seconds, avg_query_time_seconds with pd.to_numeric(..., errors="coerce") before any arithmetic.
  3. Identify rows where any required numeric column became NaN; log a clear warning naming the dropped count and affected file(s); drop those rows.
  4. Continue with the remaining good rows. If every row was bad, fall through to the existing "No valid data found" error rather than raising.

The diagnostic line, when triggered, looks like:

```
Warning: dropping 2 row(s) with non-numeric values in required columns.
Affected file(s): milvus_benchmark_p2.csv, milvus_benchmark_p8.csv.
See #543.
```

Tests

New file: vdb_benchmark/tests/tests/test_issue_543_csv_coercion.py — 8 tests:

Test What it covers
test_calculate_statistics_recovers_from_reporter_malformed_row Exact 16-worker scenario from the issue body — 14 clean CSVs, 2 with one malformed row each in the reporter's exact shape (batch_time_seconds=True, leading + trailing columns empty). Asserts no crash and stats returned.
test_calculate_statistics_drops_rows_with_non_numeric_values[…] × 5 Parametrized — any one non-numeric value in any of the four numeric columns (timestamp, batch_time_seconds, batch_size, avg_query_time_seconds) must not crash.
test_calculate_statistics_returns_error_when_every_row_is_malformed Edge case: all rows malformed returns the existing error key, not a raise.
test_calculate_statistics_unchanged_for_clean_csvs Sanity: clean CSVs still produce a complete stats dict (happy-path regression check).

Test plan

  • `pytest vdb_benchmark/tests/tests/test_issue_543_csv_coercion.py` — 8/8 pass
  • `pytest vdb_benchmark/tests/` — 144/144 pass (no regressions in the wider vdb suite)

…tead of crashing

`vdbbench.simple_bench.calculate_statistics()` was crashing after the
timed VectorDB benchmark had already completed:

    TypeError: unsupported operand type(s) for +: 'float' and 'str'
        at (all_data["timestamp"] + all_data["batch_time_seconds"]).max()

The reporter (issue #543) diagnosed the root cause: a small number of
rows in two per-process CSVs (`milvus_benchmark_p2.csv`,
`milvus_benchmark_p8.csv`) carried a `batch_time_seconds` value of the
string `True` instead of a float duration. Once any value in a numeric
column is non-numeric, pandas demotes the whole column to object
dtype, so the arithmetic on `(timestamp + batch_time_seconds).max()`
fails with TypeError. The 60s benchmark completes, ground-truth runs,
all 16 worker processes write their CSVs — then the wrapper marks
the whole run as failed during the post-run statistics computation.

This commit hardens the reader. Writer-side root-causing (why a row
occasionally has shifted/missing fields) is left as follow-up; the
reader-side defense is enough to recover the benchmark result so long
as the malformed rows are a small fraction of the total.

Fix:

* Tag every loaded row with its source CSV name so the diagnostic can
  identify affected workers.
* Coerce `timestamp`, `batch_size`, `batch_time_seconds`,
  `avg_query_time_seconds` with `pd.to_numeric(..., errors="coerce")`
  before any arithmetic.
* Identify rows where any required numeric column became NaN, log a
  clear warning naming the dropped count and affected file(s), then
  drop those rows.
* Continue with the remaining good rows. If every row was bad, fall
  through to the existing "No valid data found" error rather than
  raising.

Tests (`vdb_benchmark/tests/tests/test_issue_543_csv_coercion.py`):

* Reproduces the reporter's exact 16-worker scenario (14 clean CSVs,
  2 with one malformed row each in the exact shape from the issue
  body) and asserts `calculate_statistics` returns valid stats.
* Parametrized: any one non-numeric value in any of the four numeric
  columns must not crash.
* Edge case: all rows malformed returns the existing error key, not a
  raise.
* Sanity: clean CSVs still produce a complete stats dict (no
  regression of the happy path).
@FileSystemGuy FileSystemGuy requested a review from a team June 26, 2026 23:41
@github-actions

Copy link
Copy Markdown

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

1 participant