fix(#543): coerce simple_bench CSV numeric columns; drop bad rows instead of crashing by FileSystemGuy · Pull Request #561 · mlcommons/storage

FileSystemGuy · 2026-06-26T23:41:39Z

Fixes #543.

Summary

vdbbench.simple_bench.calculate_statistics() was crashing after the timed VectorDB benchmark had already completed:

```
TypeError: unsupported operand type(s) for +: 'float' and 'str'
at (all_data["timestamp"] + all_data["batch_time_seconds"]).max()
```

The reporter diagnosed the root cause: a small number of rows in two per-process CSVs (milvus_benchmark_p2.csv, milvus_benchmark_p8.csv) carried a batch_time_seconds value of the string True instead of a float duration. Once any value in a numeric column is non-numeric, pandas demotes the whole column to object dtype, and the arithmetic on (timestamp + batch_time_seconds).max() fails. The 60s benchmark completes, ground-truth runs, all 16 worker processes write their CSVs — then the wrapper marks the whole run as failed during the post-run statistics computation.

This PR hardens the reader so that result-data corruption in a handful of rows out of millions no longer destroys the whole run. Writer-side root-causing (why a row occasionally has shifted/missing fields) is left as a follow-up — the reader-side defense is enough to recover the benchmark result so long as the malformed rows are a small fraction.

Fix

In vdb_benchmark/vdbbench/simple_bench.py::calculate_statistics:

Tag every loaded row with its source CSV name so the diagnostic identifies affected workers.
Coerce timestamp, batch_size, batch_time_seconds, avg_query_time_seconds with pd.to_numeric(..., errors="coerce") before any arithmetic.
Identify rows where any required numeric column became NaN; log a clear warning naming the dropped count and affected file(s); drop those rows.
Continue with the remaining good rows. If every row was bad, fall through to the existing "No valid data found" error rather than raising.

The diagnostic line, when triggered, looks like:

```
Warning: dropping 2 row(s) with non-numeric values in required columns.
Affected file(s): milvus_benchmark_p2.csv, milvus_benchmark_p8.csv.
See #543.
```

Tests

New file: vdb_benchmark/tests/tests/test_issue_543_csv_coercion.py — 8 tests:

Test	What it covers
`test_calculate_statistics_recovers_from_reporter_malformed_row`	Exact 16-worker scenario from the issue body — 14 clean CSVs, 2 with one malformed row each in the reporter's exact shape (`batch_time_seconds=True`, leading + trailing columns empty). Asserts no crash and stats returned.
`test_calculate_statistics_drops_rows_with_non_numeric_values[…]` × 5	Parametrized — any one non-numeric value in any of the four numeric columns (`timestamp`, `batch_time_seconds`, `batch_size`, `avg_query_time_seconds`) must not crash.
`test_calculate_statistics_returns_error_when_every_row_is_malformed`	Edge case: all rows malformed returns the existing error key, not a raise.
`test_calculate_statistics_unchanged_for_clean_csvs`	Sanity: clean CSVs still produce a complete stats dict (happy-path regression check).

Test plan

`pytest vdb_benchmark/tests/tests/test_issue_543_csv_coercion.py` — 8/8 pass
`pytest vdb_benchmark/tests/` — 144/144 pass (no regressions in the wider vdb suite)

…tead of crashing `vdbbench.simple_bench.calculate_statistics()` was crashing after the timed VectorDB benchmark had already completed: TypeError: unsupported operand type(s) for +: 'float' and 'str' at (all_data["timestamp"] + all_data["batch_time_seconds"]).max() The reporter (issue #543) diagnosed the root cause: a small number of rows in two per-process CSVs (`milvus_benchmark_p2.csv`, `milvus_benchmark_p8.csv`) carried a `batch_time_seconds` value of the string `True` instead of a float duration. Once any value in a numeric column is non-numeric, pandas demotes the whole column to object dtype, so the arithmetic on `(timestamp + batch_time_seconds).max()` fails with TypeError. The 60s benchmark completes, ground-truth runs, all 16 worker processes write their CSVs — then the wrapper marks the whole run as failed during the post-run statistics computation. This commit hardens the reader. Writer-side root-causing (why a row occasionally has shifted/missing fields) is left as follow-up; the reader-side defense is enough to recover the benchmark result so long as the malformed rows are a small fraction of the total. Fix: * Tag every loaded row with its source CSV name so the diagnostic can identify affected workers. * Coerce `timestamp`, `batch_size`, `batch_time_seconds`, `avg_query_time_seconds` with `pd.to_numeric(..., errors="coerce")` before any arithmetic. * Identify rows where any required numeric column became NaN, log a clear warning naming the dropped count and affected file(s), then drop those rows. * Continue with the remaining good rows. If every row was bad, fall through to the existing "No valid data found" error rather than raising. Tests (`vdb_benchmark/tests/tests/test_issue_543_csv_coercion.py`): * Reproduces the reporter's exact 16-worker scenario (14 clean CSVs, 2 with one malformed row each in the exact shape from the issue body) and asserts `calculate_statistics` returns valid stats. * Parametrized: any one non-numeric value in any of the four numeric columns must not crash. * Edge case: all rows malformed returns the existing error key, not a raise. * Sanity: clean CSVs still produce a complete stats dict (no regression of the happy path).

github-actions · 2026-06-26T23:41:47Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

FileSystemGuy requested a review from a team June 26, 2026 23:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(#543): coerce simple_bench CSV numeric columns; drop bad rows instead of crashing#561

fix(#543): coerce simple_bench CSV numeric columns; drop bad rows instead of crashing#561
FileSystemGuy wants to merge 1 commit into
mainfrom
fix/issue-543-simple-bench-csv-coercion

FileSystemGuy commented Jun 26, 2026

Uh oh!

github-actions Bot commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

FileSystemGuy commented Jun 26, 2026

Summary

Fix

Tests

Test plan

Uh oh!

github-actions Bot commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant