fix(#543): coerce simple_bench CSV numeric columns; drop bad rows instead of crashing#561
Open
FileSystemGuy wants to merge 1 commit into
Open
fix(#543): coerce simple_bench CSV numeric columns; drop bad rows instead of crashing#561FileSystemGuy wants to merge 1 commit into
FileSystemGuy wants to merge 1 commit into
Conversation
…tead of crashing
`vdbbench.simple_bench.calculate_statistics()` was crashing after the
timed VectorDB benchmark had already completed:
TypeError: unsupported operand type(s) for +: 'float' and 'str'
at (all_data["timestamp"] + all_data["batch_time_seconds"]).max()
The reporter (issue #543) diagnosed the root cause: a small number of
rows in two per-process CSVs (`milvus_benchmark_p2.csv`,
`milvus_benchmark_p8.csv`) carried a `batch_time_seconds` value of the
string `True` instead of a float duration. Once any value in a numeric
column is non-numeric, pandas demotes the whole column to object
dtype, so the arithmetic on `(timestamp + batch_time_seconds).max()`
fails with TypeError. The 60s benchmark completes, ground-truth runs,
all 16 worker processes write their CSVs — then the wrapper marks
the whole run as failed during the post-run statistics computation.
This commit hardens the reader. Writer-side root-causing (why a row
occasionally has shifted/missing fields) is left as follow-up; the
reader-side defense is enough to recover the benchmark result so long
as the malformed rows are a small fraction of the total.
Fix:
* Tag every loaded row with its source CSV name so the diagnostic can
identify affected workers.
* Coerce `timestamp`, `batch_size`, `batch_time_seconds`,
`avg_query_time_seconds` with `pd.to_numeric(..., errors="coerce")`
before any arithmetic.
* Identify rows where any required numeric column became NaN, log a
clear warning naming the dropped count and affected file(s), then
drop those rows.
* Continue with the remaining good rows. If every row was bad, fall
through to the existing "No valid data found" error rather than
raising.
Tests (`vdb_benchmark/tests/tests/test_issue_543_csv_coercion.py`):
* Reproduces the reporter's exact 16-worker scenario (14 clean CSVs,
2 with one malformed row each in the exact shape from the issue
body) and asserts `calculate_statistics` returns valid stats.
* Parametrized: any one non-numeric value in any of the four numeric
columns must not crash.
* Edge case: all rows malformed returns the existing error key, not a
raise.
* Sanity: clean CSVs still produce a complete stats dict (no
regression of the happy path).
|
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #543.
Summary
vdbbench.simple_bench.calculate_statistics()was crashing after the timed VectorDB benchmark had already completed:```
TypeError: unsupported operand type(s) for +: 'float' and 'str'
at (all_data["timestamp"] + all_data["batch_time_seconds"]).max()
```
The reporter diagnosed the root cause: a small number of rows in two per-process CSVs (
milvus_benchmark_p2.csv,milvus_benchmark_p8.csv) carried abatch_time_secondsvalue of the stringTrueinstead of a float duration. Once any value in a numeric column is non-numeric, pandas demotes the whole column toobjectdtype, and the arithmetic on(timestamp + batch_time_seconds).max()fails. The 60s benchmark completes, ground-truth runs, all 16 worker processes write their CSVs — then the wrapper marks the whole run as failed during the post-run statistics computation.This PR hardens the reader so that result-data corruption in a handful of rows out of millions no longer destroys the whole run. Writer-side root-causing (why a row occasionally has shifted/missing fields) is left as a follow-up — the reader-side defense is enough to recover the benchmark result so long as the malformed rows are a small fraction.
Fix
In
vdb_benchmark/vdbbench/simple_bench.py::calculate_statistics:timestamp,batch_size,batch_time_seconds,avg_query_time_secondswithpd.to_numeric(..., errors="coerce")before any arithmetic."No valid data found"error rather than raising.The diagnostic line, when triggered, looks like:
```
Warning: dropping 2 row(s) with non-numeric values in required columns.
Affected file(s): milvus_benchmark_p2.csv, milvus_benchmark_p8.csv.
See #543.
```
Tests
New file:
vdb_benchmark/tests/tests/test_issue_543_csv_coercion.py— 8 tests:test_calculate_statistics_recovers_from_reporter_malformed_rowbatch_time_seconds=True, leading + trailing columns empty). Asserts no crash and stats returned.test_calculate_statistics_drops_rows_with_non_numeric_values[…]× 5timestamp,batch_time_seconds,batch_size,avg_query_time_seconds) must not crash.test_calculate_statistics_returns_error_when_every_row_is_malformedtest_calculate_statistics_unchanged_for_clean_csvsTest plan