Skip to content

[PoC]: Yet another implementation of PARQUET-2249: Introduce IEEE 754 total order#9619

Draft
etseidl wants to merge 9 commits intoapache:mainfrom
etseidl:total_order_514
Draft

[PoC]: Yet another implementation of PARQUET-2249: Introduce IEEE 754 total order#9619
etseidl wants to merge 9 commits intoapache:mainfrom
etseidl:total_order_514

Conversation

@etseidl
Copy link
Copy Markdown
Contributor

@etseidl etseidl commented Mar 26, 2026

Which issue does this PR close?

Rationale for this change

This takes the implementation done by @Xuanwo (#8158) and updates it to the new thrift format and recent changes to the original proposal (apache/parquet-format#514).

What changes are included in this PR?

Adds needed thrift structures as well as NaN counts for pages and column chunks.

Are these changes tested?

Yes, new tests added (more may be needed).

Are there any user-facing changes?

Yes, this is a breaking change.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Mar 26, 2026
@etseidl etseidl added the api-change Changes to the arrow API label Mar 26, 2026
// For floating point we need to compare NaN values until we encounter a non-NaN
// value which then becomes the new min/max. After this, only non-NaN values are
// evaluated. If all values are NaN, then the min/max NaNs as determined by
// IEEE 754 total order are returned.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has me a bit worried. I need to do some benchmarking to make sure all the complicated NaN logic isn't killing performance.

Copy link
Copy Markdown
Contributor

@jhorstmann jhorstmann Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, this method is on the hot path. I had a look at optimizing it, but could not get the compiler to generate nice auto-vectorized code for nan-handling yet. I think we can try optimizations in a followup, it would be more important to get the semantics correct first and make sure there are tests for edge cases.

In that regard, about this requirement

If all values are NaN, then the min/max NaNs as determined by
// IEEE 754 total order are returned.

Does the current code correctly distinguish different NaN payloads according to their sign and bit patterns?

(Solved, github was hiding the changes to compare_greater in mod.rs)


fn update_min<T: ParquetValueType>(descr: &ColumnDescriptor, val: &T, min: &mut Option<T>) {
update_stat::<T, _>(descr, val, min, |cur| compare_greater(descr, cur, val))
if min.is_none() {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes here and to update_max also worry me. It's just complicated because we can't simply exclude NaN. In the case of all-NaN we have to properly order them so we get the min and max NaN, but if a non-NaN shows up we have to start over. Same thing in get_min_max().

This is why I prefer the other solution of simply using total order and dealing with the possibility of NaN in the statistics.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api-change Changes to the arrow API parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Parquet] Prototype: PARQUET-2249: Introduce IEEE 754 total order & NaN-counts #514

2 participants