[PoC]: Yet another implementation of PARQUET-2249: Introduce IEEE 754 total order#9619
[PoC]: Yet another implementation of PARQUET-2249: Introduce IEEE 754 total order#9619etseidl wants to merge 9 commits intoapache:mainfrom
Conversation
| // For floating point we need to compare NaN values until we encounter a non-NaN | ||
| // value which then becomes the new min/max. After this, only non-NaN values are | ||
| // evaluated. If all values are NaN, then the min/max NaNs as determined by | ||
| // IEEE 754 total order are returned. |
There was a problem hiding this comment.
This has me a bit worried. I need to do some benchmarking to make sure all the complicated NaN logic isn't killing performance.
There was a problem hiding this comment.
Agree, this method is on the hot path. I had a look at optimizing it, but could not get the compiler to generate nice auto-vectorized code for nan-handling yet. I think we can try optimizations in a followup, it would be more important to get the semantics correct first and make sure there are tests for edge cases.
If all values are NaN, then the min/max NaNs as determined by
// IEEE 754 total order are returned.
Does the current code correctly distinguish different NaN payloads according to their sign and bit patterns?
(Solved, github was hiding the changes to compare_greater in mod.rs)
|
|
||
| fn update_min<T: ParquetValueType>(descr: &ColumnDescriptor, val: &T, min: &mut Option<T>) { | ||
| update_stat::<T, _>(descr, val, min, |cur| compare_greater(descr, cur, val)) | ||
| if min.is_none() { |
There was a problem hiding this comment.
Changes here and to update_max also worry me. It's just complicated because we can't simply exclude NaN. In the case of all-NaN we have to properly order them so we get the min and max NaN, but if a non-NaN shows up we have to start over. Same thing in get_min_max().
This is why I prefer the other solution of simply using total order and dealing with the possibility of NaN in the statistics.
Which issue does this PR close?
Rationale for this change
This takes the implementation done by @Xuanwo (#8158) and updates it to the new thrift format and recent changes to the original proposal (apache/parquet-format#514).
What changes are included in this PR?
Adds needed thrift structures as well as NaN counts for pages and column chunks.
Are these changes tested?
Yes, new tests added (more may be needed).
Are there any user-facing changes?
Yes, this is a breaking change.