diff --git a/LogicalTypes.md b/LogicalTypes.md
index 78fdf293..820320dc 100644
--- a/LogicalTypes.md
+++ b/LogicalTypes.md
@@ -254,7 +254,10 @@ Used in contexts where precision is traded off for smaller footprint and potenti
 
 The primitive type is a 2-byte `FIXED_LEN_BYTE_ARRAY`.
 
-The sort order for `FLOAT16` is signed (with special handling of NANs and signed zeros); it uses the same [logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and `DOUBLE`.
+Like `FLOAT` and `DOUBLE`, the sort order for `FLOAT16` is signed with special
+handling for NaNs and signed zeros. Writers should use IEEE754TotalOrder for
+consistent handling of these edge cases. See the `ColumnOrder` union in the
+[Thrift definition](src/main/thrift/parquet.thrift) for details.
 
 ## Temporal Types
 
diff --git a/README.md b/README.md
index d3482093..d398ac4f 100644
--- a/README.md
+++ b/README.md
@@ -158,7 +158,9 @@ documented in [LogicalTypes.md][logical-types].
 Parquet stores min/max statistics at several levels (such as Column Chunk,
 Column Index, and Data Page). These statistics are according to a sort order,
 which is defined for each column in the file footer. Parquet supports common
-sort orders for logical and primitive types. The details are documented in the
+sort orders for logical and primitive types and also special orders for types
+with potentially ambiguous semantics (e.g., NaN ordering for floating point
+types). The details are documented in the
 [Thrift definition](src/main/thrift/parquet.thrift) in the `ColumnOrder` union.
 
 ## Nested Encoding
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 883264c3..a93425f9 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -309,6 +309,13 @@ struct Statistics {
    7: optional bool is_max_value_exact;
    /** If true, min_value is the actual minimum value for a column */
    8: optional bool is_min_value_exact;
+   /**
+    * Count of NaN values in the column; only present if physical type is FLOAT
+    * or DOUBLE, or logical type is FLOAT16.
+    * If this field is not present, readers MUST assume NaNs may be present
+    * (i.e. MUST assume nan_count > 0 and MUST NOT assume nan_count == 0).
+    */
+   9: optional i64 nan_count;
 }
 
 /** Empty structs to use as logical type annotations */
@@ -1050,6 +1057,9 @@ struct RowGroup {
 /** Empty struct to signal the order defined by the physical or logical type */
 struct TypeDefinedOrder {}
 
+/** Empty struct to signal IEEE 754 total order for floating point types */
+struct IEEE754TotalOrder {}
+
 /**
  * Union to specify the order used for the min_value and max_value fields for a
  * column. This union takes the role of an enhanced enum that allows rich
@@ -1058,6 +1068,7 @@ struct TypeDefinedOrder {}
  * Possible values are:
  * * TypeDefinedOrder - the column uses the order defined by its logical or
  *                      physical type (if there is no logical type).
+ * * IEEE754TotalOrder - the floating point column uses IEEE 754 total order.
  *
  * If the reader does not support the value of this union, min and max stats
  * for this column should be ignored.
@@ -1111,23 +1122,99 @@ union ColumnOrder {
    *     64-bit signed integer (nanos)
    *     See https://github.com/apache/parquet-format/issues/502 for more details
    *
-   * (*) Because the sorting order is not specified properly for floating
-   *     point values (relations vs. total ordering) the following
+   * (*) Because TYPE_ORDER is ambiguous for floating point types due to
+   *     underspecified handling of NaN and -0/+0, it is recommended that writers
+   *     use IEEE_754_TOTAL_ORDER for these types.
+   *
+   *     If TYPE_ORDER is used for floating point types, then the following
    *     compatibility rules should be applied when reading statistics:
    *     - If the min is a NaN, it should be ignored.
    *     - If the max is a NaN, it should be ignored.
+   *     - If the nan_count field is set, a reader can compute
+   *       nan_count + null_count == num_values to deduce whether all non-null
+   *       values are NaN.
    *     - If the min is +0, the row group may contain -0 values as well.
    *     - If the max is -0, the row group may contain +0 values as well.
    *     - When looking for NaN values, min and max should be ignored.
+   *       If the nan_count field is set, it can be used to check whether
+   *       NaNs are present.
    *
    *     When writing statistics the following rules should be followed:
-   *     - NaNs should not be written to min or max statistics fields.
+   *     - Always set the nan_count field for floating point types, even if
+   *       it is zero.
+   *     - NaNs should not be written to min or max statistics fields except
+   *       in the column index when a page contains only NaN values. In this
+   *       case, since min_values and max_values are required, a NaN value
+   *       must be written.
    *     - If the computed max value is zero (whether negative or positive),
    *       `+0.0` should be written into the max statistics field.
    *     - If the computed min value is zero (whether negative or positive),
    *       `-0.0` should be written into the min statistics field.
    */
   1: TypeDefinedOrder TYPE_ORDER;
+
+  /*
+   * The floating point type is ordered according to the totalOrder predicate,
+   * as defined in section 5.10 of IEEE-754 (2008 revision). Only columns of
+   * physical type FLOAT or DOUBLE, or logical type FLOAT16 may use this ordering.
+   *
+   * Intuitively, this orders floats mathematically, but defines -0 to be less
+   * than +0, -NaN to be less than anything else, and +NaN to be greater than
+   * anything else. It also defines an order between different bit representations
+   * of the same value.
+   *
+   * The formal definition is as follows:
+   *   a) If x<y, totalOrder(x, y) is true.
+   *   b) If x>y, totalOrder(x, y) is false.
+   *   c) If x=y:
+   *     1) totalOrder(−0, +0) is true.
+   *     2) totalOrder(+0, −0) is false.
+   *     3) If x and y represent the same floating-point datum:
+   *       i) If x and y have negative sign, totalOrder(x, y) is true if and
+   *          only if the exponent of x ≥ the exponent of y
+   *       ii) otherwise totalOrder(x, y) is true if and only if the exponent
+   *           of x ≤ the exponent of y.
+   *   d) If x and y are unordered numerically because x or y is NaN:
+   *     1) totalOrder(−NaN, y) is true where −NaN represents a NaN with
+   *        negative sign bit and y is a non-NaN floating-point number.
+   *     2) totalOrder(x, +NaN) is true where +NaN represents a NaN with
+   *        positive sign bit and x is a non-NaN floating-point number.
+   *     3) If x and y are both NaNs, then totalOrder reflects a total ordering
+   *        based on:
+   *         i) negative sign orders below positive sign
+   *         ii) signaling orders below quiet for +NaN, reverse for −NaN
+   *         iii) lesser payload, when regarded as an integer, orders below
+   *              greater payload for +NaN, reverse for −NaN.
+   *
+   * Note that this ordering can be implemented efficiently in software by bit-wise
+   * operations on the integer representation of the floating point values.
+   * E.g., this is a possible implementation for DOUBLE in Rust:
+   *
+   *   pub fn totalOrder(x: f64, y: f64) -> bool {
+   *     let mut x_int = x.to_bits() as i64;
+   *     let mut y_int = y.to_bits() as i64;
+   *     x_int ^= (((x_int >> 63) as u64) >> 1) as i64;
+   *     y_int ^= (((y_int >> 63) as u64) >> 1) as i64;
+   *     return x_int <= y_int;
+   *   }
+   *
+   * When writing statistics for columns with this order, the following rules
+   * must be followed:
+   * - Writing the nan_count field is mandatory when using this ordering.
+   * - Min and max statistics must contain the smallest and largest non-NaN
+   *   values respectively, or if all non-null values are NaN, the smallest and
+   *   largest NaN values as defined by IEEE 754 total order.
+   *
+   * When reading statistics for columns with this order, the following rules
+   * should be followed:
+   * - Readers should consult the nan_count field to determine whether NaNs
+   *   are present.
+   * - A reader can compute nan_count + null_count == num_values to deduce
+   *   whether all non-null values are NaN. In the page index, which does not
+   *   have a num_values field, the presence of a NaN value in min_values
+   *   or max_values indicates that all non-null values are NaN.
+   */
+  2: IEEE754TotalOrder IEEE_754_TOTAL_ORDER;
 }
 
 struct PageLocation {
@@ -1199,6 +1286,18 @@ struct ColumnIndex {
    * Such more compact values must still be valid values within the column's
    * logical type. Readers must make sure that list entries are populated before
    * using them by inspecting null_pages.
+   *
+   * For columns of physical type FLOAT or DOUBLE, or logical type FLOAT16,
+   * NaN values are not to be included in these bounds. If all non-null values
+   * of a page are NaN, then a writer must do the following:
+   * - If the order of this column is TYPE_ORDER, then a column index must not
+   *   be written for this column chunk. While this is unfortunate for
+   *   performance, it is necessary to avoid conflict with legacy files that
+   *   still included NaN in min_values and max_values even if the page had
+   *   non-NaN values. To mitigate this, IEEE_754_TOTAL_ORDER is recommended.
+   * - If the order of this column is IEEE_754_TOTAL_ORDER, then min_values[i]
+   *   and max_values[i] of that page must be set to the smallest and largest
+   *   NaN values as defined by IEEE 754 total order.
    */
   2: required list<binary> min_values
   3: required list<binary> max_values
@@ -1240,6 +1339,15 @@ struct ColumnIndex {
    * Same as repetition_level_histograms except for definitions levels.
    **/
   7: optional list<i64> definition_level_histograms;
+
+  /**
+   * A list containing the number of NaN values for each page. Only present
+   * for columns of physical type FLOAT or DOUBLE, or logical type FLOAT16.
+   * If this field is not present, readers MUST assume that there might be
+   * NaN values in any page.
+   */
+  8: optional list<i64> nan_counts
+
 }
 
 struct AesGcmV1 {