How to enable statistics for string columns? #5270

@jonashaag

Describe the bug

I'm using odbc2parquet (https://github.com/pacman82/odbc2parquet), which is built on this crate.

I observe that statistics like min/max are not written for string columns:

In [4]: pq.ParquetFile("/tmp/o2p").metadata.row_group(0).column(1)
Out[4]:
<pyarrow._parquet.ColumnChunkMetaData object at 0x1033c1080>
  file_offset: 1123
  file_path:
  physical_type: BYTE_ARRAY
  num_values: 100
  path_in_schema: XXX
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x103476070>
      has_min_max: False
      min: None
      max: None
      null_count: None
      distinct_count: None
      num_values: 100
      physical_type: BYTE_ARRAY
      logical_type: String
      converted_type (legacy): UTF8
  compression: ZSTD
  encodings: ('PLAIN', 'RLE', 'RLE_DICTIONARY')
  has_dictionary_page: True
  dictionary_page_offset: 394
  data_page_offset: 938
  total_compressed_size: 729
  total_uncompressed_size: 2993
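
To see whether this affects only the string column or others as well, the same check can be run over every column chunk in the file. The following is a minimal sketch, not part of odbc2parquet or pyarrow: the helper name `columns_missing_min_max` is hypothetical, and it only relies on the metadata attributes visible in the output above (`statistics`, `has_min_max`, `path_in_schema`) plus pyarrow's `num_row_groups`, `row_group()`, `num_columns` and `column()` accessors.

```python
def columns_missing_min_max(metadata):
    """Return (row_group_index, column_path) pairs whose chunk has no min/max.

    `metadata` is expected to look like pyarrow's FileMetaData, i.e. what
    pq.ParquetFile(path).metadata returns. The helper name is hypothetical.
    """
    missing = []
    for rg_index in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg_index)
        for col_index in range(row_group.num_columns):
            chunk = row_group.column(col_index)
            stats = chunk.statistics  # None when no statistics were written
            if stats is None or not stats.has_min_max:
                missing.append((rg_index, chunk.path_in_schema))
    return missing


# Usage (requires pyarrow to be installed):
#   import pyarrow.parquet as pq
#   print(columns_missing_min_max(pq.ParquetFile("/tmp/o2p").metadata))
```

If only the BYTE_ARRAY columns show up in the result, that would suggest the problem is specific to how string columns are written rather than statistics being disabled for the whole file.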

Relevant code: https://github.com/pacman82/odbc2parquet/blob/b571cad6fae1b58e1aab8348f14b32f20d6ec165/src/query/parquet_writer.rs#L47

To Reproduce

Use odbc2parquet to export any table that contains a string column, then inspect the column chunk metadata of the resulting file as shown above.

Expected behavior

String columns should have min/max statistics written to the column chunk metadata.
