How to configure/tweak Parquet for storing and retrieving vector embeddings? #553

@rahil-c

Describe the usage question you have. Please include as many useful details as possible.

Hi Parquet community,

I'm an engineer who works on open source data infrastructure, primarily in the Java ecosystem. I'd like to ask the community how best to configure and tune Parquet today for storing and retrieving vectors. I'm also interested in future work in this area, but for now I'm mainly curious about what users can tweak.

From what I've seen, most models generate vector embeddings as arrays of floating-point values with roughly 700-1500 dimensions (about 3 KB - 6 KB per vector, assuming 32-bit floats), so my questions are based on this input.
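For reference, the size estimate above can be reproduced with a minimal sketch. It assumes each element is a 4-byte (32-bit) float and counts only the raw payload, ignoring any Parquet encoding, nesting, or metadata overhead. The function name `vector_bytes` is just illustrative.

```python
# Rough payload size of one embedding stored as 32-bit floats.
# Assumption: 4 bytes per element, no Parquet overhead counted.
BYTES_PER_FLOAT32 = 4


def vector_bytes(dimensions: int) -> int:
    """Return the raw size in bytes of one vector with the given dimension count."""
    return dimensions * BYTES_PER_FLOAT32


if __name__ == "__main__":
    for dims in (700, 1500):
        print(f"{dims} dimensions -> {vector_bytes(dims)} bytes")
```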

  1. Since there is currently no native logical VECTOR type in Parquet, which of the existing types is recommended for writing vectors? My assumption is that most users today would use Parquet's LIST of FLOAT, but are there better ways to represent this? Are there plans for Parquet to add such a type in the future?
  2. Since vectors are high-cardinality, encodings such as DICTIONARY and RLE might not be very useful. Is there a recommended encoding to use instead?
  3. Is there a recommended compression codec for vectors? If not, is there a way to disable compression per column in parquet-java?
  4. Should users disable statistics on these columns?
  5. Is there a recommendation for tuning row group size and page size for vectors?
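To make question 5 concrete, here is a rough back-of-the-envelope sketch of how many vectors fit in one page and one row group. It assumes 32-bit floats, uncompressed data, no encoding overhead, and size targets of 1 MiB per page and 128 MiB per row group (which I believe are the parquet-java defaults). The constant and function names are hypothetical.

```python
# How many uncompressed float32 vectors fit in a page or row group.
# Assumptions: 4 bytes per element, no encoding/metadata overhead,
# and the size targets below (believed to match parquet-java defaults).
FLOAT32_BYTES = 4
MIB = 1024 * 1024
ASSUMED_PAGE_BYTES = 1 * MIB
ASSUMED_ROW_GROUP_BYTES = 128 * MIB


def vectors_per_chunk(chunk_bytes: int, dimensions: int) -> int:
    """Return how many whole vectors of the given dimension fit in chunk_bytes."""
    return chunk_bytes // (dimensions * FLOAT32_BYTES)


if __name__ == "__main__":
    for dims in (700, 1500):
        per_page = vectors_per_chunk(ASSUMED_PAGE_BYTES, dims)
        per_group = vectors_per_chunk(ASSUMED_ROW_GROUP_BYTES, dims)
        print(f"{dims} dims: {per_page} vectors/page, {per_group} vectors/row group")
```

With these assumptions a page holds only a few hundred vectors, which is why I'm wondering whether the usual size targets suit this workload.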

Thanks again for your help. If there is a roadmap you can point me to for the changes happening in Parquet around this, it would be highly appreciated.

cc @julienledem @emkornfield
