Skip to content

Optimize native shuffle to write schema once per output partition instead of once per batch #2928

@andygrove

Description

@andygrove

What is the problem the feature request solves?

Native shuffle currently encodes the name of the compression codec and the IPC schema once per batch. Storing the schema per batch was originally a requirement because the schema could vary between batches due to dictionary encoding but this is no longer the case.

Describe the potential solution

Write the compression codec name and schema once per partittion. This can be implemented in ShuffleBlockWriter::try_new. Update ShuffleBlockWriter::write_batch to no longer write the header per batch.

Make corresponding changes in the reader NativeBlockDecoderIterator.scala.

We need to make sure that this approach works correctly with spilling as well.

Additional context

No response

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions