
rust/sedona-spatial-join: Improve memory estimation and serialization for EvaluatedGeometryArray #745

@paleolimbot


There is at least one place where we may be double counting a large chunk of memory in the EvaluatedBatch:

        // NOTE: sometimes `geom_array` will reuse the memory of `batch`, especially when
        // the expression for evaluating the geometry is a simple column reference. In this case,
        // the in_mem_size will be overestimated. It is a conservative estimation so there's no risk
        // of running out of memory because of underestimation.
        let record_batch_size = get_record_batch_memory_size(&self.batch)?;
        let geom_array_size = self.geom_array.in_mem_size()?;
        Ok(record_batch_size + geom_array_size)

This might explain an issue we ran across when trying to enable this by default: we determined we'd need to set the memory pool size to more than twice the memory actually required for a join used in the released post. My reading of that comment is that we would be reserving ~2x as much memory as was required for most joins, but I have not investigated.
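To make the suspected double counting concrete, here is a minimal sketch of one possible fix: count each underlying allocation once by tracking buffer pointers. It uses only the standard library. `Buffer`, `naive_size`, and `deduplicated_size` are hypothetical stand-ins for illustration, not the actual Arrow or sedona-spatial-join types, and real Arrow buffers may be slices of a shared allocation, which this sketch does not model.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical stand-in for a reference-counted Arrow buffer.
type Buffer = Arc<Vec<u8>>;

/// Sums buffer sizes the way the current estimate effectively does:
/// every reference is counted, even if two references share memory.
fn naive_size<'a>(buffers: impl IntoIterator<Item = &'a Buffer>) -> usize {
    buffers.into_iter().map(|b| b.len()).sum()
}

/// Sums buffer sizes, counting each distinct allocation only once.
/// Allocations are identified by the address the `Arc` points to.
fn deduplicated_size<'a>(buffers: impl IntoIterator<Item = &'a Buffer>) -> usize {
    let mut seen: HashSet<*const Vec<u8>> = HashSet::new();
    let mut total = 0;
    for buf in buffers {
        // `insert` returns false if this allocation was already counted.
        if seen.insert(Arc::as_ptr(buf)) {
            total += buf.len();
        }
    }
    total
}

fn main() {
    let geometry: Buffer = Arc::new(vec![0u8; 1024]);
    let attributes: Buffer = Arc::new(vec![0u8; 256]);

    // The batch holds both columns; the geometry array is a plain column
    // reference, so it reuses the batch's geometry buffer.
    let batch_buffers = vec![geometry.clone(), attributes.clone()];
    let geom_array_buffers = vec![geometry.clone()];

    let all = batch_buffers.iter().chain(geom_array_buffers.iter());
    println!("naive: {}", naive_size(all.clone()));
    println!("deduplicated: {}", deduplicated_size(all));
}
```

In this toy setup the naive sum counts the 1024-byte geometry buffer twice, which mirrors the case described in the code comment where the geometry expression is a simple column reference.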
