[GH-2830] Improve Geography query support - core by zhangfengcdt · Pull Request #2831 · apache/sedona

zhangfengcdt · 2026-04-08T15:28:58Z

Did you read the Contributor Guide?

Yes, I have read the Contributor Rules and Contributor Development Guide

Is this PR related to a ticket?

Yes, and the PR name follows the format [GH-XXX] my subject. Closes #<issue_number>

What changes were proposed in this PR?

Implements WKB-based Geography serialization (Option B: WKB with Cached S2) and an initial set of three Geography ST functions.

Core architecture:

WKBGeography — stores WKB bytes as primary representation with lazy-parsed JTS, S2, and ShapeIndex caches (double-checked locking for thread safety)
GeographyWKBSerializer — WKB serializer with 0xFF format byte, backward-compatible with legacy S2-native format
GeographyUDT, implicits.scala, GeometrySerde — switched to WKBSerializer for all serialization paths
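The double-checked locking mentioned for the lazy caches can be sketched as below. This is a minimal illustration rather than the PR's code: `LazyParsed` and its `parser` argument are hypothetical names, and it shows a single cache where WKBGeography is described as keeping separate JTS, S2, and ShapeIndex caches.

```java
import java.util.function.Function;

// Hypothetical stand-in for a WKB-backed value with one lazily parsed cache.
// The names are illustrative and are not taken from the PR.
final class LazyParsed<T> {
    private final byte[] wkb;                  // primary representation
    private final Function<byte[], T> parser;  // expensive parse step, must not return null
    private volatile T cached;                 // volatile gives safe publication across threads

    LazyParsed(byte[] wkb, Function<byte[], T> parser) {
        this.wkb = wkb;
        this.parser = parser;
    }

    T get() {
        T local = cached;              // first check, no lock taken
        if (local == null) {
            synchronized (this) {
                local = cached;        // second check, under the lock
                if (local == null) {
                    local = parser.apply(wkb);
                    cached = local;
                }
            }
        }
        return local;
    }
}
```

Reading the field into a local variable keeps the fast path to a single volatile read; the parser runs at most once even when several threads call `get()` at the same time.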

Geography functions (3 new):

Level 1 (JTS): ST_NPoints
Level 2 (JTS + Spheroid): ST_Distance
Level 3 (S2): ST_Contains

Docs: API docs for all 3 new functions in docs/api/sql/geography/

Note: Geography-aware spatial join partitioning using S2 cells will be in a separate PR

How was this patch tested?

All unit tests pass in the common module (WKBGeographyTest, FunctionTest)
GeographyFunctionTest.scala — Spark SQL integration tests covering constructors, structural functions, metrics, predicates, DataFrame API, and serialization round-trips

Did this PR include necessary documentation updates?

Yes, I have updated the documentation.

…ation with lazy-parsed JTS and S2 caches. This enables GeoArrow-compatible serialization while maintaining full S2 functionality on demand.
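One way to picture the 0xFF format byte from the PR description is as a one-byte prefix in front of the WKB payload. The sketch below is an assumption about the framing, not the actual GeographyWKBSerializer layout: the class `GeographyFraming`, its method names, and the premise that a legacy S2-native payload never starts with 0xFF are all hypothetical.

```java
import java.util.Arrays;

// Hypothetical framing helper; the real serializer layout may differ.
final class GeographyFraming {
    static final byte WKB_FORMAT = (byte) 0xFF;  // marker byte named in the PR description

    // Prefix the WKB payload with the format byte.
    static byte[] wrap(byte[] wkb) {
        byte[] out = new byte[wkb.length + 1];
        out[0] = WKB_FORMAT;
        System.arraycopy(wkb, 0, out, 1, wkb.length);
        return out;
    }

    // True when the buffer uses the WKB framing.
    // Assumption: a legacy S2-native payload never starts with 0xFF.
    static boolean isWkb(byte[] buf) {
        return buf.length > 0 && buf[0] == WKB_FORMAT;
    }

    // Strip the marker and return the WKB payload.
    static byte[] unwrap(byte[] buf) {
        if (!isWkb(buf)) {
            throw new IllegalArgumentException("not WKB-framed; route to the legacy reader");
        }
        return Arrays.copyOfRange(buf, 1, buf.length);
    }
}
```

A reader built this way would check `isWkb` first and fall back to the legacy decoder otherwise, which is one plausible reading of "backward-compatible".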

- Functions.distance(Geography, Geography) → Spheroid.distance(), returns meters
- Functions.area(Geography) → Spheroid.area(), returns m²
- Functions.length(Geography) → Spheroid.length(), returns meters
- All route through WKBGeography.getJTSGeometry() (no S2 parse needed)
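To give a feel for what a distance in meters on lon/lat input involves, here is a small haversine sketch. It is a deliberately simpler technique than the one named above: it treats the Earth as a sphere, whereas Spheroid.distance() works on an ellipsoid, so results will differ slightly. The class name and the mean-radius constant are assumptions made for illustration.

```java
// Spherical approximation only; not the ellipsoidal computation used by Spheroid.distance().
final class SphereDistance {
    static final double MEAN_EARTH_RADIUS_M = 6_371_008.8;  // assumed mean Earth radius in meters

    // Great-circle distance in meters between two lon/lat points given in degrees.
    static double haversineMeters(double lon1, double lat1, double lon2, double lat2) {
        double phi1 = Math.toRadians(lat1);
        double phi2 = Math.toRadians(lat2);
        double dPhi = phi2 - phi1;
        double dLambda = Math.toRadians(lon2 - lon1);
        double sinPhi = Math.sin(dPhi / 2);
        double sinLambda = Math.sin(dLambda / 2);
        double a = sinPhi * sinPhi + Math.cos(phi1) * Math.cos(phi2) * sinLambda * sinLambda;
        return 2 * MEAN_EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
    }
}
```

One degree of longitude along the equator comes out at roughly 111.2 km under this spherical model.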

Also added getShapeIndexGeography() as a third lazy cache.

- null type ambiguity
- level 1 functions missing Geography dispatch

…hole winding order

jiayuasu · 2026-04-09T01:30:01Z

@zhangfengcdt Is this PR ready for review? It has lots of unnecessary content (e.g., the benchmark folder). Please also break this huge PR into several smaller pieces so we can review it piece by piece.

zhangfengcdt · 2026-04-09T22:14:13Z

> @zhangfengcdt Is this PR ready for review? It has lots of unnecessary content (e.g., the benchmark folder). Please also break this huge PR into several smaller pieces so we can review it piece by piece.

I am still working on it. I will clean up the benchmark code, and this PR will focus on building the core architecture of the WKB-based Geography serialization with cached S2. I will keep a few core ST functions and tests and move the others to individual PRs after this one is merged.

…unctions

zhangfengcdt · 2026-04-11T16:45:55Z

ST Function Performance: Geography vs Geometry (cached objects, ns/op)

ST_NPoints (Level 1 — JTS accessor)

  ┌─────────────────────┬───────────┬──────────┬───────┐
  │        Shape        │ Geography │ Geometry │ Ratio │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Point               │         2 │        2 │    1x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ LineString (16 vtx) │         2 │        2 │    1x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Polygon (16 vtx)    │         2 │        2 │    1x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Polygon (64 vtx)    │         2 │        2 │    1x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Polygon (500 vtx)   │         2 │        2 │    1x │
  └─────────────────────┴───────────┴──────────┴───────┘

ST_Distance (Level 2 — S2 geodesic distance)

  ┌─────────────────────┬───────────┬──────────┬───────┐
  │        Shape        │ Geography │ Geometry │ Ratio │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Point               │       269 │       12 │   22x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ LineString (16 vtx) │     1,576 │      373 │  4.2x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Polygon (16 vtx)    │     1,419 │      613 │  2.3x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Polygon (64 vtx)    │    69,279 │    3,874 │   18x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Polygon (500 vtx)   │   224,696 │  129,518 │  1.7x │
  └─────────────────────┴───────────┴──────────┴───────┘
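The Ratio column appears to be the Geography time divided by the Geometry time for the same row. The helper below reproduces the ST_Distance ratios under that reading; the class name and the formatting rule (whole numbers at 10x and above, one decimal below) are assumptions, not something the benchmark output states.

```java
import java.util.Locale;

// Hypothetical helper for reading the benchmark tables.
final class BenchRatio {
    // Slowdown factor of Geography relative to Geometry for one row (both in ns/op).
    static double ratio(double geographyNs, double geometryNs) {
        return geographyNs / geometryNs;
    }

    // Assumed display rule: no decimals at 10x and above, one decimal below.
    static String format(double r) {
        return r >= 10
                ? String.format(Locale.ROOT, "%.0fx", r)
                : String.format(Locale.ROOT, "%.1fx", r);
    }
}
```

For example, the Point row gives 269 / 12, which is about 22.4, and the 500-vertex Polygon row gives 224,696 / 129,518, which is about 1.7.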

ST_Contains (Level 3 — S2 predicate)

  ┌─────────────────────┬───────────┬──────────┬───────┐
  │        Shape        │ Geography │ Geometry │ Ratio │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Point               │       284 │        8 │   36x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ LineString (16 vtx) │       664 │        8 │   83x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Polygon (16 vtx)    │       684 │        8 │   86x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Polygon (64 vtx)    │       677 │        8 │   87x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Polygon (500 vtx)   │       703 │        8 │   88x │
  └─────────────────────┴───────────┴──────────┴───────┘

paleolimbot

This seems like it is headed in the right direction!

I think WKBGeography is the right approach (in C++ I call this a GeoArrowGeography; it is slightly more general, but the same idea). As written, though, I am not sure it can make anything faster (I am guessing that Spark already handles filter()s and won't materialize a Java object unless it will actually be used for something).

This may not be true in Java, but in C++ S2Polygon (and maybe S2Polyline) has an internal index for itself and also one for each loop (lazily constructed, but almost always needed when initializing from WKB to figure out the nesting of shells and holes). The point of implementing the S2Shape interface directly on WKB is almost entirely to avoid those extra index builds, so that for any given function at most one shape index is built per object. I don't know how flexible S2Shape is in Java (in C++ implementing it was not too bad). I think you would need to do something similar to see improved performance, but maybe the first thing to do is get the functions and tests in place.
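The idea of exposing edges straight from the WKB bytes, rather than first building polygon or polyline objects with their own indexes, might look roughly like the sketch below. It is only an illustration: `EdgeView` is a hypothetical minimal interface and not the S2Shape interface from the S2 library, it handles only a 2D WKB LineString, and it returns raw lon/lat pairs instead of points on the sphere.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical minimal edge view; this is NOT the S2Shape interface from the S2 library.
interface EdgeView {
    int numEdges();

    // Writes {x0, y0, x1, y1} of edge i into out.
    void edge(int i, double[] out);
}

// Reads the edges of a 2D WKB LineString directly from the bytes,
// without materializing vertex or geometry objects.
final class WkbLineStringEdges implements EdgeView {
    private static final int HEADER = 1 + 4 + 4;       // byte order + geometry type + point count
    private static final int POINT = 2 * Double.BYTES; // x and y as doubles
    private final ByteBuffer buf;
    private final int numPoints;

    WkbLineStringEdges(byte[] wkb) {
        ByteOrder order = wkb[0] == 1 ? ByteOrder.LITTLE_ENDIAN : ByteOrder.BIG_ENDIAN;
        buf = ByteBuffer.wrap(wkb).order(order);
        if (buf.getInt(1) != 2) {
            throw new IllegalArgumentException("expected a 2D WKB LineString (type 2)");
        }
        numPoints = buf.getInt(5);
    }

    @Override
    public int numEdges() {
        return Math.max(0, numPoints - 1);
    }

    @Override
    public void edge(int i, double[] out) {
        if (i < 0 || i >= numEdges()) {
            throw new IndexOutOfBoundsException("edge " + i);
        }
        // Points i and i + 1 are adjacent in the buffer, so one edge is four consecutive doubles.
        int off = HEADER + i * POINT;
        for (int k = 0; k < 4; k++) {
            out[k] = buf.getDouble(off + k * Double.BYTES);
        }
    }
}
```

A single shape index built over such a view would be the only index per object, which is the saving described above; whether the Java S2 library allows a custom shape like this to be plugged in is exactly the open question raised in the comment.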

paleolimbot · 2026-04-14T13:55:53Z