Skip to content

[GH-2830] Improve Geography query support - core#2831

Open
zhangfengcdt wants to merge 29 commits intoapache:masterfrom
zhangfengcdt:feature/geography.support
Open

[GH-2830] Improve Geography query support - core#2831
zhangfengcdt wants to merge 29 commits intoapache:masterfrom
zhangfengcdt:feature/geography.support

Conversation

@zhangfengcdt
Copy link
Copy Markdown
Member

@zhangfengcdt zhangfengcdt commented Apr 8, 2026

Did you read the Contributor Guide?

Is this PR related to a ticket?

  • Yes, and the PR name follows the format [GH-XXX] my subject. Closes #<issue_number>

What changes were proposed in this PR?

Implements WKB-based Geography serialization (Option B: WKB with Cached S2) and a full set of Geography ST functions.

Core architecture:

  • WKBGeography — stores WKB bytes as primary representation with lazy-parsed JTS, S2, and ShapeIndex caches (double-checked locking for thread safety)
  • GeographyWKBSerializer — WKB serializer with 0xFF format byte, backward-compatible with legacy S2-native format
  • GeographyUDT, implicits.scala, GeometrySerde — switched to WKBSerializer for all serialization paths

Geography functions (3 new):

  • Level 1 (JTS): ST_NPoints
  • Level 2 (JTS + Spheroid): ST_Distance
  • Level 3 (S2): ST_Contains

Docs: API docs for all 3 new functions in docs/api/sql/geography/

Note: Geography-aware spatial join partitioning using S2 cells will be in a separate PR

How was this patch tested?

  • all unit tests pass in common module (WKBGeographyTest, FunctionTest)
  • GeographyFunctionTest.scala — Spark SQL integration tests covering constructors, structural functions, metrics, predicates, DataFrame API, and serialization round-trips

Did this PR include necessary documentation updates?

  • Yes, I have updated the documentation.

@zhangfengcdt zhangfengcdt requested a review from jiayuasu as a code owner April 8, 2026 15:28
@jiayuasu
Copy link
Copy Markdown
Member

jiayuasu commented Apr 9, 2026

@zhangfengcdt Is this PR ready for review? It has lots of unnecessary content (e.g., the benchmark folder). Please also break this huge PR to several small pieces so we can review piece by piece.

@zhangfengcdt zhangfengcdt marked this pull request as draft April 9, 2026 22:12
@zhangfengcdt
Copy link
Copy Markdown
Member Author

@zhangfengcdt Is this PR ready for review? It has lots of unnecessary content (e.g., the benchmark folder). Please also break this huge PR to several small pieces so we can review piece by piece.

I am still working on it. I will clean up the benchmark codes and also for this PR, it will focus on building the core architecture of the WKB-based Geography serialization with cached S2. I will keep a few core ST functions and tests and move other to individual PRs following the merging.

@zhangfengcdt zhangfengcdt changed the title [GH-2830] Improve Geography query support - functions [GH-2830] Improve Geography query support - core Apr 11, 2026
@zhangfengcdt
Copy link
Copy Markdown
Member Author

ST Function Performance: Geography vs Geometry (cached objects, ns/op)

  • ST_NPoints (Level 1 — JTS accessor)
  ┌─────────────────────┬───────────┬──────────┬───────┐
  │        Shape        │ Geography │ Geometry │ Ratio │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Point               │         2 │        2 │    1x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ LineString (16 vtx) │         2 │        2 │    1x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Polygon (16 vtx)    │         2 │        2 │    1x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Polygon (64 vtx)    │         2 │        2 │    1x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Polygon (500 vtx)   │         2 │        2 │    1x │
  └─────────────────────┴───────────┴──────────┴───────┘
  • ST_Distance (Level 2 — S2 geodesic distance)
  ┌─────────────────────┬───────────┬──────────┬───────┐
  │        Shape        │ Geography │ Geometry │ Ratio │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Point               │       269 │       12 │   22x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ LineString (16 vtx) │     1,576 │      373 │  4.2x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Polygon (16 vtx)    │     1,419 │      613 │  2.3x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Polygon (64 vtx)    │    69,279 │    3,874 │   18x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Polygon (500 vtx)   │   224,696 │  129,518 │  1.7x │
  └─────────────────────┴───────────┴──────────┴───────┘
  • ST_Contains (Level 3 — S2 predicate)
  ┌─────────────────────┬───────────┬──────────┬───────┐
  │        Shape        │ Geography │ Geometry │ Ratio │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Point               │       284 │        8 │   36x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ LineString (16 vtx) │       664 │        8 │   83x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Polygon (16 vtx)    │       684 │        8 │   86x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Polygon (64 vtx)    │       677 │        8 │   87x │
  ├─────────────────────┼───────────┼──────────┼───────┤
  │ Polygon (500 vtx)   │       703 │        8 │   88x │
  └─────────────────────┴───────────┴──────────┴───────┘


@zhangfengcdt zhangfengcdt marked this pull request as ready for review April 13, 2026 15:58
Copy link
Copy Markdown
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it is headed in the right direction!

The WKBGeography is the right direction I think (In C++ I call this a GeoArrowGeography and it's slightly more general but the same idea). As written, I am not sure it is able to make anything faster (I am guessing that Spark already handles filter()s and won't materialize a Java object unless it will actually be used for something).

This may not be true in Java, but in C++ the S2Polygon and maybe S2Polyline has an internal index for itself and also one for each loop (lazily constructed, but almost always needed for initializing from WKB to figure out the nesting of shells/holes). The optimization of implementing the S2Shape interface directly on WKB is pretty much 100% to avoid those extra index builds (so that for any given function there's at most one shape index per object that is built). I don't know how flexible the S2Shape is in Java (in C++ implementing it was not too bad). I think you would need to do something similar to see improved performance (but maybe the first thing you want to do is get the functions and tests in place).

Comment on lines +84 to +85
new org.locationtech.jts.io.WKBWriter(
2, org.locationtech.jts.io.ByteOrderValues.BIG_ENDIAN);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably want little endian here?

if (result == null) {
try {
WKBReader reader = new WKBReader();
result = reader.read(wkbBytes);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am guessing that this line is very expensive (it is in C++). The recent redo of s2geography in C++ that I just did is mostly about avoiding this line (e.g., by manually iterating over edges and using lower-level S2 primitives.

Comment on lines +169 to +172
@Override
public int dimension() {
return getS2Geography().dimension();
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one you can get from the first few WKB bytes

Comment on lines +174 to +182
@Override
public int numShapes() {
return getS2Geography().numShapes();
}

@Override
public S2Shape shape(int id) {
return getS2Geography().shape(id);
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I optimized this piece with a few hundred lines of code implementing the S2Shape interface on top of WKB. If you have this, you can avoid the Geography entirely (because actual functions usually need a ShapeIndex).

Comment on lines +184 to +187
@Override
public S2Region region() {
return getS2Geography().region();
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I optimized this one for points using the S2PointRegion for points to avoid constructing Geography. That might not matter here (in general when implementing the functions I also made every attempt to avoid accessing the Region to avoid the index build)

Comment on lines +189 to +192
@Override
public void getCellUnionBound(List<com.google.common.geometry.S2CellId> cellIds) {
getS2Geography().getCellUnionBound(cellIds);
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one can be avoided for a Point by just returning its cell

Comment on lines +89 to +99
/**
* Geometry-to-geometry geodesic distance in meters. Uses S2ClosestEdgeQuery for true minimum
* distance between any two points on the geometries (not centroid-to-centroid). Consistent with
* sedona-db's s2_distance implementation.
*/
public static Double distance(Geography g1, Geography g2) {
if (g1 == null || g2 == null) return null;
Distance dist = new Distance();
double radians = dist.S2_distance(toShapeIndex(g1), toShapeIndex(g2));
return radiansToMeters(radians);
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a bit of a rabbit hole, but this is now several hundred lines in C++ to avoid building the shape index in all cases except (1) very long linestrings and (2) polygons

Comment on lines +103 to +108
/** Spherical containment test using S2 boolean operations. */
public static boolean contains(Geography g1, Geography g2) {
if (g1 == null || g2 == null) return false;
Predicates pred = new Predicates();
return pred.S2_contains(toShapeIndex(g1), toShapeIndex(g2), s2Options());
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one has a similar array of optimizations for small geometries, including for small polygons (I probably should have put the small polygon optimization in the distance path as well but I ran out of will)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants