[GH-2830] Improve Geography query support - core#2831
[GH-2830] Improve Geography query support - core#2831zhangfengcdt wants to merge 29 commits intoapache:masterfrom
Conversation
…ation with lazy-parsed JTS and S2 caches. This enables GeoArrow-compatible serialization while maintaining full S2 functionality on demand.
- Functions.distance(Geography, Geography) → Spheroid.distance() — returns meters - Functions.area(Geography) → Spheroid.area() — returns m² - Functions.length(Geography) → Spheroid.length() — returns meters - All route through WKBGeography.getJTSGeometry() (no S2 parse needed)
also, added getShapeIndexGeography() as a third lazy cache
- null type ambiguity - level 1 functions missing Geography dispatch
…hole winding order
|
@zhangfengcdt Is this PR ready for review? It has lots of unnecessary content (e.g., the benchmark folder). Please also break this huge PR to several small pieces so we can review piece by piece. |
I am still working on it. I will clean up the benchmark codes and also for this PR, it will focus on building the core architecture of the WKB-based Geography serialization with cached S2. I will keep a few core ST functions and tests and move other to individual PRs following the merging. |
|
ST Function Performance: Geography vs Geometry (cached objects, ns/op)
|
paleolimbot
left a comment
There was a problem hiding this comment.
This seems like it is headed in the right direction!
The WKBGeography is the right direction I think (In C++ I call this a GeoArrowGeography and it's slightly more general but the same idea). As written, I am not sure it is able to make anything faster (I am guessing that Spark already handles filter()s and won't materialize a Java object unless it will actually be used for something).
This may not be true in Java, but in C++ the S2Polygon and maybe S2Polyline has an internal index for itself and also one for each loop (lazily constructed, but almost always needed for initializing from WKB to figure out the nesting of shells/holes). The optimization of implementing the S2Shape interface directly on WKB is pretty much 100% to avoid those extra index builds (so that for any given function there's at most one shape index per object that is built). I don't know how flexible the S2Shape is in Java (in C++ implementing it was not too bad). I think you would need to do something similar to see improved performance (but maybe the first thing you want to do is get the functions and tests in place).
| new org.locationtech.jts.io.WKBWriter( | ||
| 2, org.locationtech.jts.io.ByteOrderValues.BIG_ENDIAN); |
There was a problem hiding this comment.
You probably want little endian here?
| if (result == null) { | ||
| try { | ||
| WKBReader reader = new WKBReader(); | ||
| result = reader.read(wkbBytes); |
There was a problem hiding this comment.
I am guessing that this line is very expensive (it is in C++). The recent redo of s2geography in C++ that I just did is mostly about avoiding this line (e.g., by manually iterating over edges and using lower-level S2 primitives.
| @Override | ||
| public int dimension() { | ||
| return getS2Geography().dimension(); | ||
| } |
There was a problem hiding this comment.
This one you can get from the first few WKB bytes
| @Override | ||
| public int numShapes() { | ||
| return getS2Geography().numShapes(); | ||
| } | ||
|
|
||
| @Override | ||
| public S2Shape shape(int id) { | ||
| return getS2Geography().shape(id); | ||
| } |
There was a problem hiding this comment.
I optimized this piece with a few hundred lines of code implementing the S2Shape interface on top of WKB. If you have this, you can avoid the Geography entirely (because actual functions usually need a ShapeIndex).
| @Override | ||
| public S2Region region() { | ||
| return getS2Geography().region(); | ||
| } |
There was a problem hiding this comment.
I optimized this one for points using the S2PointRegion for points to avoid constructing Geography. That might not matter here (in general when implementing the functions I also made every attempt to avoid accessing the Region to avoid the index build)
| @Override | ||
| public void getCellUnionBound(List<com.google.common.geometry.S2CellId> cellIds) { | ||
| getS2Geography().getCellUnionBound(cellIds); | ||
| } |
There was a problem hiding this comment.
This one can be avoided for a Point by just returning its cell
| /** | ||
| * Geometry-to-geometry geodesic distance in meters. Uses S2ClosestEdgeQuery for true minimum | ||
| * distance between any two points on the geometries (not centroid-to-centroid). Consistent with | ||
| * sedona-db's s2_distance implementation. | ||
| */ | ||
| public static Double distance(Geography g1, Geography g2) { | ||
| if (g1 == null || g2 == null) return null; | ||
| Distance dist = new Distance(); | ||
| double radians = dist.S2_distance(toShapeIndex(g1), toShapeIndex(g2)); | ||
| return radiansToMeters(radians); | ||
| } |
There was a problem hiding this comment.
It was a bit of a rabbit hole, but this is now several hundred lines in C++ to avoid building the shape index in all cases except (1) very long linestrings and (2) polygons
| /** Spherical containment test using S2 boolean operations. */ | ||
| public static boolean contains(Geography g1, Geography g2) { | ||
| if (g1 == null || g2 == null) return false; | ||
| Predicates pred = new Predicates(); | ||
| return pred.S2_contains(toShapeIndex(g1), toShapeIndex(g2), s2Options()); | ||
| } |
There was a problem hiding this comment.
This one has a similar array of optimizations for small geometries, including for small polygons (I probably should have put the small polygon optimization in the distance path as well but I ran out of will)
Did you read the Contributor Guide?
Is this PR related to a ticket?
[GH-XXX] my subject. Closes #<issue_number>What changes were proposed in this PR?
Implements WKB-based Geography serialization (Option B: WKB with Cached S2) and a full set of Geography ST functions.
Core architecture:
Geography functions (3 new):
Docs: API docs for all 3 new functions in docs/api/sql/geography/
Note: Geography-aware spatial join partitioning using S2 cells will be in a separate PR
How was this patch tested?
Did this PR include necessary documentation updates?