Enhance EPT reading with concurrency and other performance fixes#313
Open
agrigoriev wants to merge 50 commits into
Open
Enhance EPT reading with concurrency and other performance fixes#313agrigoriev wants to merge 50 commits into
agrigoriev wants to merge 50 commits into
Conversation
The originally-planned counting variant (custom httpuv app$call handler) deadlocks because httpuv R callbacks share the R event loop with the synchronous lasR exec, so the test must use the C-native staticPaths path. Metadata cache invariants are covered at the C++ layer in test-ept.R via cpp_ept_partition_inspect; this test exercises the remote parallel path end-to-end.
The previous guard ran inside the loop body (`if (d >= 16) break;`), which still evaluated the shift `1 << 32` once when d hit 16. Move the cap into the while condition so the shift is only evaluated for d < 16. Unreachable from the execute.cpp path (clamp keeps d <= 6) but reachable from cpp_ept_partition_inspect with a large target_partitions.
Lift the per-point ownership decision out of the two process() loops
into a static helper LASReptreader::strict_clip_decide that returns
DROP/CORE/BUFFERED. The two loops now call it instead of duplicating
the predicate inline.
Expose the predicate via cpp_strict_clip_decide in the test RCPP module
so R-level unit tests can exercise edge cases that integration data
rarely hits — exact partition midlines, the global-max-inclusive
carve-out, half-open xmin inclusion, zero-buffer collapse. Closes the
spec's test 5 ("strict-clip edge boundary policy") that Task 13's
plan dropped.
Strict-clip flags interior-boundary points BUFFERED, but summary.cpp and writelas.cpp recomputed inside_buffer() geometrically — reaching the opposite verdict at the boundary and double-counting points across adjacent partition chunks. Short-circuit on get_buffered() before the geometric fallback so the reader's flag wins. Adds cpp_summary_buffer_decide / cpp_writelas_buffer_decide unit-test helpers pinning the contract.
Adds Y-axis and circular assertions to the cpp_summary_buffer_decide / cpp_writelas_buffer_decide blocks, plus the case the production fix actually targets: flag=TRUE with px=xmax (non-global boundary), where the reader marks the point BUFFERED but inside_buffer returns false.
Adds queries_owner_xmax/ymax parallel vectors and a new rect add_query overload carrying the owner-max pair. get_chunk_with_query honors the override when set; NaN falls through to the catalog's xmax/ymax — no behavior change for existing callers. The semantic of chunk.catalog_xmax/ymax is now 'outer boundary beyond which the chunk includes the rightmost edge inclusively.' For full-extent reads, that's the catalog. For AOI sub-queries (next commit), it's the AOI's clamped outer bbox.
Preserves the parallel-vector size invariant when callers reset the catalog state before re-installing queries. Latent today (no caller exercises this path), but Task 6's partition_ept rebuild path will.
Extends the inspector to install user queries before invoking partition_ept and reports per-chunk shape_type, owner_xmax, owner_ymax alongside the existing bbox/strict_clip output. Lets the upcoming AOI partition tests assert circle passthrough, owner-bounds propagation, and mixed rect+circle behavior directly.
Lifts the queries.size() > 0 short-circuit. With-queries entry now splits into a hierarchy-failure passthrough and a partitioning tail (stubbed in this commit; filled in by the next commits). Adds a clone_shape free helper used by the upcoming rebuild. No behavior change for the no-queries path. Existing AOI reads still behave as before (single chunk, no partitioning); two regression tests pin circle and out-of-extent passthrough.
* Inspector now reads shape_type from queries[i]->type() (via new FileCollection::get_query_type accessor) rather than chunk.shape, which is zeroed by get_chunk_with_query for out-of-extent queries. The out-of-extent rect passthrough test now actually pins shape preservation; previously a destroyed Shape would have silently collapsed to "rect" via the UNKNOWN→default branch. * clone_shape moved out of anonymous namespace to bare static, matching the file's existing file-local helper style (is_remote_path, filename_from_url). * Restore the d<16 cap comment that explains the UB-prevention intent of commit 8919501.
Pins the with-queries fallback: a fixture where build_metadata succeeds (root readable, probe tile openable) but ensure_tiles throws (a non-root sub-page contains malformed JSON) must leave the user's AOI intact and emit a warning, never clobber it via set_chunk_size.
The existing ept-test-multi fixture's root hierarchy contains only positive-count leaves, so the skip_if guard fired and the test never ran. Patch the copied root to inject a -1 sub-page pointer ourselves, then write the malformed JSON to that path. build_metadata returns on the first existing leaf probe; ensure_tiles walks the whole hierarchy and now genuinely throws on the malformed sub-page.
* on.exit cleanup so the broken fixture doesn't leak across runs. * stopifnot(all(file.copy(...))) so a partial fixture copy errors loudly rather than producing a green test via an unrelated hierarchy warning. * Tighten the expect_match regex to bind specifically to the partition_ept with-queries fallback warning rather than any line that happens to contain "hierarchy".
For each rectangular AOI, emit one strict-clipped sub-query per non-empty grid cell that overlaps (cell ∩ query ∩ conf_bounds). Owner bounds (catalog_xmax/ymax) propagate from the CLAMPED aoi_bbox so the strict-clip rightmost-edge carve-out fires at the AOI's outer edge, not the catalog's — partitioned and un-partitioned AOI reads agree on boundary-coincident points. Non-rect queries, rects disjoint from conf_bounds, and rects whose cells are all empty pass through unchanged via clone_shape. Mixed rect+circle lists work naturally. Depth chosen so 4^d * aoi_ratio ~= target_partitions, capped at min(16, max_tile_depth) before shifting. Rebuild is exception-safe: replacement Shapes built off-side as unique_ptrs; raw-pointer staging vector reserved before any catalog mutation; commit phase is allocation-free and noexcept.
A rect AOI under concurrent-files with ncpu=4 produces identical summary results to the serial read (exact equality, no tolerance); the parallel write_las output (per-chunk files via glob template) sums to the same canonical point count as the serial summarise. Together these pin the load-bearing claim of the AOI partitioning work: ownership contract + clamped owner bounds eliminate boundary duplication. The serial write_las(single-file) path uses different keep_buffer defaults so cannot serve as a comparator here — the canonical AOI count comes from serial summarise.
… return Move the per-query owner-bounds assignment past the chunk.clear() that runs when no files overlap the AOI. Previously the values were set, then immediately wiped — harmless because the engine skips empty chunks via chunk.is_empty(), but a dead store that confused readers. Adds doc comments to Chunk::catalog_xmax/ymax (and the mirror fields on LASReptreader) clarifying that under AOI partitioning these hold the AOI's clamped outer bbox rather than the catalog's, which is what makes partition_ept and un-partitioned AOI reads agree on boundary-coincident points.
Per the lasR worktree convention, each worktree installs to its own .Rlib via 'R CMD INSTALL -l .Rlib .' to avoid clobbering sibling worktrees that may be testing different commits in parallel.
If new_q.emplace_back needs to reallocate and throws bad_alloc, a raw Shape* returned from clone_shape would leak. Returning unique_ptr makes the temporary self-cleaning if the push fails — matches the design goal of strong exception guarantee in the partition rebuild and the pattern already used for the replacement-path's freshly-allocated Rectangles.
Each add_query overload pushes to four parallel vectors in sequence (queries, queries_strict_clip, queries_owner_xmax, queries_owner_ymax). If the second push_back threw bad_alloc, queries would already have size N+1 while the side vectors had size N — a subsequent get_chunk_with_query(i=N) would read past queries_strict_clip and the two NaN-tracking vectors. Fix: reserve capacity in all four vectors before any push_back. If a reserve throws, the catalog state is unchanged and the unique_ptr holding the freshly-allocated Shape destroys it cleanly. After the reserves succeed, the push_backs are guaranteed allocation-free, so the four-element invariant holds atomically. Also consolidates the 5-arg rect overload to delegate into the 7-arg form rather than duplicating the push sequence.
The dense vector<int64_t> cell_points map sized 2^d × 2^d could reach ~32 GiB at d=16 — reachable from a sub-meter AOI in a deep EPT, since the AOI-area-weighted depth formula can pick d much larger than the no-queries path's env-clamped cap. Two changes, scoped to the with-queries path only (the no-queries path's d is bounded by LASR_EPT_PARTITIONS ≤ 4096 at the engine gate): 1. Float-safe depth selection: compare the desired depth against the integer cap in double-space before casting. log2 of a denormal ratio could overflow to inf, and (int)inf is UB. 2. Replace the dense int64_t grid with std::unordered_set<uint64_t> keyed by cy * grid_n + cx. Walk tiles only against the union of AOI cell ranges — shallow-tile splatting is pre-clipped to that union, bounding both memory and work to O(AOI footprint). Functionally identical to the previous emission: the dense map's values were only checked against zero, so a sparse set carries the same information.
Streaming-mode buffer classification called inside_buffer(xmin, ymin, ymax, ymax, ...) — the third argument should be xmax. Before the summary/writelas ownership-contract fix this was harmless because downstream stages ignored the flag and recomputed the geometry themselves. After the contract fix they honor the flag, so the buggy classification leaked into summary npoints and write_las output. The non-streaming in-memory path at readlas.cpp:105 and readpcd.cpp:116 already passed the correct bbox — only the streaming path was wrong.
Without these defaults, every /vsicurl/ tile open does sibling-directory HEAD probing (≈10× more HTTP requests than the actual range fetch), the VSI byte cache is off so overlapping AOI sub-queries re-fetch the same bytes, and HTTP/1.1 serialises range requests per host. The net effect on concurrent_files(4) against a real remote EPT (autzen-classified on entwine.io, 500 m × 500 m AOI, 142k points, cold cache, fresh subprocess per run, median of 3): before: serial 2.7 s warm / CF(4) 43 s — partition was 16× SLOWER after: serial 20 s cold / CF(4) 12 s — partition is 1.6× FASTER We set four config options once at add_ept_endpoint: GDAL_DISABLE_READDIR_ON_OPEN = EMPTY_DIR GDAL_HTTP_MULTIPLEX = YES VSI_CACHE = TRUE VSI_CACHE_SIZE = 67108864 # 64 MiB CPLSetConfigOption reads through to env on get(), so any user override (Sys.setenv or shell) wins — we only set when unset. Scope is process-wide for subsequent VSI ops, which matches how lasR uses GDAL.
…rmance FileCollection: suppress LASlib size warnings during EPT reads
If std::async throws std::system_error (e.g. process thread-limit hit under heavy outer parallelism), the original code had popped the next tile key off tile_queue *before* the throw — losing the tile from the read. open_next_tile would then return successfully (current_tile was already installed) but skip data on its next call. Wrap the launch in try/catch (std::system_error); on failure push the key back to the front of tile_queue and leave next_tile_future invalid. The consumer ignores invalid futures, so the next open_next_tile call falls through to the synchronous open path.
The previous patch suppressed LASlib's per-open VSIStatL HEAD warning via a process-wide LASR_SKIP_REMOTE_SIZE_WARN env var set during EPT add_endpoint. That env var is never cleared, so any unrelated remote LAS read in the same R session also lost the >500 MB size warning — a real footgun for users who read EPT first and then a big standalone remote .laz. Replace with a path check: LASlib skips the size HEAD only when the path matches the EPT convention (/ept-data/). The warning stays visible for every other remote read. The FileCollection setenv block and cstdlib include are no longer needed and are removed.
callback.cpp:348 was the last remaining filter-decision site that recomputed buffer status from geometry alone (summary.cpp/writelas.cpp were fixed in 0c22b43). Under strict-clip mode, the EPT reader flags interior-edge points as BUFFERED; inside_buffer() would call them CORE on the inclusive xmax/ymax edge. The drop_buffer flag in callback's "Update the LAS" loop therefore disagreed with the reader on partition midlines. Mirror the summary/writelas pattern: short-circuit on get_buffered() before the geometric recompute. Non-EPT readers set the flag from inside_buffer themselves, so they still behave the same. Adds cpp_callback_buffer_decide test helper and 5 regression assertions modelled on the existing summary/writelas pins, including the motivating boundary-coincident case.
The previous heuristic computed depth from the AOI / cube area ratio: d = ceil(0.5 * log2(target * cube_area / aoi_area)) That's correct for square AOIs but undercounts long-thin AOIs. A 280 m × 5 m strip in a 285 m × 285 m cube has ~0.5 % area, so the formula picked a low d even though the AOI already spans the cube in x — the partition would yield 1 chunk when 2+ were clearly meaningful. Replace with monotonic iteration over depth: for each candidate d, sum the integer cells covered by every partitionable AOI bbox; pick the smallest d whose total meets target_partitions, capped at min(16, max_tile_depth). At most 17 iterations per partition call, each O(num AOIs). Adds two regression tests: a long-thin AOI on the local fixture must yield >1 sub-chunk; the same thin AOI under concurrent_files(4) must return the same npoints and z_histogram as the serial read.
Replace four parallel std::vectors
std::vector<Shape*> queries;
std::vector<bool> queries_strict_clip;
std::vector<double> queries_owner_xmax;
std::vector<double> queries_owner_ymax;
with one
struct QueryRecord {
std::unique_ptr<Shape> shape;
bool strict_clip = false;
double owner_xmax = std::nan("");
double owner_ymax = std::nan("");
};
std::vector<QueryRecord> queries;
Pure refactor — no behavior change. Wins:
- The four-way size invariant that previously had to be hand-maintained
at every push (with reserve-then-push gymnastics to defend against
bad_alloc desync) is now structural: a record either exists in full
or doesn't.
- add_query overloads drop the reserve-and-push boilerplate; they just
build a QueryRecord and push_back.
- partition_ept's rebuild collapses from four parallel std::vectors
staged off-side + a final four-way swap into one std::vector<QueryRecord>
and a single swap. The atomic-commit property still holds.
- clear() and ~FileCollection() shed manual delete loops — unique_ptr
in the QueryRecord destructor handles it.
Adds one new test (118 ept tests all pass, plus remote/summarise/
writelas/callback suites verified).
The previous review-pass review noted the new thin-AOI test passed vacuously on the local fixture (max_tile_depth=1 capped both old and new depth formulas to d=1, so no end-to-end test could discriminate the new cell-count iteration from the old area-ratio formula). Extract the depth-selection logic into a static helper FileCollection::ept_pick_depth_for_aois(cube_xmin, cube_ymin, cube_size, max_tile_depth, target_partitions, aois) so the predicate is testable without standing up an EPT endpoint. partition_ept now delegates to this helper; behavior is bit-identical. Expose via cpp_ept_pick_depth in Rbinding.cpp. Adds a test that exercises depths up to 8 (single square AOI, long-thin AOI in motivating case, shallow-cap clamping, multi-AOI cap behavior, degenerate inputs).
The callback parity test added in b5bc2da covered the same four xmax edge / flag-vs-geometry rows the summary pin test starts with, but omitted the ymax-edge and circular-footprint rows the summary pin test also exercises. With inside_buffer() shared across summary, writelas, and callback, the omission was cosmetic — but parity makes the three pins read symmetrically and prevents future drift if any of them diverges in semantics.
partition_ept calls ensure_tiles() before reading the tile map; on a huge EPT (e.g. USGS Sierra Nevada, 80B points, hierarchy depth ~16) the unfiltered BFS over all sub-pages dominates startup time and can stall the engine indefinitely. The auto-partition gate at execute.cpp fires for any parallelizable EPT pipeline, so concurrent_files=N silently became unusable on large remote archives despite the AOI being known. Even though partition_ept already has the AOI rectangles in queries[], it walked the entire octree first and only filtered afterward. Make ensure_tiles take an optional list of AOI bboxes and prune subtrees whose node bbox doesn't intersect any of them. partition_ept's with-queries path now extracts clamped AOI bboxes from queries and passes them in. The no-queries path keeps the cached full-walk behavior unchanged. Pruned walks are deliberately uncached (tiles_built stays false) so a later full-walk request still enumerates everything. tiles / total_points are cleared at the top of each walk to avoid append-on-stale-partial when the BFS runs more than once. Measured on Sierra Nevada EPT, 1k-acre AOI, concurrent_files=16: total pipeline drops 459.98s -> 90.21s (5.1x). The 10k-acre AOI that previously OOM-killed at peak 40GB anon-rss in single-chunk mode now completes in 511s with peak RSS well under the ceiling (per-chunk working set ~1/64th of monolithic).
Benchmarks on USGS 3DEP AOIs showed 4×worker targets create too many small chunks at high core counts; a fixed default keeps chunk count stable while LASR_EPT_PARTITIONS still overrides when set.
push_back used critical(queue_mutex) while the drain (print_queue: iterate + clear) used critical(Rprint) -- two different locks on one global std::vector, so a worker thread's push could run concurrently with the main thread's clear/iterate and corrupt the heap. Surfaced nondeterministically (segfault, double free, malloc tcache abort) on large concurrent_files EPT reads, e.g. cf16 at ~1.1B points. Use a single lock for both push and drain, and gate the drain on the real main thread id (std::this_thread::get_id captured at load) rather than omp_get_thread_num(), which returns 0 for std::async prefetch threads outside any OpenMP team and would otherwise let them call R's C API off the main thread.
7a4b750 to
a6f40d2
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Parallel EPT acquisition
Make
concurrent_filesactually parallelize EPT reads (today: 1 endpoint = 1 chunk → falls back to serial), and make it work inside user AOIs.Three layers:
partition_eptbuilds an octree-aligned grid clipped toconf_bounds, emits one strict-clipped sub-query per non-empty cell. Gated toEPTFILE + parallelizable + !use_rcapi + ncpu_outer > 1. Hierarchy walked once at catalog build, shared with every chunk's reader.owner_xmax/ymaxso the strict-clip carve-out fires at the AOI edge → partitioned and un-partitioned AOI reads agree on boundary-coincident points. Non-rect queries pass through.GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR,GDAL_HTTP_MULTIPLEX=YES,VSI_CACHE=TRUE(64 MiB) viaCPLSetConfigOption(user env wins). Skips LASlib's per-tileVSIStatLHEAD for/ept-data/paths. Parallel hierarchy walk (capped 8 concurrent) and intra-chunk tile prefetch (std::async, best-effort). Workers never touch R's C API — warnings flow back via result structs.Also a load-bearing fix:
summary.cppandwritelas.cppnow honorPoint::get_buffered()before geometricinside_buffer— without this, strict-clip silently double-counts points on partition midlines. The fix exposed a pre-existinginside_buffer(xmin, ymin, ymax, ymax, ...)typo inreadlas.cpp/readpcd.cpp, also corrected.Tests
test-ept.R: end-to-end parallel correctness, partition shape, tiny/outside/circle/mixed AOIs, ownership-contract regression, hierarchy-failure passthrough, strict-clip boundary policy.test-remote.R: remote EPT correctness viahttpuv::startServer + staticPaths.Benchmark
On a real remote EPT — USGS 3DEP
AZ_BrawleyRillito_TL_2018(2.0 B points, native EPSG:3857) — a 10,000-acre AOI (58,636,837 points returned) is now 1.1×–1.6× faster than PDAL at every concurrency level tested.s3-us-west-2.amazonaws.com/usgs-lidar-public/AZ_BrawleyRillito_TL_2018/ept.jsonboundsConformingreader_rectangles(AOI) + summarise()(lasR) andreaders.ept + filters.stats(PDAL 2.10.1)lasR
Outer-parallelism efficiency drops past
concurrent_files(4)because the AWS S3 outbound pipe to this host saturates at ~4 concurrent fetches — PDAL hits the same wall (see below).PDAL (same AOI, same endpoint)
requests=1requests=1requests=2requests=4requests=8requests=16requests=1(4×4 grid)The 16-process PDAL variant returns 58,636,867 points — 30 extra. Each subprocess clips to its own sub-AOI without lasR's strict-clip ownership protocol, so a handful of grid-seam points get double-counted.
Head-to-head