Skip to content

Enhance EPT reading with concurrency and other performance fixes#313

Open
agrigoriev wants to merge 50 commits into
develfrom
parallel-ept-acquisition
Open

Enhance EPT reading with concurrency and other performance fixes#313
agrigoriev wants to merge 50 commits into
develfrom
parallel-ept-acquisition

Conversation

@agrigoriev

@agrigoriev agrigoriev commented May 12, 2026

Copy link
Copy Markdown
Collaborator

Parallel EPT acquisition

Make concurrent_files actually parallelize EPT reads (today: 1 endpoint = 1 chunk → falls back to serial), and make it work inside user AOIs.

Three layers:

  1. Auto-partition. partition_ept builds an octree-aligned grid clipped to conf_bounds, emits one strict-clipped sub-query per non-empty cell. Gated to EPTFILE + parallelizable + !use_rcapi + ncpu_outer > 1. Hierarchy walked once at catalog build, shared with every chunk's reader.
  2. AOI partition. Rectangle AOIs partition octree-aligned inside the AOI bbox. Sub-queries inherit the AOI's outer bbox as owner_xmax/ymax so the strict-clip carve-out fires at the AOI edge → partitioned and un-partitioned AOI reads agree on boundary-coincident points. Non-rect queries pass through.
  3. Remote-fetch tuning. Bakes GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR, GDAL_HTTP_MULTIPLEX=YES, VSI_CACHE=TRUE (64 MiB) via CPLSetConfigOption (user env wins). Skips LASlib's per-tile VSIStatL HEAD for /ept-data/ paths. Parallel hierarchy walk (capped 8 concurrent) and intra-chunk tile prefetch (std::async, best-effort). Workers never touch R's C API — warnings flow back via result structs.

Also a load-bearing fix: summary.cpp and writelas.cpp now honor Point::get_buffered() before geometric inside_buffer — without this, strict-clip silently double-counts points on partition midlines. The fix exposed a pre-existing inside_buffer(xmin, ymin, ymax, ymax, ...) typo in readlas.cpp/readpcd.cpp, also corrected.

Tests

  • test-ept.R: end-to-end parallel correctness, partition shape, tiny/outside/circle/mixed AOIs, ownership-contract regression, hierarchy-failure passthrough, strict-clip boundary policy.
  • test-remote.R: remote EPT correctness via httpuv::startServer + staticPaths.

Benchmark

On a real remote EPT — USGS 3DEP AZ_BrawleyRillito_TL_2018 (2.0 B points, native EPSG:3857) — a 10,000-acre AOI (58,636,837 points returned) is now 1.1×–1.6× faster than PDAL at every concurrency level tested.

  • Endpoint: s3-us-west-2.amazonaws.com/usgs-lidar-public/AZ_BrawleyRillito_TL_2018/ept.json
  • AOI: 10,000 acres (40.5 km², 6,361 m × 6,361 m), inside boundsConforming
  • Pipeline: reader_rectangles(AOI) + summarise() (lasR) and readers.ept + filters.stats (PDAL 2.10.1)
  • Host: 16 vCPUs
  • Returned points: 58,636,837 in every run — identical across lasR and PDAL single-process modes (correctness gate)

lasR

strategy wall (s) speedup parallel eff. Mpts/s
sequential 65.6 1.00× 100.0% 0.89
concurrent_files(2) 48.4 1.35× 67.7% 1.21
concurrent_files(4) 33.0 1.99× 49.7% 1.78
concurrent_files(8) 31.0 2.12× 26.4% 1.89
concurrent_files(16) 27.5 2.38× 14.9% 2.13

Outer-parallelism efficiency drops past concurrent_files(4) because the AWS S3 outbound pipe to this host saturates at ~4 concurrent fetches — PDAL hits the same wall (see below).

PDAL (same AOI, same endpoint)

mode wall (s) speedup vs requests=1
requests=1 104.1 1.00×
requests=2 53.0 1.96×
requests=4 44.9 2.32×
requests=8 45.4 2.29×
requests=16 45.1 2.31×
16 procs × requests=1 (4×4 grid) 42.7 2.44×

The 16-process PDAL variant returns 58,636,867 points — 30 extra. Each subprocess clips to its own sub-AOI without lasR's strict-clip ownership protocol, so a handful of grid-seam points get double-counted.

Head-to-head

concurrency PDAL lasR lasR vs PDAL
1 104 s 66 s 1.6× faster
2 53 s 48 s 1.1× faster
4 45 s 33 s 1.4× faster
8 45 s 31 s 1.5× faster
16 43 s 27 s 1.6× faster

agrigoriev added 30 commits May 8, 2026 09:17
The originally-planned counting variant (custom httpuv app$call handler)
deadlocks because httpuv R callbacks share the R event loop with the
synchronous lasR exec, so the test must use the C-native staticPaths
path. Metadata cache invariants are covered at the C++ layer in
test-ept.R via cpp_ept_partition_inspect; this test exercises the
remote parallel path end-to-end.
The previous guard ran inside the loop body (`if (d >= 16) break;`),
which still evaluated the shift `1 << 32` once when d hit 16. Move the
cap into the while condition so the shift is only evaluated for d < 16.
Unreachable from the execute.cpp path (clamp keeps d <= 6) but reachable
from cpp_ept_partition_inspect with a large target_partitions.
Lift the per-point ownership decision out of the two process() loops
into a static helper LASReptreader::strict_clip_decide that returns
DROP/CORE/BUFFERED. The two loops now call it instead of duplicating
the predicate inline.

Expose the predicate via cpp_strict_clip_decide in the test RCPP module
so R-level unit tests can exercise edge cases that integration data
rarely hits — exact partition midlines, the global-max-inclusive
carve-out, half-open xmin inclusion, zero-buffer collapse. Closes the
spec's test 5 ("strict-clip edge boundary policy") that Task 13's
plan dropped.
Strict-clip flags interior-boundary points BUFFERED, but summary.cpp and
writelas.cpp recomputed inside_buffer() geometrically — reaching the
opposite verdict at the boundary and double-counting points across
adjacent partition chunks. Short-circuit on get_buffered() before the
geometric fallback so the reader's flag wins.

Adds cpp_summary_buffer_decide / cpp_writelas_buffer_decide unit-test
helpers pinning the contract.
Adds Y-axis and circular assertions to the cpp_summary_buffer_decide /
cpp_writelas_buffer_decide blocks, plus the case the production fix
actually targets: flag=TRUE with px=xmax (non-global boundary), where
the reader marks the point BUFFERED but inside_buffer returns false.
Adds queries_owner_xmax/ymax parallel vectors and a new rect add_query
overload carrying the owner-max pair. get_chunk_with_query honors the
override when set; NaN falls through to the catalog's xmax/ymax — no
behavior change for existing callers.

The semantic of chunk.catalog_xmax/ymax is now 'outer boundary beyond
which the chunk includes the rightmost edge inclusively.' For
full-extent reads, that's the catalog. For AOI sub-queries (next
commit), it's the AOI's clamped outer bbox.
Preserves the parallel-vector size invariant when callers reset the
catalog state before re-installing queries. Latent today (no caller
exercises this path), but Task 6's partition_ept rebuild path will.
Extends the inspector to install user queries before invoking
partition_ept and reports per-chunk shape_type, owner_xmax, owner_ymax
alongside the existing bbox/strict_clip output. Lets the upcoming AOI
partition tests assert circle passthrough, owner-bounds propagation,
and mixed rect+circle behavior directly.
Lifts the queries.size() > 0 short-circuit. With-queries entry now
splits into a hierarchy-failure passthrough and a partitioning tail
(stubbed in this commit; filled in by the next commits). Adds a
clone_shape free helper used by the upcoming rebuild.

No behavior change for the no-queries path. Existing AOI reads still
behave as before (single chunk, no partitioning); two regression tests
pin circle and out-of-extent passthrough.
* Inspector now reads shape_type from queries[i]->type() (via new
  FileCollection::get_query_type accessor) rather than chunk.shape,
  which is zeroed by get_chunk_with_query for out-of-extent queries.
  The out-of-extent rect passthrough test now actually pins shape
  preservation; previously a destroyed Shape would have silently
  collapsed to "rect" via the UNKNOWN→default branch.
* clone_shape moved out of anonymous namespace to bare static,
  matching the file's existing file-local helper style
  (is_remote_path, filename_from_url).
* Restore the d<16 cap comment that explains the UB-prevention
  intent of commit 8919501.
Pins the with-queries fallback: a fixture where build_metadata succeeds
(root readable, probe tile openable) but ensure_tiles throws (a
non-root sub-page contains malformed JSON) must leave the user's AOI
intact and emit a warning, never clobber it via set_chunk_size.
The existing ept-test-multi fixture's root hierarchy contains only
positive-count leaves, so the skip_if guard fired and the test never
ran. Patch the copied root to inject a -1 sub-page pointer ourselves,
then write the malformed JSON to that path. build_metadata returns
on the first existing leaf probe; ensure_tiles walks the whole
hierarchy and now genuinely throws on the malformed sub-page.
* on.exit cleanup so the broken fixture doesn't leak across runs.
* stopifnot(all(file.copy(...))) so a partial fixture copy errors
  loudly rather than producing a green test via an unrelated
  hierarchy warning.
* Tighten the expect_match regex to bind specifically to the
  partition_ept with-queries fallback warning rather than any line
  that happens to contain "hierarchy".
For each rectangular AOI, emit one strict-clipped sub-query per
non-empty grid cell that overlaps (cell ∩ query ∩ conf_bounds). Owner
bounds (catalog_xmax/ymax) propagate from the CLAMPED aoi_bbox so the
strict-clip rightmost-edge carve-out fires at the AOI's outer edge,
not the catalog's — partitioned and un-partitioned AOI reads agree on
boundary-coincident points.

Non-rect queries, rects disjoint from conf_bounds, and rects whose
cells are all empty pass through unchanged via clone_shape. Mixed
rect+circle lists work naturally. Depth chosen so 4^d * aoi_ratio ~=
target_partitions, capped at min(16, max_tile_depth) before shifting.

Rebuild is exception-safe: replacement Shapes built off-side as
unique_ptrs; raw-pointer staging vector reserved before any catalog
mutation; commit phase is allocation-free and noexcept.
A rect AOI under concurrent-files with ncpu=4 produces identical
summary results to the serial read (exact equality, no tolerance);
the parallel write_las output (per-chunk files via glob template)
sums to the same canonical point count as the serial summarise.

Together these pin the load-bearing claim of the AOI partitioning
work: ownership contract + clamped owner bounds eliminate boundary
duplication. The serial write_las(single-file) path uses different
keep_buffer defaults so cannot serve as a comparator here — the
canonical AOI count comes from serial summarise.
… return

Move the per-query owner-bounds assignment past the chunk.clear() that
runs when no files overlap the AOI. Previously the values were set,
then immediately wiped — harmless because the engine skips empty
chunks via chunk.is_empty(), but a dead store that confused readers.

Adds doc comments to Chunk::catalog_xmax/ymax (and the mirror fields
on LASReptreader) clarifying that under AOI partitioning these hold
the AOI's clamped outer bbox rather than the catalog's, which is what
makes partition_ept and un-partitioned AOI reads agree on
boundary-coincident points.
agrigoriev and others added 17 commits May 11, 2026 13:33
Per the lasR worktree convention, each worktree installs to its own
.Rlib via 'R CMD INSTALL -l .Rlib .' to avoid clobbering sibling
worktrees that may be testing different commits in parallel.
If new_q.emplace_back needs to reallocate and throws bad_alloc, a raw
Shape* returned from clone_shape would leak. Returning unique_ptr makes
the temporary self-cleaning if the push fails — matches the design
goal of strong exception guarantee in the partition rebuild and the
pattern already used for the replacement-path's freshly-allocated
Rectangles.
Each add_query overload pushes to four parallel vectors in sequence
(queries, queries_strict_clip, queries_owner_xmax, queries_owner_ymax).
If the second push_back threw bad_alloc, queries would already have
size N+1 while the side vectors had size N — a subsequent
get_chunk_with_query(i=N) would read past queries_strict_clip and the
two NaN-tracking vectors.

Fix: reserve capacity in all four vectors before any push_back. If a
reserve throws, the catalog state is unchanged and the unique_ptr
holding the freshly-allocated Shape destroys it cleanly. After the
reserves succeed, the push_backs are guaranteed allocation-free, so
the four-element invariant holds atomically.

Also consolidates the 5-arg rect overload to delegate into the 7-arg
form rather than duplicating the push sequence.
The dense vector<int64_t> cell_points map sized 2^d × 2^d could reach
~32 GiB at d=16 — reachable from a sub-meter AOI in a deep EPT, since
the AOI-area-weighted depth formula can pick d much larger than the
no-queries path's env-clamped cap.

Two changes, scoped to the with-queries path only (the no-queries
path's d is bounded by LASR_EPT_PARTITIONS ≤ 4096 at the engine gate):

1. Float-safe depth selection: compare the desired depth against the
   integer cap in double-space before casting. log2 of a denormal
   ratio could overflow to inf, and (int)inf is UB.

2. Replace the dense int64_t grid with std::unordered_set<uint64_t>
   keyed by cy * grid_n + cx. Walk tiles only against the union of
   AOI cell ranges — shallow-tile splatting is pre-clipped to that
   union, bounding both memory and work to O(AOI footprint).

Functionally identical to the previous emission: the dense map's
values were only checked against zero, so a sparse set carries the
same information.
Streaming-mode buffer classification called inside_buffer(xmin, ymin,
ymax, ymax, ...) — the third argument should be xmax. Before the
summary/writelas ownership-contract fix this was harmless because
downstream stages ignored the flag and recomputed the geometry
themselves. After the contract fix they honor the flag, so the buggy
classification leaked into summary npoints and write_las output.

The non-streaming in-memory path at readlas.cpp:105 and readpcd.cpp:116
already passed the correct bbox — only the streaming path was wrong.
Without these defaults, every /vsicurl/ tile open does sibling-directory
HEAD probing (≈10× more HTTP requests than the actual range fetch), the
VSI byte cache is off so overlapping AOI sub-queries re-fetch the same
bytes, and HTTP/1.1 serialises range requests per host. The net effect
on concurrent_files(4) against a real remote EPT (autzen-classified on
entwine.io, 500 m × 500 m AOI, 142k points, cold cache, fresh subprocess
per run, median of 3):

  before:  serial 2.7 s warm / CF(4) 43 s — partition was 16× SLOWER
  after:   serial 20  s cold / CF(4) 12 s — partition is 1.6× FASTER

We set four config options once at add_ept_endpoint:

  GDAL_DISABLE_READDIR_ON_OPEN = EMPTY_DIR
  GDAL_HTTP_MULTIPLEX          = YES
  VSI_CACHE                    = TRUE
  VSI_CACHE_SIZE               = 67108864   # 64 MiB

CPLSetConfigOption reads through to env on get(), so any user override
(Sys.setenv or shell) wins — we only set when unset. Scope is
process-wide for subsequent VSI ops, which matches how lasR uses GDAL.
…rmance

FileCollection: suppress LASlib size warnings during EPT reads
If std::async throws std::system_error (e.g. process thread-limit hit
under heavy outer parallelism), the original code had popped the next
tile key off tile_queue *before* the throw — losing the tile from the
read. open_next_tile would then return successfully (current_tile was
already installed) but skip data on its next call.

Wrap the launch in try/catch (std::system_error); on failure push the
key back to the front of tile_queue and leave next_tile_future invalid.
The consumer ignores invalid futures, so the next open_next_tile call
falls through to the synchronous open path.
The previous patch suppressed LASlib's per-open VSIStatL HEAD warning
via a process-wide LASR_SKIP_REMOTE_SIZE_WARN env var set during EPT
add_endpoint. That env var is never cleared, so any unrelated remote
LAS read in the same R session also lost the >500 MB size warning —
a real footgun for users who read EPT first and then a big standalone
remote .laz.

Replace with a path check: LASlib skips the size HEAD only when the
path matches the EPT convention (/ept-data/). The warning stays
visible for every other remote read. The FileCollection setenv block
and cstdlib include are no longer needed and are removed.
callback.cpp:348 was the last remaining filter-decision site that
recomputed buffer status from geometry alone (summary.cpp/writelas.cpp
were fixed in 0c22b43). Under strict-clip mode, the EPT reader flags
interior-edge points as BUFFERED; inside_buffer() would call them CORE
on the inclusive xmax/ymax edge. The drop_buffer flag in callback's
"Update the LAS" loop therefore disagreed with the reader on
partition midlines.

Mirror the summary/writelas pattern: short-circuit on get_buffered()
before the geometric recompute. Non-EPT readers set the flag from
inside_buffer themselves, so they still behave the same.

Adds cpp_callback_buffer_decide test helper and 5 regression
assertions modelled on the existing summary/writelas pins, including
the motivating boundary-coincident case.
The previous heuristic computed depth from the AOI / cube area ratio:
  d = ceil(0.5 * log2(target * cube_area / aoi_area))

That's correct for square AOIs but undercounts long-thin AOIs. A
280 m × 5 m strip in a 285 m × 285 m cube has ~0.5 % area, so the
formula picked a low d even though the AOI already spans the cube in
x — the partition would yield 1 chunk when 2+ were clearly meaningful.

Replace with monotonic iteration over depth: for each candidate d,
sum the integer cells covered by every partitionable AOI bbox; pick
the smallest d whose total meets target_partitions, capped at
min(16, max_tile_depth). At most 17 iterations per partition call,
each O(num AOIs).

Adds two regression tests: a long-thin AOI on the local fixture must
yield >1 sub-chunk; the same thin AOI under concurrent_files(4) must
return the same npoints and z_histogram as the serial read.
Replace four parallel std::vectors

  std::vector<Shape*>  queries;
  std::vector<bool>    queries_strict_clip;
  std::vector<double>  queries_owner_xmax;
  std::vector<double>  queries_owner_ymax;

with one

  struct QueryRecord {
    std::unique_ptr<Shape> shape;
    bool   strict_clip = false;
    double owner_xmax  = std::nan("");
    double owner_ymax  = std::nan("");
  };
  std::vector<QueryRecord> queries;

Pure refactor — no behavior change. Wins:

- The four-way size invariant that previously had to be hand-maintained
  at every push (with reserve-then-push gymnastics to defend against
  bad_alloc desync) is now structural: a record either exists in full
  or doesn't.
- add_query overloads drop the reserve-and-push boilerplate; they just
  build a QueryRecord and push_back.
- partition_ept's rebuild collapses from four parallel std::vectors
  staged off-side + a final four-way swap into one std::vector<QueryRecord>
  and a single swap. The atomic-commit property still holds.
- clear() and ~FileCollection() shed manual delete loops — unique_ptr
  in the QueryRecord destructor handles it.

Adds one new test (118 ept tests all pass, plus remote/summarise/
writelas/callback suites verified).
The previous review-pass review noted the new thin-AOI test passed
vacuously on the local fixture (max_tile_depth=1 capped both old and
new depth formulas to d=1, so no end-to-end test could discriminate
the new cell-count iteration from the old area-ratio formula).

Extract the depth-selection logic into a static helper
FileCollection::ept_pick_depth_for_aois(cube_xmin, cube_ymin,
cube_size, max_tile_depth, target_partitions, aois) so the predicate
is testable without standing up an EPT endpoint. partition_ept now
delegates to this helper; behavior is bit-identical.

Expose via cpp_ept_pick_depth in Rbinding.cpp. Adds a test that
exercises depths up to 8 (single square AOI, long-thin AOI in
motivating case, shallow-cap clamping, multi-AOI cap behavior,
degenerate inputs).
The callback parity test added in b5bc2da covered the same four xmax
edge / flag-vs-geometry rows the summary pin test starts with, but
omitted the ymax-edge and circular-footprint rows the summary pin
test also exercises. With inside_buffer() shared across summary,
writelas, and callback, the omission was cosmetic — but parity makes
the three pins read symmetrically and prevents future drift if any of
them diverges in semantics.
@agrigoriev agrigoriev requested a review from Jean-Romain May 12, 2026 17:44
@agrigoriev agrigoriev marked this pull request as ready for review May 12, 2026 17:44
@agrigoriev agrigoriev removed the request for review from Jean-Romain May 13, 2026 10:10
@agrigoriev agrigoriev marked this pull request as draft May 13, 2026 10:11
partition_ept calls ensure_tiles() before reading the tile map; on a
huge EPT (e.g. USGS Sierra Nevada, 80B points, hierarchy depth ~16) the
unfiltered BFS over all sub-pages dominates startup time and can stall
the engine indefinitely. The auto-partition gate at execute.cpp fires
for any parallelizable EPT pipeline, so concurrent_files=N silently
became unusable on large remote archives despite the AOI being known.

Even though partition_ept already has the AOI rectangles in queries[],
it walked the entire octree first and only filtered afterward.

Make ensure_tiles take an optional list of AOI bboxes and prune subtrees
whose node bbox doesn't intersect any of them. partition_ept's
with-queries path now extracts clamped AOI bboxes from queries and
passes them in. The no-queries path keeps the cached full-walk
behavior unchanged.

Pruned walks are deliberately uncached (tiles_built stays false) so a
later full-walk request still enumerates everything. tiles /
total_points are cleared at the top of each walk to avoid
append-on-stale-partial when the BFS runs more than once.

Measured on Sierra Nevada EPT, 1k-acre AOI, concurrent_files=16:
total pipeline drops 459.98s -> 90.21s (5.1x).
The 10k-acre AOI that previously OOM-killed at peak 40GB anon-rss in
single-chunk mode now completes in 511s with peak RSS well under the
ceiling (per-chunk working set ~1/64th of monolithic).
@agrigoriev agrigoriev marked this pull request as ready for review May 13, 2026 11:00
@agrigoriev agrigoriev requested a review from Jean-Romain May 13, 2026 11:00
Benchmarks on USGS 3DEP AOIs showed 4×worker targets create too many small

chunks at high core counts; a fixed default keeps chunk count stable while

LASR_EPT_PARTITIONS still overrides when set.
push_back used critical(queue_mutex) while the drain (print_queue: iterate +
clear) used critical(Rprint) -- two different locks on one global std::vector,
so a worker thread's push could run concurrently with the main thread's
clear/iterate and corrupt the heap. Surfaced nondeterministically (segfault,
double free, malloc tcache abort) on large concurrent_files EPT reads, e.g.
cf16 at ~1.1B points.

Use a single lock for both push and drain, and gate the drain on the real main
thread id (std::this_thread::get_id captured at load) rather than
omp_get_thread_num(), which returns 0 for std::async prefetch threads outside
any OpenMP team and would otherwise let them call R's C API off the main thread.
@agrigoriev agrigoriev force-pushed the parallel-ept-acquisition branch from 7a4b750 to a6f40d2 Compare May 29, 2026 08:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant