We have a mess in the DMR++ Array code

From @jgallagher59701, copied out of NGAP ticket:

The first issue was the root cause of the bug we found on test.o.o where some accesses just spun forever. Look at and plan to fix the remaining issues. [HYRAX-2130](https://bugs.earthdata.nasa.gov/browse/HYRAX-2130) addresses the first problem.

Note: The following analysis is not perfect, but it sheds light on some long standing problems. This is from Codex using GPT 5.5 set to 'Extra High' (FWIW):

I did an analysis-only pass. No files changed, no configure/build/test run. I used the current tree under `/Users/jimg/src/opendap/hyrax/bes`; no dependency paths or configure flags were involved.

Your suspicion is plausible, but the biggest issue may be adjacent to the decompressor rather than inside zlib.

*Main Findings*

1. `DMRPP.MaxComputeThreads` is read as a boolean, not an integer.
In [DmrppRequestHandler.cc](/Users/jimg/src/opendap/hyrax/bes/modules/dmrpp_module/DmrppRequestHandler.cc:161), `d_max_compute_threads` uses `TheBESKeys::read_bool_key()`. That means:

unset default `8` becomes `true`, then `1`
configured `DMRPP.MaxComputeThreads = 8` becomes `false`, then `0`
with compute threads enabled and max set to `0`, `process_chunks_*_concurrent()` can spin forever because no future is started and the chunk queue never drains.

This is the most concrete bug I found. It can also hide real parallel decompression bugs because the default effective parallelism is probably only one worker.

2. `Chunk` read/decompress state is mutable and shared through copied arrays.
[DmrppArray](/Users/jimg/src/opendap/hyrax/bes/modules/dmrpp_module/DmrppArray.h:201) and [DmrppCommon](/Users/jimg/src/opendap/hyrax/bes/modules/dmrpp_module/DmrppCommon.h:168) use default copy behavior, while `DmrppCommon` stores `std::shared_ptr<Chunk>` in [d_chunks](/Users/jimg/src/opendap/hyrax/bes/modules/dmrpp_module/DmrppCommon.h:107). `SuperChunk` and `Chunk::filter_chunk()` mutate those `Chunk` objects during reads.

So if duplicated/cached DMR++ arrays are read concurrently, two requests can race on the same `Chunk`’s `d_is_read`, `d_is_inflated`, `d_read_buffer`, and ownership flag. That is a real concurrent decompression hazard.

3. `SuperChunk` leaves child chunks pointing into ephemeral aggregate storage.
`SuperChunk::map_chunks_to_buffer()` maps each child chunk into `d_read_buffer` at [SuperChunk.cc](/Users/jimg/src/opendap/hyrax/bes/modules/dmrpp_module/SuperChunk.cc:619). Later `retrieve_data()` marks each child chunk read at [SuperChunk.cc](/Users/jimg/src/opendap/hyrax/bes/modules/dmrpp_module/SuperChunk.cc:758). But `SuperChunk` frees `d_read_buffer` in its destructor [SuperChunk.h](/Users/jimg/src/opendap/hyrax/bes/modules/dmrpp_module/SuperChunk.h:72).

For deflate/shuffle chunks, `filter_chunk()` usually replaces the buffer with owned decompressed memory. For unfiltered chunks, and likely fletcher-only chunks, the child chunk can be left with `d_is_read=true` and a dangling non-owned buffer after the `SuperChunk` dies. That becomes dangerous with shallow-copied chunks or any later re-read path.

4. Exception handling corrupts global compute-thread accounting.
`get_next_future()` clears all futures and sets the passed counter to zero on any exception at [DmrppArray.cc](/Users/jimg/src/opendap/hyrax/bes/modules/dmrpp_module/DmrppArray.cc:123). That counter is the global `chunk_processing_thread_counter` from [SuperChunk.cc](/Users/jimg/src/opendap/hyrax/bes/modules/dmrpp_module/SuperChunk.cc:51). One failed request can therefore reset accounting for unrelated in-flight requests.

5. Inside one normal, uncopied array read, the per-chunk decompression path mostly looks structurally safe.
Each worker gets one chunk, zlib state is local, the recursive address vectors are local, and destination writes should be disjoint for well-formed chunk metadata. So I would not start by blaming zlib or `memcpy` in the single-request path.

*Recommended Fix Order*

1. Fix `DMRPP.MaxComputeThreads` to use `read_ulong_key()` and reject/coerce zero when compute threads are enabled.
2. Stop sharing mutable `Chunk` read/decompression state across `DmrppArray` copies. Deep-copy chunks or move per-read buffers/state out of `Chunk`.
3. Add cleanup/reset after `SuperChunk` processing so child chunks do not retain pointers into freed aggregate buffers.
4. Replace the global counter mutation with an RAII-style scoped token/semaphore; never reset the global counter to zero on one future’s exception.
5. Add focused tests for numeric `DMRPP.MaxComputeThreads`, duplicated-array concurrent reads, and unfiltered SuperChunk reuse.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

We have a mess in the DMR++ Array code #1352

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

We have a mess in the DMR++ Array code #1352

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions