Skip to content

We have a mess in the DMR++ Array code #1352

@hannahilea

Description

@hannahilea

From @jgallagher59701, copied out of NGAP ticket:

The first issue was the root cause of the bug we found on test.o.o where some accesses just spun forever. Look at and plan to fix the remaining issues. HYRAX-2130 addresses the first problem.

Note: The following analysis is not perfect, but it sheds light on some long standing problems. This is from Codex using GPT 5.5 set to 'Extra High' (FWIW):

I did an analysis-only pass. No files changed, no configure/build/test run. I used the current tree under /Users/jimg/src/opendap/hyrax/bes; no dependency paths or configure flags were involved.

Your suspicion is plausible, but the biggest issue may be adjacent to the decompressor rather than inside zlib.

Main Findings

  1. DMRPP.MaxComputeThreads is read as a boolean, not an integer.
    In DmrppRequestHandler.cc, d_max_compute_threads uses TheBESKeys::read_bool_key(). That means:

unset default 8 becomes true, then 1
configured DMRPP.MaxComputeThreads = 8 becomes false, then 0
with compute threads enabled and max set to 0, process_chunks_*_concurrent() can spin forever because no future is started and the chunk queue never drains.

This is the most concrete bug I found. It can also hide real parallel decompression bugs because the default effective parallelism is probably only one worker.

  1. Chunk read/decompress state is mutable and shared through copied arrays.
    DmrppArray and DmrppCommon use default copy behavior, while DmrppCommon stores std::shared_ptr<Chunk> in d_chunks. SuperChunk and Chunk::filter_chunk() mutate those Chunk objects during reads.

So if duplicated/cached DMR++ arrays are read concurrently, two requests can race on the same Chunk’s d_is_read, d_is_inflated, d_read_buffer, and ownership flag. That is a real concurrent decompression hazard.

  1. SuperChunk leaves child chunks pointing into ephemeral aggregate storage.
    SuperChunk::map_chunks_to_buffer() maps each child chunk into d_read_buffer at SuperChunk.cc. Later retrieve_data() marks each child chunk read at SuperChunk.cc. But SuperChunk frees d_read_buffer in its destructor SuperChunk.h.

For deflate/shuffle chunks, filter_chunk() usually replaces the buffer with owned decompressed memory. For unfiltered chunks, and likely fletcher-only chunks, the child chunk can be left with d_is_read=true and a dangling non-owned buffer after the SuperChunk dies. That becomes dangerous with shallow-copied chunks or any later re-read path.

  1. Exception handling corrupts global compute-thread accounting.
    get_next_future() clears all futures and sets the passed counter to zero on any exception at DmrppArray.cc. That counter is the global chunk_processing_thread_counter from SuperChunk.cc. One failed request can therefore reset accounting for unrelated in-flight requests.

  2. Inside one normal, uncopied array read, the per-chunk decompression path mostly looks structurally safe.
    Each worker gets one chunk, zlib state is local, the recursive address vectors are local, and destination writes should be disjoint for well-formed chunk metadata. So I would not start by blaming zlib or memcpy in the single-request path.

Recommended Fix Order

  1. Fix DMRPP.MaxComputeThreads to use read_ulong_key() and reject/coerce zero when compute threads are enabled.
  2. Stop sharing mutable Chunk read/decompression state across DmrppArray copies. Deep-copy chunks or move per-read buffers/state out of Chunk.
  3. Add cleanup/reset after SuperChunk processing so child chunks do not retain pointers into freed aggregate buffers.
  4. Replace the global counter mutation with an RAII-style scoped token/semaphore; never reset the global counter to zero on one future’s exception.
  5. Add focused tests for numeric DMRPP.MaxComputeThreads, duplicated-array concurrent reads, and unfiltered SuperChunk reuse.

Metadata

Metadata

Labels

No labels
No labels

Type

No fields configured for Bug.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions