
Fix zarr ZipStore write failure for large trees (v0.2.6)#86

Open
colganwi wants to merge 2 commits into main from fix/zarr-zipstore-large-trees

Conversation

@colganwi
Collaborator

Summary

Fixes a TypeError: string too large to store inside array crash when writing a TreeData object with large trees to a zarr ZipStore (reported by a user with ~1.8M cells).

Root causes fixed

1. Scalar JSON string too large for zarr v3
The previous zarr serialization of obst/vart converted each tree into a single giant JSON string and stored it as a zarr scalar. zarr v3 cannot store scalar strings above a certain size, causing a hard crash for large datasets.

The new format stores each DiGraph as a zarr subgroup with columnar arrays — one 1-D array per node/edge attribute key, plus separate arrays for node IDs and edge pairs. Each array element is a small JSON-encoded string, which zarr handles without any size limit. This also handles heterogeneous and complex attribute types (list, tuple, set, np.ndarray, pd.Series, nested structures) correctly.
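
As a rough sketch of what that columnar layout implies for the writer (the helper name and exact encoding here are illustrative, not the PR's code):

import json
import networkx as nx
import numpy as np

def encode_node_attr_columns(G: nx.DiGraph) -> dict:
    # One 1-D object array per attribute key; each element is a small
    # JSON-encoded string, so no single zarr value grows with tree size.
    keys = {k for n in G.nodes for k in G.nodes[n]}
    return {
        k: np.array([json.dumps(G.nodes[n].get(k)) for n in G.nodes], dtype=object)
        for k in keys
    }

(Complex values such as np.ndarray or pd.Series would first need conversion to a JSON-serializable form, which is presumably what the PR's helper does before encoding.)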

2. Duplicate zarr.json entries in ZipStore
zarr v3's ZipStore is append-only: every attribute mutation re-writes zarr.json, producing UserWarning: Duplicate name for every group in the file. Writes to a ZipStore now go through an intermediate MemoryStore (which supports in-place overwriting), and the final state is bulk-copied to the ZipStore exactly once per key — eliminating all duplicate-entry warnings.
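
Conceptually, the buffered write path looks like this (a sketch against zarr v3's public API, not the PR's code; the one-pass copy itself is sketched in the implementation overview later in this thread):

import zarr
from zarr.storage import MemoryStore

# Phase 1: build the whole hierarchy in memory, where zarr.json can be
# overwritten freely as groups, arrays, and attributes are added.
mem = MemoryStore()
root = zarr.group(store=mem)
root.create_group("obst")
root.attrs["illustrative"] = True   # mutation rewrites zarr.json in place
# Phase 2: copy the finished state into the append-only ZipStore, once per key.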

3. ZipStore not auto-detected on read
read_zarr now auto-detects .zip paths and opens them as ZipStore automatically, so users can pass a plain string path rather than a pre-opened store object.
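
A minimal sketch of that detection (illustrative; the function name and exact calls are assumptions, not the PR's implementation):

import zarr
from zarr.storage import ZipStore

def open_zarr_auto(path: str) -> zarr.Group:
    # Treat *.zip paths as ZipStores; everything else stays a directory store.
    if str(path).endswith(".zip"):
        return zarr.open_group(ZipStore(path, mode="r"), mode="r")
    return zarr.open_group(path, mode="r")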

Backward compatibility

  • HDF5 (.h5td) format: unchanged
  • Old zarr files written with the previous scalar-string format: still read correctly (format detected at read time)
  • New zarr files use a group-based format that is not readable by treedata < 0.2.6

Performance (ZipStore path only)

cells   tree nodes   peak extra RAM   write time
5K      10K          5.6 MB           1.7 s
20K     40K          21.2 MB          4.2 s
100K    200K         105.9 MB         17.3 s
500K    1M           441 MB           78.5 s

The MemoryStore intermediate adds ~0.9 MB peak RAM per 1K cells but is actually slightly faster than writing directly to disk. The non-ZipStore zarr path (directory store) is completely unaffected.

Test plan

  • All existing h5td and zarr tests pass
  • test_zarr_dtypes — round-trips list, tuple, set, np.float64, np.ndarray, pd.Series, nested lists, str, int, float, bool
  • test_zarr_edge_attrs — complex edge attribute types survive round-trip
  • test_zarr_missing_node_attrs — nodes lacking a given attribute do not gain it as None after round-trip
  • test_zarr_zip_store — end-to-end write/read through ZipStore, zero duplicate-name warnings
  • test_zarr_large_tree — 2047-node tree with multiple node and edge attributes
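
For flavor, a condensed round-trip in the spirit of test_zarr_zip_store (a hypothetical sketch; the real tests live in the PR, and treedata's TreeData / write_zarr / read_zarr API is assumed):

import networkx as nx
import pandas as pd
import treedata as td
from zarr.storage import ZipStore

def test_zip_roundtrip(tmp_path):
    tree = nx.DiGraph([("root", "A"), ("root", "B")])
    tree.nodes["A"]["depth"] = 1.0
    tdata = td.TreeData(obs=pd.DataFrame(index=["A", "B"]), obst={"lineage": tree})
    store = ZipStore(tmp_path / "adata.zarr.zip", mode="w")
    tdata.write_zarr(store)
    store.close()
    back = td.read_zarr(tmp_path / "adata.zarr.zip")  # .zip auto-detected per this PR
    assert back.obst["lineage"].nodes["A"]["depth"] == 1.0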

🤖 Generated with Claude Code

@codecov

codecov Bot commented Apr 21, 2026

Codecov Report

❌ Patch coverage is 92.40506% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.47%. Comparing base (0091e87) to head (0566fc5).

Files with missing lines      Patch %   Lines
src/treedata/_core/write.py   91.11%    4 Missing ⚠️
src/treedata/_core/read.py    94.11%    2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #86      +/-   ##
==========================================
- Coverage   92.56%   92.47%   -0.10%     
==========================================
  Files           7        7              
  Lines         686      757      +71     
==========================================
+ Hits          635      700      +65     
- Misses         51       57       +6     
Files with missing lines      Coverage Δ
src/treedata/_core/read.py    87.50% <94.11%> (+2.27%) ⬆️
src/treedata/_core/write.py   93.79% <91.11%> (-1.76%) ⬇️


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 97a7d98e2e


Comment thread on src/treedata/_core/write.py (outdated)

_write_elem(
    tg,
    "node_attrs",
    {k: _make_attr_array([G.nodes[n].get(k) for n in G.nodes()]) for k in node_attr_keys},


P2: Preserve explicit None tree attributes in zarr writes

Using G.nodes[n].get(k) here collapses two distinct cases—missing attribute and attribute explicitly set to None—into the same stored value. Because _read_axis_trees_zarr later drops None entries (if v[i] is not None), any node/edge attribute intentionally set to None is lost after a zarr round-trip, which is a behavior regression from the previous JSON serialization path that preserved explicit None values.

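One standard way to keep the two cases apart (a sketch of the general technique, not necessarily the fix this PR will adopt) is a sentinel distinct from None:

import networkx as nx

_MISSING = object()  # marks "attribute absent", leaving None as a real value

def node_column(G: nx.DiGraph, key: str) -> list:
    # Explicit None survives; only _MISSING entries should be dropped on read.
    return [G.nodes[n].get(key, _MISSING) for n in G.nodes()]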

Comment on lines +81 to +83
g = f.require_group(key)
for name, G in trees.items():
    tg = g.require_group(str(name))


P2: Remove stale tree groups before writing axis trees

This writes into obst/vart with require_group but never removes existing child groups that are no longer present in trees. When users pass an existing zarr.Group and rewrite data, deleted trees remain on disk and are still returned by read_zarr, so the persisted tree set can diverge from the in-memory TreeData being written.

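A sketch of the suggested cleanup (hypothetical helper; in this PR's write path the deletes would happen against the intermediate MemoryStore, since ZipStore itself cannot delete entries):

import zarr

def prune_stale_trees(g: zarr.Group, trees: dict) -> None:
    # Drop child groups that no longer correspond to an in-memory tree.
    for name in set(g.group_keys()) - {str(n) for n in trees}:
        del g[name]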

@ilan-gold

ilan-gold commented Apr 21, 2026

one 1-D array per node/edge attribute key, plus separate arrays for node IDs and edge pairs. Each array element is a small JSON-encoded string, which zarr handles without any size limit. This also handles heterogeneous and complex attribute types (list, tuple, set, np.ndarray, pd.Series, nested structures) correctly.

Won't this create n_obs number of directories?

@colganwi
Collaborator Author

Won't this create n_obs number of directories?

Yes, it will. I'm not a zarr expert. Will this affect performance or file size?

@ilan-gold

Yeah, that's probably going to make a lot of file systems very unhappy for large n_obs, I think

Stores tree node/edge attributes as columnar awkward arrays (one ak.Array
per attribute key) using anndata's existing write_elem/read_elem codec.
This avoids zarr's scalar string size limit that caused TypeError on large
trees, stores typed data natively (float64, var * string, etc.), and
eliminates the duplicate zarr.json warnings from ZipStore's append-only
behavior by routing writes through an intermediate MemoryStore.

Adds awkward>=2 as a dependency.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@colganwi force-pushed the fix/zarr-zipstore-large-trees branch from 3785b2b to 8740b50 on April 21, 2026 at 22:15
@colganwi
Collaborator Author

Implementation overview

This PR fixes two distinct zarr v3 issues that surface when writing large TreeData objects to a .zarr.zip file.

1. TypeError: string too large on large trees

Root cause: The old format serialized each graph (nodes + edges + all attributes) as a single JSON scalar stored in zarr. zarr v3 has a hard limit on scalar string size, which is exceeded by trees with ~1M+ nodes.

Fix: Tree attributes are now stored in a columnar group layout — one array per attribute key — using anndata's existing awkward-array codec (write_elem / read_elem). Each attribute column is built with ak.from_iter, which infers native types (float64, int64, var * string, option types for sparse attributes, etc.) without any JSON encoding. The zarr on-disk structure for a tree named lineage now looks like:

obst/
  lineage/
    nodes          # 1D object array of node IDs
    edges          # (E, 2) object array of (src, dst) pairs
    node_attrs/
      depth        # ak.Array, int64
      chars        # ak.Array, var * string  (ragged)
      ...
    edge_attrs/
      length       # ak.Array, float64
      ...
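
For concreteness, ak.from_iter's type inference on small inputs looks like this (a quick illustration, not code from the PR):

import awkward as ak

depth = ak.from_iter([0, 1, 1, 2])         # int64
length = ak.from_iter([1.0, 0.5, None])    # ?float64 (option type for sparse attrs)
chars = ak.from_iter([["A", "B"], ["C"]])  # var * string (ragged)
print(depth.type, length.type, chars.type)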

Backward compatibility: _decode_attr_column in read.py handles both the new ak.Array columns and the old JSON-string numpy arrays, so existing files written with the previous format are still readable.
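
A sketch of that dual-format dispatch (illustrative; the PR's _decode_attr_column may differ in detail):

import json
import awkward as ak
import numpy as np

def decode_attr_column(col) -> list:
    # New files store ak.Array columns; old files store numpy object arrays
    # whose elements are JSON-encoded strings.
    if isinstance(col, ak.Array):
        return ak.to_list(col)
    return [json.loads(s) for s in np.asarray(col)]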

2. UserWarning: Duplicate name: 'zarr.json' spam

Root cause: zarr v3's ZipStore is append-only at the zip level. Every time zarr writes or updates metadata (zarr.json), it appends a new entry to the zip file rather than overwriting the previous one. A single write_zarr call triggers dozens of such updates.

Fix: _write_zarr_via_memory writes the full dataset to an in-memory MemoryStore first (which supports overwriting), then bulk-copies the final state to the ZipStore in one pass via zarr's async store.set() API. Each key is written exactly once.
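
In sketch form, using zarr v3's public async store methods (helper name hypothetical):

import asyncio
from zarr.core.buffer import default_buffer_prototype
from zarr.storage import MemoryStore, ZipStore

async def _flush_to_zip(mem: MemoryStore, path: str) -> None:
    # Copy every key from the fully written MemoryStore into the ZipStore
    # exactly once, so no zip entry is ever appended twice.
    dst = await ZipStore.open(path, mode="w")
    proto = default_buffer_prototype()
    async for key in mem.list():
        buf = await mem.get(key, prototype=proto)
        await dst.set(key, buf)
    dst.close()

# usage: asyncio.run(_flush_to_zip(mem, "out.zarr.zip"))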

Performance

Awkward arrays use ~25% less peak RAM than the previous JSON-string approach (typed buffers pack tighter). Write time is roughly equivalent end-to-end because the expression matrix (X) dominates:

n_obs   tree nodes   peak RAM   write time
5k      10,005       5.2 MB     1.9 s
20k     40,005       19.8 MB    4.5 s
100k    200,006      79.9 MB    18.0 s
500k    1,000,007    352.8 MB   80.7 s


@ilan-gold left a comment


My questions go in this direction:

If the graph can be represented as a list of edges and nodes, and these have a fixed-length set of "fields" (as in a recarray), I think you should use recarrays.

If they have variable lengths, I would still think hard about recarrays and potentially just have a free "string" field. I don't know the data well enough though.

It could also be good to publish your spec on https://treedata.readthedocs.io/en/latest/index.html

Comment on lines +80 to +85
nodes = [str(n) for n in G.nodes()]
edges = [(str(u), str(v)) for u, v in G.edges()]
_write_elem(tg, "nodes", np.array(nodes, dtype=object), dataset_kwargs={})
edge_arr = np.array(edges, dtype=object) if edges else np.zeros((0, 2), dtype=object)
_write_elem(tg, "edges", edge_arr, dataset_kwargs={})
node_attr_keys = {k for n in G.nodes() for k in G.nodes[n]}


1. The edges seem like a good candidate for recarrays: https://numpy.org/doc/stable/reference/generated/numpy.recarray.html
2. If edges contain the node->node relationship, do you need nodes?
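
What the reviewer is suggesting, roughly (a hedged illustration, not code from the PR):

import numpy as np

# A record array gives edges a fixed (src, dst) schema and a single typed
# on-disk array instead of a 2-D object array.
edges = np.rec.fromrecords(
    [("root", "A"), ("root", "B"), ("A", "C")],
    dtype=[("src", "U32"), ("dst", "U32")],
)
print(edges.src)  # ['root' 'root' 'A']

A separate nodes array would still be needed for isolated nodes that appear in no edge.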

Comment on lines +86 to +100
if node_attr_keys:
    _write_elem(
        tg,
        "node_attrs",
        {k: ak.from_iter([_make_serializable(G.nodes[n].get(k)) for n in G.nodes()]) for k in node_attr_keys},
        dataset_kwargs={},
    )
edge_attr_keys = {k for u, v in G.edges() for k in G[u][v]}
if edge_attr_keys:
    _write_elem(
        tg,
        "edge_attrs",
        {k: ak.from_iter([_make_serializable(G[u][v].get(k)) for u, v in G.edges()]) for k in edge_attr_keys},
        dataset_kwargs={},
    )


Are these attrs structured in any way? Could they be done with recarrays as well?



Also if there are a lot of k, won't this make a lot of directories too?

