Fix zarr ZipStore write failure for large trees (v0.2.6) #86
Conversation
Codecov Report

❌ Patch coverage is …
Additional details and impacted files

```diff
@@            Coverage Diff            @@
##             main      #86     +/-  ##
=========================================
- Coverage   92.56%   92.47%   -0.10%
=========================================
  Files           7        7
  Lines         686      757     +71
=========================================
+ Hits          635      700     +65
- Misses         51       57      +6
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 97a7d98e2e
```python
_write_elem(
    tg,
    "node_attrs",
    {k: _make_attr_array([G.nodes[n].get(k) for n in G.nodes()]) for k in node_attr_keys},
```
Preserve explicit None tree attributes in zarr writes
Using G.nodes[n].get(k) here collapses two distinct cases—missing attribute and attribute explicitly set to None—into the same stored value. Because _read_axis_trees_zarr later drops None entries (if v[i] is not None), any node/edge attribute intentionally set to None is lost after a zarr round-trip, which is a behavior regression from the previous JSON serialization path that preserved explicit None values.
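A minimal sketch of the collapse, with one possible sentinel-based workaround; the sentinel name `_MISSING` is hypothetical and not what the PR uses:

```python
import networkx as nx

G = nx.DiGraph()
G.add_node("a", label=None)   # attribute explicitly set to None
G.add_node("b")               # attribute missing entirely

# .get(k) makes the two cases indistinguishable:
print([G.nodes[n].get("label") for n in G.nodes()])  # [None, None]

# One possible fix: a dedicated sentinel for "missing", so explicit None survives.
_MISSING = object()
column = [G.nodes[n].get("label", _MISSING) for n in G.nodes()]
print([v is _MISSING for v in column])  # [False, True]
```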
```python
g = f.require_group(key)
for name, G in trees.items():
    tg = g.require_group(str(name))
```
Remove stale tree groups before writing axis trees
This writes into obst/vart with require_group but never removes existing child groups that are no longer present in trees. When users pass an existing zarr.Group and rewrite data, deleted trees remain on disk and are still returned by read_zarr, so the persisted tree set can diverge from the in-memory TreeData being written.
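A sketch of the suggested cleanup, assuming zarr's `Group` supports `del group[name]`; the helper name and structure are illustrative, not the PR's code:

```python
def _write_axis_trees(f, key, trees):
    """Sketch: drop stale child groups before writing the current trees."""
    g = f.require_group(key)
    keep = {str(name) for name in trees}
    # Remove groups left over from a previous write of this key, so the
    # persisted tree set cannot diverge from the in-memory TreeData.
    for stale in set(g.keys()) - keep:
        del g[stale]
    for name, G in trees.items():
        tg = g.require_group(str(name))
        # ... per-tree arrays written here, as in the diff above ...
```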
Won't this create a lot of files?
Yes, it will. I'm not a zarr expert. Will this affect performance or file size?
Yeah, that's probably going to make a lot of file systems very unhappy for large trees.
Stores tree node/edge attributes as columnar awkward arrays (one `ak.Array` per attribute key) using anndata's existing `write_elem`/`read_elem` codec. This avoids zarr's scalar string size limit that caused `TypeError` on large trees, stores typed data natively (`float64`, `var * string`, etc.), and eliminates the duplicate `zarr.json` warnings from `ZipStore`'s append-only behavior by routing writes through an intermediate `MemoryStore`. Adds `awkward>=2` as a dependency.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Force-pushed from 3785b2b to 8740b50.
Implementation overview

This PR fixes two distinct zarr v3 issues that surface when writing large `TreeData` objects; the root causes and fixes are described under "Root causes fixed" below.
| n_obs | tree nodes | peak RAM | write time |
|---|---|---|---|
| 5k | 10,005 | 5.2 MB | 1.9 s |
| 20k | 40,005 | 19.8 MB | 4.5 s |
| 100k | 200,006 | 79.9 MB | 18.0 s |
| 500k | 1,000,007 | 352.8 MB | 80.7 s |
ilan-gold left a comment
My questions go in this direction:

If the graph can be represented as a list of edges and nodes, and these have a fixed number of "fields" (as in a recarray), I think you should use recarrays.

If they have variable lengths, I would still think hard about recarrays and potentially just have a free "string" field. I don't know the data well enough though.

It could also be good to publish your spec on https://treedata.readthedocs.io/en/latest/index.html
```python
nodes = [str(n) for n in G.nodes()]
edges = [(str(u), str(v)) for u, v in G.edges()]
_write_elem(tg, "nodes", np.array(nodes, dtype=object), dataset_kwargs={})
edge_arr = np.array(edges, dtype=object) if edges else np.zeros((0, 2), dtype=object)
_write_elem(tg, "edges", edge_arr, dataset_kwargs={})
node_attr_keys = {k for n in G.nodes() for k in G.nodes[n]}
```
1. The edges seem like a good candidate for recarrays: https://numpy.org/doc/stable/reference/generated/numpy.recarray.html
2. If edges contain the node->node relationship, do you need nodes?
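A sketch of the structured-array idea for edges; the field names and the fixed-width `U32` dtype are assumptions, and a single-node tree has no edges, so a separate node list may still be needed for that case:

```python
import networkx as nx
import numpy as np

G = nx.DiGraph([("root", "a"), ("root", "b"), ("a", "leaf-0")])

# One record per edge instead of an object array of string pairs.
edge_dtype = np.dtype([("source", "U32"), ("target", "U32")])
edges = np.array([(str(u), str(v)) for u, v in G.edges()], dtype=edge_dtype)

print(edges["source"])                  # parent node IDs as a plain string column
print(edges["target"])                  # child node IDs
print(edges.view(np.recarray).source)   # same data with attribute-style access
```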
```python
if node_attr_keys:
    _write_elem(
        tg,
        "node_attrs",
        {k: ak.from_iter([_make_serializable(G.nodes[n].get(k)) for n in G.nodes()]) for k in node_attr_keys},
        dataset_kwargs={},
    )
edge_attr_keys = {k for u, v in G.edges() for k in G[u][v]}
if edge_attr_keys:
    _write_elem(
        tg,
        "edge_attrs",
        {k: ak.from_iter([_make_serializable(G[u][v].get(k)) for u, v in G.edges()]) for k in edge_attr_keys},
        dataset_kwargs={},
    )
```
Are these attrs structured in any way? Could they be done with recarrays as well?
Also if there are a lot of k, won't this make a lot of directories too?
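For a concrete sense of what these per-key columns hold, a small sketch of `ak.from_iter` on a mixed attribute column; the attribute values are made up:

```python
import awkward as ak

# Hypothetical node-attribute column: a string, a list, and a node with no value.
# from_iter builds a single typed (option/union) column from the mixture.
clade = ak.from_iter(["A", ["A", "B"], None])
print(clade.type)       # e.g. 3 * option[union[string, var * string]]
print(clade.to_list())  # ['A', ['A', 'B'], None]
```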
Summary
Fixes a `TypeError: string too large to store inside array` crash when writing a `TreeData` object with large trees to a zarr `ZipStore` (reported by a user with ~1.8M cells).

Root causes fixed
1. Scalar JSON string too large for zarr v3
The previous zarr serialization of `obst`/`vart` converted all trees into a single giant JSON string and stored it as a zarr scalar. zarr v3 cannot store scalar strings above a certain size, causing a hard crash for large datasets.

The new format stores each `DiGraph` as a zarr subgroup with columnar arrays: one 1-D array per node/edge attribute key, plus separate arrays for node IDs and edge pairs. Each array element is a small JSON-encoded string, which zarr handles without any size limit. This also handles heterogeneous and complex attribute types (`list`, `tuple`, `set`, `np.ndarray`, `pd.Series`, nested structures) correctly.
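A hypothetical look at the resulting layout for one tree; the store path and tree name are invented, but the child names match the write calls shown in the diff above:

```python
import zarr

root = zarr.open_group("example.zarr", mode="r")
tree = root["obst"]["lineage"]   # one subgroup per tree under obst / vart
print(sorted(tree.keys()))       # e.g. ['edge_attrs', 'edges', 'node_attrs', 'nodes']
print(tree["nodes"][:3])         # first few node IDs
```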
2. Duplicate `zarr.json` entries in ZipStore

zarr v3's `ZipStore` is append-only: every attribute mutation re-writes `zarr.json`, producing `UserWarning: Duplicate name` for every group in the file. Writes to a `ZipStore` now go through an intermediate `MemoryStore` (which supports in-place overwriting), and the final state is bulk-copied to the `ZipStore` exactly once per key, eliminating all duplicate-entry warnings.

3. ZipStore not auto-detected on read
`read_zarr` now auto-detects `.zip` paths and opens them as `ZipStore` automatically, so users can pass a plain string path rather than a pre-opened store object.
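A minimal sketch of the auto-detection idea; the helper name is hypothetical and `ZipStore` constructor arguments may differ between zarr versions:

```python
from pathlib import Path
import zarr

def _resolve_store(path_or_store):
    """Open .zip paths as a ZipStore for reading; pass anything else through."""
    if isinstance(path_or_store, (str, Path)) and str(path_or_store).endswith(".zip"):
        return zarr.storage.ZipStore(path_or_store, mode="r")
    return path_or_store
```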
Backward compatibility

- HDF5 (`.h5td`) format: unchanged

Performance (ZipStore path only)
The MemoryStore intermediate adds ~0.9 MB peak RAM per 1K cells but is actually slightly faster than writing directly to disk. The non-ZipStore zarr path (directory store) is completely unaffected.
Test plan
- `test_zarr_dtypes`: round-trips `list`, `tuple`, `set`, `np.float64`, `np.ndarray`, `pd.Series`, nested lists, `str`, `int`, `float`, `bool`
- `test_zarr_edge_attrs`: complex edge attribute types survive round-trip
- `test_zarr_missing_node_attrs`: nodes lacking a given attribute do not gain it as `None` after round-trip
- `test_zarr_zip_store`: end-to-end write/read through `ZipStore`, zero duplicate-name warnings
- `test_zarr_large_tree`: 2047-node tree with multiple node and edge attributes

🤖 Generated with Claude Code