
Fix zarr ZipStore write failure for large trees (v0.2.6)#86

Open
colganwi wants to merge 2 commits into main from fix/zarr-zipstore-large-trees

Conversation

@colganwi
Collaborator

Summary

Fixes a TypeError: string too large to store inside array crash when writing a TreeData object with large trees to a zarr ZipStore (reported by a user with ~1.8M cells).

Root causes fixed

1. Scalar JSON string too large for zarr v3
The previous zarr serialization of obst/vart converted each tree into a single giant JSON string and stored it as a zarr scalar. zarr v3 cannot store scalar strings above a certain size, causing a hard crash for large datasets.

The new format stores each DiGraph as a zarr subgroup with columnar arrays — one 1-D array per node/edge attribute key, plus separate arrays for node IDs and edge pairs. Each array element is a small JSON-encoded string, which zarr handles without any size limit. This also handles heterogeneous and complex attribute types (list, tuple, set, np.ndarray, pd.Series, nested structures) correctly.
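
As a rough sketch of what that columnar layout implies for the writer (the helper name and exact encoding here are illustrative, not the PR's code):

import json
import networkx as nx
import numpy as np

def encode_node_attr_columns(G: nx.DiGraph) -> dict:
    # One 1-D object array per attribute key; each element is a small
    # JSON-encoded string, so no single zarr value grows with tree size.
    keys = {k for n in G.nodes for k in G.nodes[n]}
    return {
        k: np.array([json.dumps(G.nodes[n].get(k)) for n in G.nodes], dtype=object)
        for k in keys
    }

(Complex values such as np.ndarray or pd.Series would first need conversion to a JSON-serializable form, which is presumably what the PR's helper does before encoding.)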

2. Duplicate zarr.json entries in ZipStore
zarr v3's ZipStore is append-only: every attribute mutation re-writes zarr.json, producing UserWarning: Duplicate name for every group in the file. Writes to a ZipStore now go through an intermediate MemoryStore (which supports in-place overwriting), and the final state is bulk-copied to the ZipStore exactly once per key — eliminating all duplicate-entry warnings.
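
Conceptually, the buffered write path looks like this (a sketch against zarr v3's public API, not the PR's code; the one-pass copy itself is sketched in the implementation overview later in this thread):

import zarr
from zarr.storage import MemoryStore

# Phase 1: build the whole hierarchy in memory, where zarr.json can be
# overwritten freely as groups, arrays, and attributes are added.
mem = MemoryStore()
root = zarr.group(store=mem)
root.create_group("obst")
root.attrs["illustrative"] = True   # mutation rewrites zarr.json in place
# Phase 2: copy the finished state into the append-only ZipStore, once per key.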

3. ZipStore not auto-detected on read
read_zarr now auto-detects .zip paths and opens them as ZipStore automatically, so users can pass a plain string path rather than a pre-opened store object.
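
A minimal sketch of that detection (illustrative; the function name and exact calls are assumptions, not the PR's implementation):

import zarr
from zarr.storage import ZipStore

def open_zarr_auto(path: str) -> zarr.Group:
    # Treat *.zip paths as ZipStores; everything else stays a directory store.
    if str(path).endswith(".zip"):
        return zarr.open_group(ZipStore(path, mode="r"), mode="r")
    return zarr.open_group(path, mode="r")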

Backward compatibility

  • HDF5 (.h5td) format: unchanged
  • Old zarr files written with the previous scalar-string format: still read correctly (format detected at read time)
  • New zarr files use a group-based format that is not readable by treedata < 0.2.6

Performance (ZipStore path only)

cells   tree nodes   peak extra RAM   write time
5K      10K          5.6 MB           1.7 s
20K     40K          21.2 MB          4.2 s
100K    200K         105.9 MB         17.3 s
500K    1M           441 MB           78.5 s

The MemoryStore intermediate adds ~0.9 MB peak RAM per 1K cells but is actually slightly faster than writing directly to disk. The non-ZipStore zarr path (directory store) is completely unaffected.

Test plan

  • All existing h5td and zarr tests pass
  • test_zarr_dtypes — round-trips list, tuple, set, np.float64, np.ndarray, pd.Series, nested lists, str, int, float, bool
  • test_zarr_edge_attrs — complex edge attribute types survive round-trip
  • test_zarr_missing_node_attrs — nodes lacking a given attribute do not gain it as None after round-trip
  • test_zarr_zip_store — end-to-end write/read through ZipStore, zero duplicate-name warnings
  • test_zarr_large_tree — 2047-node tree with multiple node and edge attributes
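
For flavor, a condensed round-trip in the spirit of test_zarr_zip_store (a hypothetical sketch; the real tests live in the PR, and treedata's TreeData / write_zarr / read_zarr API is assumed):

import networkx as nx
import pandas as pd
import treedata as td
from zarr.storage import ZipStore

def test_zip_roundtrip(tmp_path):
    tree = nx.DiGraph([("root", "A"), ("root", "B")])
    tree.nodes["A"]["depth"] = 1.0
    tdata = td.TreeData(obs=pd.DataFrame(index=["A", "B"]), obst={"lineage": tree})
    store = ZipStore(tmp_path / "adata.zarr.zip", mode="w")
    tdata.write_zarr(store)
    store.close()
    back = td.read_zarr(tmp_path / "adata.zarr.zip")  # .zip auto-detected per this PR
    assert back.obst["lineage"].nodes["A"]["depth"] == 1.0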

🤖 Generated with Claude Code

@codecov

codecov Bot commented Apr 21, 2026

Codecov Report

❌ Patch coverage is 92.40506% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.47%. Comparing base (0091e87) to head (0566fc5).

Files with missing lines      Patch %   Lines
src/treedata/_core/write.py   91.11%    4 Missing ⚠️
src/treedata/_core/read.py    94.11%    2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #86      +/-   ##
==========================================
- Coverage   92.56%   92.47%   -0.10%     
==========================================
  Files           7        7              
  Lines         686      757      +71     
==========================================
+ Hits          635      700      +65     
- Misses         51       57       +6     
Files with missing lines      Coverage Δ
src/treedata/_core/read.py    87.50% <94.11%> (+2.27%) ⬆️
src/treedata/_core/write.py   93.79% <91.11%> (-1.76%) ⬇️


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 97a7d98e2e


Comment thread on src/treedata/_core/write.py (outdated)

_write_elem(
    tg,
    "node_attrs",
    {k: _make_attr_array([G.nodes[n].get(k) for n in G.nodes()]) for k in node_attr_keys},


P2: Preserve explicit None tree attributes in zarr writes

Using G.nodes[n].get(k) here collapses two distinct cases—missing attribute and attribute explicitly set to None—into the same stored value. Because _read_axis_trees_zarr later drops None entries (if v[i] is not None), any node/edge attribute intentionally set to None is lost after a zarr round-trip, which is a behavior regression from the previous JSON serialization path that preserved explicit None values.

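One standard way to keep the two cases apart (a sketch of the general technique, not necessarily the fix this PR will adopt) is a sentinel distinct from None:

import networkx as nx

_MISSING = object()  # marks "attribute absent", leaving None as a real value

def node_column(G: nx.DiGraph, key: str) -> list:
    # Explicit None survives; only _MISSING entries should be dropped on read.
    return [G.nodes[n].get(key, _MISSING) for n in G.nodes()]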

Comment on lines +81 to +83
g = f.require_group(key)
for name, G in trees.items():
    tg = g.require_group(str(name))


P2: Remove stale tree groups before writing axis trees

This writes into obst/vart with require_group but never removes existing child groups that are no longer present in trees. When users pass an existing zarr.Group and rewrite data, deleted trees remain on disk and are still returned by read_zarr, so the persisted tree set can diverge from the in-memory TreeData being written.

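A sketch of the suggested cleanup (hypothetical helper; in this PR's write path the deletes would happen against the intermediate MemoryStore, since ZipStore itself cannot delete entries):

import zarr

def prune_stale_trees(g: zarr.Group, trees: dict) -> None:
    # Drop child groups that no longer correspond to an in-memory tree.
    for name in set(g.group_keys()) - {str(n) for n in trees}:
        del g[name]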

@ilan-gold

ilan-gold commented Apr 21, 2026

one 1-D array per node/edge attribute key, plus separate arrays for node IDs and edge pairs. Each array element is a small JSON-encoded string, which zarr handles without any size limit. This also handles heterogeneous and complex attribute types (list, tuple, set, np.ndarray, pd.Series, nested structures) correctly.

Won't this create n_obs number of directories?

@colganwi
Collaborator Author

Won't this create n_obs number of directories?

Yes, it will. I'm not a zarr expert. Will this affect performance or file size?

@ilan-gold

Yeah, that's probably going to make a lot of file systems very unhappy for large n_obs, I think

Stores tree node/edge attributes as columnar awkward arrays (one ak.Array
per attribute key) using anndata's existing write_elem/read_elem codec.
This avoids zarr's scalar string size limit that caused TypeError on large
trees, stores typed data natively (float64, var * string, etc.), and
eliminates the duplicate zarr.json warnings from ZipStore's append-only
behavior by routing writes through an intermediate MemoryStore.

Adds awkward>=2 as a dependency.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@colganwi force-pushed the fix/zarr-zipstore-large-trees branch from 3785b2b to 8740b50 on April 21, 2026 at 22:15
@colganwi
Collaborator Author

Implementation overview

This PR fixes two distinct zarr v3 issues that surface when writing large TreeData objects to a .zarr.zip file.

1. TypeError: string too large on large trees

Root cause: The old format serialized each graph (nodes + edges + all attributes) as a single JSON scalar stored in zarr. zarr v3 has a hard limit on scalar string size, which is exceeded by trees with ~1M+ nodes.

Fix: Tree attributes are now stored in a columnar group layout — one array per attribute key — using anndata's existing awkward-array codec (write_elem / read_elem). Each attribute column is built with ak.from_iter, which infers native types (float64, int64, var * string, option types for sparse attributes, etc.) without any JSON encoding. The zarr on-disk structure for a tree named lineage now looks like:

obst/
  lineage/
    nodes          # 1D object array of node IDs
    edges          # (E, 2) object array of (src, dst) pairs
    node_attrs/
      depth        # ak.Array, int64
      chars        # ak.Array, var * string  (ragged)
      ...
    edge_attrs/
      length       # ak.Array, float64
      ...
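
For concreteness, ak.from_iter's type inference on small inputs looks like this (a quick illustration, not code from the PR):

import awkward as ak

depth = ak.from_iter([0, 1, 1, 2])         # int64
length = ak.from_iter([1.0, 0.5, None])    # ?float64 (option type for sparse attrs)
chars = ak.from_iter([["A", "B"], ["C"]])  # var * string (ragged)
print(depth.type, length.type, chars.type)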

Backward compatibility: _decode_attr_column in read.py handles both the new ak.Array columns and the old JSON-string numpy arrays, so existing files written with the previous format are still readable.
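
A sketch of that dual-format dispatch (illustrative; the PR's _decode_attr_column may differ in detail):

import json
import awkward as ak
import numpy as np

def decode_attr_column(col) -> list:
    # New files store ak.Array columns; old files store numpy object arrays
    # whose elements are JSON-encoded strings.
    if isinstance(col, ak.Array):
        return ak.to_list(col)
    return [json.loads(s) for s in np.asarray(col)]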

2. UserWarning: Duplicate name: 'zarr.json' spam

Root cause: zarr v3's ZipStore is append-only at the zip level. Every time zarr writes or updates metadata (zarr.json), it appends a new entry to the zip file rather than overwriting the previous one. A single write_zarr call triggers dozens of such updates.

Fix: _write_zarr_via_memory writes the full dataset to an in-memory MemoryStore first (which supports overwriting), then bulk-copies the final state to the ZipStore in one pass via zarr's async store.set() API. Each key is written exactly once.
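
In sketch form, using zarr v3's public async store methods (helper name hypothetical):

import asyncio
from zarr.core.buffer import default_buffer_prototype
from zarr.storage import MemoryStore, ZipStore

async def _flush_to_zip(mem: MemoryStore, path: str) -> None:
    # Copy every key from the fully written MemoryStore into the ZipStore
    # exactly once, so no zip entry is ever appended twice.
    dst = await ZipStore.open(path, mode="w")
    proto = default_buffer_prototype()
    async for key in mem.list():
        buf = await mem.get(key, prototype=proto)
        await dst.set(key, buf)
    dst.close()

# usage: asyncio.run(_flush_to_zip(mem, "out.zarr.zip"))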

Performance

Awkward arrays use ~25% less peak RAM than the previous JSON-string approach (typed buffers pack tighter). Write time is roughly equivalent end-to-end because the expression matrix (X) dominates:

n_obs   tree nodes   peak RAM   write time
5k      10,005       5.2 MB     1.9 s
20k     40,005       19.8 MB    4.5 s
100k    200,006      79.9 MB    18.0 s
500k    1,000,007    352.8 MB   80.7 s


@ilan-gold left a comment


My questions go in this direction:

If the graph can be represented as a list of edges and nodes, and these have a fixed-length set of "fields" (as in a recarray), I think you should use recarrays.

If they have variable lengths, I would still think hard about recarrays and potentially just have a free "string" field. I don't know the data well enough though.

It could also be good to publish your spec on https://treedata.readthedocs.io/en/latest/index.html

Comment on lines +80 to +85
nodes = [str(n) for n in G.nodes()]
edges = [(str(u), str(v)) for u, v in G.edges()]
_write_elem(tg, "nodes", np.array(nodes, dtype=object), dataset_kwargs={})
edge_arr = np.array(edges, dtype=object) if edges else np.zeros((0, 2), dtype=object)
_write_elem(tg, "edges", edge_arr, dataset_kwargs={})
node_attr_keys = {k for n in G.nodes() for k in G.nodes[n]}


1. The edges seem like a good candidate for recarrays: https://numpy.org/doc/stable/reference/generated/numpy.recarray.html
2. If edges contain the node->node relationship, do you need nodes?
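
What the reviewer is suggesting, roughly (a hedged illustration, not code from the PR):

import numpy as np

# A record array gives edges a fixed (src, dst) schema and a single typed
# on-disk array instead of a 2-D object array.
edges = np.rec.fromrecords(
    [("root", "A"), ("root", "B"), ("A", "C")],
    dtype=[("src", "U32"), ("dst", "U32")],
)
print(edges.src)  # ['root' 'root' 'A']

A separate nodes array would still be needed for isolated nodes that appear in no edge.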

Comment on lines +86 to +100
if node_attr_keys:
    _write_elem(
        tg,
        "node_attrs",
        {k: ak.from_iter([_make_serializable(G.nodes[n].get(k)) for n in G.nodes()]) for k in node_attr_keys},
        dataset_kwargs={},
    )
edge_attr_keys = {k for u, v in G.edges() for k in G[u][v]}
if edge_attr_keys:
    _write_elem(
        tg,
        "edge_attrs",
        {k: ak.from_iter([_make_serializable(G[u][v].get(k)) for u, v in G.edges()]) for k in edge_attr_keys},
        dataset_kwargs={},
    )


Are these attrs structured in any way? Could they be done with recarrays as well?



Also if there are a lot of k, won't this make a lot of directories too?

