
ICU: per-item zstd compression of libicudata #237

Open: dylan-conway wants to merge 5 commits into main from claude/icu-compress-data


Conversation

@dylan-conway
Member

@dylan-conway dylan-conway commented May 22, 2026

Compresses ICU's five display-name trees (curr/ lang/ region/ unit/ zone/, non-en) per-item with zstd and adds a two-line hook in udata.cpp so Bun decompresses on first lookup. Everything else — collation, segmentation, locale format patterns, properties, normalization, tz rules — stays raw, so Intl.Collator/Segmenter/DateTimeFormat/NumberFormat (default), Date, URL IDNA, String.normalize, regex \p{} pay zero decompression in any locale.
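
As a rough illustration of the decompress-on-first-lookup behaviour described above, here is a small TypeScript model. It is not the code from udata.cpp or Bun::ICUDecompressor: `makeLookupHook`, `maybeDecompress`, and the injected `decompress` callback are hypothetical names, and telling compressed items apart by the zstd frame magic is an assumption.

```typescript
// Hypothetical model of the lookup hook; names are illustrative, not Bun's API.

// zstd frames start with magic 0xFD2FB528, stored little-endian.
const ZSTD_MAGIC = [0x28, 0xb5, 0x2f, 0xfd];

type Decompress = (frame: Uint8Array) => Uint8Array;

// A raw ICU item starts with a uint16 header size followed by the bytes
// 0xDA 0x27, so its first four bytes can never equal the zstd magic.
function isZstdFrame(item: Uint8Array): boolean {
  return item.length >= 4 && ZSTD_MAGIC.every((b, i) => item[i] === b);
}

// Returns a hook that passes raw items through untouched and decompresses
// zstd-framed items once, caching the result by item name.
function makeLookupHook(decompress: Decompress) {
  const cache = new Map<string, Uint8Array>();
  return function maybeDecompress(name: string, item: Uint8Array): Uint8Array {
    if (!isZstdFrame(item)) return item; // hot item: zero extra work
    let out = cache.get(name);
    if (out === undefined) {
      out = decompress(item); // first lookup pays the cost
      cache.set(name, out);
    }
    return out; // later lookups hit the cache
  };
}
```

In this model the decompressor is injected, so the sketch makes no claim about which zstd binding or dictionary handling the real hook uses.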

What changes

  • icu/udata-decompress-hook.patch — applied after extracting the ICU tarball. Adds a weak extern "C" call between TOC lookup and checkDataItem; null in ICU's own tools, defined by Bun at link time.
  • icu/compress-data.ts — runs after the existing icupkg filter. Uses ICU's own icupkg -l/-x to read the package (no manual format parsing), trains a 128 KB zstd dictionary, compresses each item not in icu/hot-items.txt with the zstd CLI, writes the package back (UDataOffsetTOC — the one hand-rolled bit, since icupkg -a rejects non-ICU item bodies), and emits libicudata.a with the package + dict as .rodata symbols. Node stdlib + util.parseArgs; runs under Node's native type-stripping.
  • icu/hot-items.txt — everything except non-en display-name items. 1,655 raw / 2,115 compressed; largest compressed item 79 KB.
  • Dockerfile, Dockerfile.musl — install zstd + Node, apply the patch, run the repacker.
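
The raw/compressed split driven by icu/hot-items.txt can be sketched roughly as follows. This assumes one glob or path per line, `#` comment lines, and a `*` wildcard that also matches `/`; none of that is confirmed by the PR text, and `parseHotList` and `partition` are hypothetical helpers, not functions from icu/compress-data.ts.

```typescript
// Hypothetical sketch of hot-list matching; the file syntax is assumed.

// Convert a glob with `*` wildcards into an anchored regular expression.
function globToRegExp(glob: string): RegExp {
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^" + escaped.replace(/\*/g, ".*") + "$");
}

// One pattern per line; blank lines and `#` comments are skipped (assumed).
function parseHotList(text: string): RegExp[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line !== "" && !line.startsWith("#"))
    .map(globToRegExp);
}

// Items matching any hot pattern stay raw; everything else gets compressed.
function partition(items: string[], hot: RegExp[]) {
  const raw: string[] = [];
  const compress: string[] = [];
  for (const name of items) {
    (hot.some((re) => re.test(name)) ? raw : compress).push(name);
  }
  return { raw, compress };
}

// Illustrative item names only; the real list comes from `icupkg -l`.
const hot = parseHotList("# keep collation and English raw\ncoll/*\ncurr/en*.res\n");
const split = partition(["coll/de.res", "curr/en.res", "curr/de.res"], hot);
console.log(split); // raw: coll/de.res, curr/en.res; compress: curr/de.res
```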

Bun-side companion

oven-sh/bun#31200: Bun::ICUDecompressor singleton (the hook) + test/js/web/intl/ (30 tests, ~5,900 assertions: snapshots for 12 locales × every Intl API captured against unmodified libicudata.a, plus an exhaustive sweep that loads every compressed item). All pass identically on baseline, compressed-release, and compressed-LTO builds.

Measured (real Bun release binaries, hyperfine 50 runs)

| | baseline | with this | Δ |
|---|---|---|---|
| Stripped binary | 83,741,640 B | 75,385,800 B | −8.4 MB (same under LTO) |
| bun --version | 0.50 ms | 0.51 ms | noise |
| new Date().toString() | 6.9 ms | 6.3 ms | noise |
| Intl.DateTimeFormat("ja") | 6.1 ms | 6.1 ms | 0 |
| Intl.Collator("zh") | 5.3 ms | 5.3 ms | 0 |
| Intl.Segmenter("zh", word) | 6.6 ms | 6.6 ms | 0 |
| Intl.DisplayNames("ko").of("US") | 5.7 ms | 5.8 ms | +0.1 ms |
| NumberFormat("ru", {style:"unit"}) | 5.9 ms | 6.1 ms | +0.2 ms (worst case) |
| DateTimeFormat("ja", {timeZoneName:"long"}) | 6.1 ms | 6.3 ms | +0.2 ms |

All Intl outputs are byte-identical to baseline (中文|分词|测试, 미국, 1.234,56 €, 5 км, …). Regressions are first-call-only; results are cached afterwards.

Memory (/proc/self/status, 102 locales × 5 trees)

| | baseline Δ | with this Δ | diff |
|---|---|---|---|
| RSS total | +23.9 MB | +24.2 MB | +0.3 MB |
| RssAnon (pinned) | +8.3 MB | +16.5 MB | +8.2 MB |
| RssFile (evictable) | +15.6 MB | +7.7 MB | −7.9 MB |

Total resident memory is flat; up to ~11.3 MB shifts from evictable to pinned, reachable only by exercising DisplayNames/unit/timeZoneName/currencyDisplay:"name" across all ~500 locales. Decompressed items are not refcounted, since freeing them would force re-decompression under GC churn.
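
For anyone wanting to reproduce the pinned/evictable split, a minimal sketch of pulling those counters out of /proc/self/status text is below. `parseRss` and `rssDelta` are hypothetical helpers, not part of this PR, and the sample values are made up for illustration rather than taken from the measurements above.

```typescript
// Hypothetical helpers for reading RSS counters (Linux reports them in kB).

type Rss = Record<string, number>;

// Extract VmRSS, RssAnon and RssFile (in kB) from /proc/<pid>/status text.
function parseRss(status: string): Rss {
  const out: Rss = {};
  for (const line of status.split("\n")) {
    const m = /^(VmRSS|RssAnon|RssFile):\s+(\d+)\s+kB$/.exec(line.trim());
    if (m) out[m[1]] = Number(m[2]);
  }
  return out;
}

// Per-counter growth between two snapshots, in kB.
function rssDelta(before: Rss, after: Rss): Rss {
  const out: Rss = {};
  for (const key of Object.keys(after)) {
    out[key] = after[key] - (before[key] ?? 0);
  }
  return out;
}

// Made-up snapshots, only to show the shape of the data.
const before = parseRss("VmRSS:\t   40000 kB\nRssAnon:\t   10000 kB\nRssFile:\t   30000 kB\n");
const after = parseRss("VmRSS:\t   64000 kB\nRssAnon:\t   26000 kB\nRssFile:\t   38000 kB\n");
console.log(rssDelta(before, after));
```

On Linux the input text would come from something like `readFileSync("/proc/self/status", "utf8")` (from `node:fs`), taken before and after touching the locales of interest.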

Linux-only / not in this PR

Dockerfile (glibc) and Dockerfile.musl only. build-icu.ps1 (Windows, also still on ICU 73.2), Dockerfile.android, and Dockerfile.freebsd are untouched; macOS uses the system libicucore. Validated under LTO: the weak symbol resolves, and the baseline build dead-code-eliminates the unreachable hook.

@github-actions

github-actions Bot commented May 22, 2026

Preview Builds

| Commit | Release | Date |
|---|---|---|
| cbfdad14 | autobuild-preview-pr-237-cbfdad14 | 2026-05-22 06:11:36 UTC |
| 33ac802c | autobuild-preview-pr-237-33ac802c | 2026-05-22 05:26:32 UTC |
| 72c07745 | autobuild-preview-pr-237-72c07745 | 2026-05-22 02:07:25 UTC |

@dylan-conway dylan-conway marked this pull request as ready for review May 22, 2026 05:19
@coderabbitai

coderabbitai Bot commented May 22, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 94c7cde5-4f8f-4dda-b2fe-6a36a18eb249

📥 Commits

Reviewing files that changed from the base of the PR and between 33ac802 and cbfdad1.

📒 Files selected for processing (1)
  • icu/compress-data.ts

Walkthrough

Adds a Node.js compression tool and Docker build steps to repack ICU common-data with per-item zstd compression, a hot-item whitelist, and a weak runtime decompression hook applied by patching ICU's udata loader.

Changes

ICU data compression build and runtime

| Layer / File(s) | Summary |
|---|---|
| Build environment dependencies (Dockerfile, Dockerfile.musl) | Adds zstd, xz-utils, Node installation, and patch to Docker build stages so the image can run the compression tool and apply source patches. |
| ICU data compression tool (icu/compress-data.ts) | Adds a Node.js/TypeScript CLI that extracts ICU packages via icupkg, trains a zstd dictionary, conditionally compresses items into per-item zstd frames, rebuilds the ICU package binary (TOC, name pool, 16-byte alignment), verifies integrity, embeds the rebuilt .dat and dictionary into assembly, compiles with cc, and archives to .a. |
| Hot items configuration (icu/hot-items.txt) | Lists globs and paths marking ICU items that must remain uncompressed (hot); restricts compression to the five Intl.DisplayNames trees and enumerates specific hot resource patterns. |
| Runtime decompression hook (icu/udata-decompress-hook.patch) | Patches source/common/udata.cpp to declare a weak bun_icu_maybe_decompress symbol and, if present, pass per-item DataHeader through it to potentially decompress zstd-framed data before header checks. |
| Docker build orchestration (Dockerfile, Dockerfile.musl) | Stages local icu/ assets into images, applies the udata decompression patch during the ICU build, generates filtered icudt75l.dat, and runs node --experimental-strip-types /icu-bun/compress-data.ts to produce the final libicudata.a, skipping entries listed in icu-bun/hot-items.txt. |

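
The "TOC, name pool, 16-byte alignment" rebuild mentioned in the summary can be pictured with the sketch below. It follows the UDataOffsetTOC layout from ICU's ucmndata.h (a uint32 count, then nameOffset/dataOffset pairs relative to the start of the TOC, with names sorted for binary search), but `buildToc` is a hypothetical helper rather than the code in icu/compress-data.ts. It omits the ICU data header that precedes the TOC in a real .dat file and assumes little-endian output with the TOC starting on a 16-byte boundary.

```typescript
// Hypothetical sketch of an offset-TOC package body (no leading data header).

interface Item {
  name: string; // ASCII item name
  body: Uint8Array; // raw ICU item or zstd frame
}

const align16 = (n: number): number => (n + 15) & ~15;

function buildToc(items: Item[]): Uint8Array {
  const enc = new TextEncoder();
  // ICU looks names up by binary search, so entries must be sorted.
  const sorted = [...items].sort((a, b) =>
    a.name < b.name ? -1 : a.name > b.name ? 1 : 0,
  );
  const names = sorted.map((it) => enc.encode(it.name + "\0"));
  const tocSize = 4 + 8 * sorted.length; // count + (nameOffset, dataOffset) pairs
  const namesSize = names.reduce((n, b) => n + b.length, 0);

  // Assign each body a 16-byte-aligned offset relative to the TOC start.
  let cursor = align16(tocSize + namesSize);
  const dataOffsets = sorted.map((it) => {
    const offset = cursor;
    cursor = align16(offset + it.body.length);
    return offset;
  });

  const out = new Uint8Array(cursor); // zero-filled, so padding is already 0
  const view = new DataView(out.buffer);
  view.setUint32(0, sorted.length, true);
  let nameOffset = tocSize;
  sorted.forEach((it, i) => {
    view.setUint32(4 + 8 * i, nameOffset, true);
    view.setUint32(8 + 8 * i, dataOffsets[i], true);
    out.set(names[i], nameOffset);
    out.set(it.body, dataOffsets[i]);
    nameOffset += names[i].length;
  });
  return out;
}
```

Because only the offsets are recorded, the item body can be an arbitrary byte string such as a zstd frame, which is consistent with the PR's note that this is the one part written by hand.
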
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The PR description is comprehensive and detailed, covering all changes, rationale, measurements, and companion PRs. However, it does not follow the WebKit template format (no Bugzilla reference, no formal structure). | Add a Bugzilla bug reference, restructure to match the template format with commit message style, and include the standard sections (bug title, reviewer acknowledgment, explanation). |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: per-item zstd compression of ICU's libicudata library, which is the primary focus of all modifications. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.


