ICU: per-item zstd compression of libicudata #237
Walkthrough: Adds a Node.js compression tool and Docker build steps to repack ICU common-data with per-item zstd compression, a hot-item whitelist, and a weak runtime decompression hook applied by patching ICU's udata loader.
Compresses ICU's five display-name trees (`curr/`, `lang/`, `region/`, `unit/`, `zone/`; non-`en` only) per-item with zstd and adds a two-line hook in `udata.cpp` so Bun decompresses on first lookup. Everything else (collation, segmentation, locale format patterns, properties, normalization, tz rules) stays raw, so `Intl.Collator`/`Segmenter`/`DateTimeFormat`/`NumberFormat` (default), `Date`, URL IDNA, `String.normalize`, and regex `\p{}` pay zero decompression in any locale.

### What changes
- `icu/udata-decompress-hook.patch`: applied after extracting the ICU tarball. Adds a weak `extern "C"` call between the TOC lookup and `checkDataItem`; the symbol is null in ICU's own tools and defined by Bun at link time.
- `icu/compress-data.ts`: runs after the existing `icupkg` filter. Uses ICU's own `icupkg -l`/`-x` to read the package (no manual format parsing), trains a 128 KB zstd dictionary, compresses each item not in `icu/hot-items.txt` with the `zstd` CLI, writes the package back (`UDataOffsetTOC`, the one hand-rolled bit, since `icupkg -a` rejects non-ICU item bodies), and emits `libicudata.a` with the package and dictionary as `.rodata` symbols. Node stdlib + `util.parseArgs`; runs under Node's native type-stripping.
- `icu/hot-items.txt`: everything except non-`en` display-name items. 1,655 raw / 2,115 compressed; largest compressed item 79 KB.
- `Dockerfile`, `Dockerfile.musl`: install `zstd` + Node, apply the patch, run the repacker.

### Bun-side companion
oven-sh/bun#31200: `Bun::ICUDecompressor` singleton (the hook) + `test/js/web/intl/` (30 tests, ~5,900 assertions: snapshots for 12 locales × every `Intl` API captured against unmodified `libicudata.a`, plus an exhaustive sweep that loads every compressed item). All pass identically on baseline, compressed-release, and compressed-LTO builds.

### Measured (real Bun release binaries, hyperfine 50 runs)
Benchmarked cases:

- `bun --version`
- `new Date().toString()`
- `Intl.DateTimeFormat("ja")`
- `Intl.Collator("zh")`
- `Intl.Segmenter("zh", word)`
- `Intl.DisplayNames("ko").of("US")`
- `NumberFormat("ru", {style:"unit"})`
- `DateTimeFormat("ja", {timeZoneName:"long"})`

All `Intl` outputs are byte-identical to baseline (`中文|分词|测试`, `미국`, `1.234,56 €`, `5 км`, …). Regressions are first-call-only; decompressed items are cached afterwards.

### Memory (`/proc/self/status`, 102 locales × 5 trees)
Fields measured: `RssAnon` (pinned) and `RssFile` (evictable).

Total resident is flat; up to ~11.3 MB shifts from evictable to pinned, reachable only via `DisplayNames`/`unit`/`timeZoneName`/`currencyDisplay:"name"` across all ~500 locales. No refcounting (it would re-decompress on GC churn).

### Linux-only / not in this PR
`Dockerfile` (glibc) and `Dockerfile.musl` only. `build-icu.ps1` (Windows, also still on ICU 73.2), `Dockerfile.android`, and `Dockerfile.freebsd` are untouched; macOS uses the system `libicucore`. LTO validated (the weak symbol resolves; the baseline build dead-code-eliminates the unreachable hook).
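
For readers unfamiliar with the hand-rolled `UDataOffsetTOC` write mentioned under "What changes", the following is a minimal sketch of what such a table might look like. It is not the code in `icu/compress-data.ts`. It assumes the offset-TOC layout from ICU's `ucmndata.h` (a `uint32` count, then `nameOffset`/`dataOffset` pairs relative to the TOC start, NUL-terminated names sorted for binary search, item bodies on 16-byte boundaries), assumes little-endian output, and omits the package's leading data header. `PackageItem`, `align16`, and `buildOffsetToc` are hypothetical names introduced for illustration.

```typescript
// Hypothetical item shape; not taken from icu/compress-data.ts.
interface PackageItem {
  name: string; // ASCII item name as listed by the package, e.g. a .res path
  body: Uint8Array; // raw or zstd-compressed item bytes
}

// Round up to the next multiple of 16 (assumed item alignment).
const align16 = (n: number): number => (n + 15) & ~15;

// Build count + entries + names + bodies. Offsets are relative to byte 0 of
// the returned buffer, i.e. the TOC start; the data header is omitted here.
function buildOffsetToc(items: PackageItem[]): Uint8Array {
  // The lookup is a binary search by name, so entries must be sorted.
  const sorted = [...items].sort((a, b) =>
    a.name < b.name ? -1 : a.name > b.name ? 1 : 0,
  );
  const encoder = new TextEncoder();
  const names = sorted.map((item) => encoder.encode(item.name + "\0"));

  const entriesEnd = 4 + 8 * sorted.length; // uint32 count + 2 x uint32 per entry
  const namesLength = names.reduce((sum, n) => sum + n.length, 0);
  const dataStart = align16(entriesEnd + namesLength);
  const total = sorted.reduce(
    (offset, item) => align16(offset + item.body.length),
    dataStart,
  );

  const out = new Uint8Array(total);
  const view = new DataView(out.buffer);
  view.setUint32(0, sorted.length, true); // little-endian is an assumption

  let nameOffset = entriesEnd;
  let dataOffset = dataStart;
  sorted.forEach((item, i) => {
    view.setUint32(4 + 8 * i, nameOffset, true);
    view.setUint32(8 + 8 * i, dataOffset, true);
    out.set(names[i], nameOffset);
    out.set(item.body, dataOffset);
    nameOffset += names[i].length;
    dataOffset = align16(dataOffset + item.body.length);
  });
  return out;
}
```

Because the table stores only offsets, a reader of this layout never inspects item bodies, which is consistent with compressed bodies being storable here even though `icupkg -a` rejects them.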