
Conversation

@The-Obstacle-Is-The-Way The-Obstacle-Is-The-Way commented Dec 6, 2025

Summary

  • Remove `import gc` and the `gc.collect()` calls that were added for the invalid issue #7893
  • Update the docstring to reference the actual bug #7894 instead of OOM
  • Clarify the workaround comment with a link to the upstream issue and PR #7896

Background

We filed two bug reports against huggingface/datasets:

  • #7893 (OOM) - Investigation proved this was INVALID. We closed it. The `free_memory=True` default prevents memory accumulation.
  • #7894 (SIGKILL) - This is VALID. The crash occurs in `embed_table_storage` with `Sequence()` nested types after `ds.shard()`.

Our codebase had workarounds for both. This PR removes the unnecessary #7893 workaround while keeping the essential #7894 workaround.

Changes

| Removed (invalid #7893) | Kept (valid #7894) |
| --- | --- |
| `import gc` | Pandas round-trip workaround |
| `gc.collect()` call | `combine_chunks()` call |
| OOM references in docstring | Link to upstream #7894 |

Test Plan

  • `uv run pytest` - 45 tests pass
  • `uv run ruff check .` - all checks pass
  • `uv run mypy src tests` - no issues
  • Pre-commit hooks pass

References

Summary by CodeRabbit

  • Chores
    • Updated internal documentation for dataset processing.
    • Removed unused code dependencies to improve maintainability.


- Remove `import gc` and `gc.collect()` calls (was for invalid #7893)
- Update docstring to reference actual bug #7894 instead of OOM
- Clarify workaround comment with link to upstream issue and PR

The `gc.collect()` call was added in the belief that we had memory accumulation (#7893).
Investigation proved #7893 was invalid: the `free_memory=True` default handles it.
The actual bug is #7894 (SIGKILL in `embed_table_storage` with `Sequence()` types).

The pandas round-trip workaround remains until upstream PR #7896 is merged.
See: huggingface/datasets#7894

coderabbitai bot commented Dec 6, 2025

Walkthrough

Documentation and comments in the push_dataset_to_hub function were updated to clarify a workaround for an upstream huggingface/datasets issue. The module docstring was expanded, "CRITICAL FIX" commentary was replaced with detailed explanation, and an unused gc import was removed. No functional behavior changed.

Changes

Cohort / File(s) Summary
Documentation and import cleanup
src/arc_bids/core.py
Expanded module docstring for push_dataset_to_hub to reference huggingface/datasets upstream issue #7894; replaced "CRITICAL FIX" commentary with "WORKAROUND" explanation; updated post-shard cleanup comments; removed unused gc import and explicit gc.collect() call

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

  • Documentation updates are straightforward; verify the referenced upstream issue number and link are accurate
  • Confirm the removal of gc.collect() has no unintended memory management implications in practice


Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and accurately summarizes the main changes: removing the invalid #7893 workaround and clarifying the #7894 workaround, which aligns with the PR objectives and file modifications. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, above the required 80.00% threshold. |

📜 Recent review details


📥 Commits

Reviewing files that changed from the base of the PR and between f6e7e80 and 6fb23ad.

📒 Files selected for processing (1)
  • src/arc_bids/core.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Use absolute file paths when calling embed_external_files=True in datasets operations
Always explicitly specify num_shards=len(file_table) when pushing datasets to HuggingFace Hub to avoid out-of-memory crashes from incorrect automatic shard estimation

Files:

  • src/arc_bids/core.py
{src,}/**/core.py

📄 CodeRabbit inference engine (CLAUDE.md)

In core.py, implement generic BIDS→HuggingFace conversion logic that is reusable for any dataset, not ARC-specific

Files:

  • src/arc_bids/core.py
src/**/*.py

📄 CodeRabbit inference engine (GEMINI.md)

src/**/*.py: Maintain strict mypy type checking compliance and use from __future__ import annotations in all Python files
Use ruff as the authority for code formatting and linting across the codebase

Files:

  • src/arc_bids/core.py
🧠 Learnings
📓 Common learnings
Learnt from: CR
Repo: The-Obstacle-Is-The-Way/arc-aphasia-bids PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-03T00:02:34.352Z
Learning: Applies to **/test_*.py : In tests, mock `push_dataset_to_hub` calls for dry-run tests to avoid actual Hub uploads
Learnt from: CR
Repo: The-Obstacle-Is-The-Way/arc-aphasia-bids PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-03T00:02:34.352Z
Learning: Applies to **/*.py : Always explicitly specify `num_shards=len(file_table)` when pushing datasets to HuggingFace Hub to avoid out-of-memory crashes from incorrect automatic shard estimation
Learnt from: CR
Repo: The-Obstacle-Is-The-Way/arc-aphasia-bids PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-03T00:02:34.352Z
Learning: Applies to pyproject.toml : Use git-based `datasets` dependency from PyPI main (not stable) to avoid silent empty file creation bug in `Nifti.embed_storage`
Learnt from: CR
Repo: The-Obstacle-Is-The-Way/arc-aphasia-bids PR: 0
File: GEMINI.md:0-0
Timestamp: 2025-12-03T00:02:47.028Z
Learning: Applies to src/arc_bids/arc.py : Force sharding by passing `num_shards=len(file_table)` to `push_dataset_to_hub` in `src/arc_bids/arc.py` inside `build_and_push_arc` to prevent Out-Of-Memory crashes when loading 278GB of NIfTI data
📚 Learning: 2025-12-03T00:02:34.352Z
Learnt from: CR
Repo: The-Obstacle-Is-The-Way/arc-aphasia-bids PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-03T00:02:34.352Z
Learning: Applies to {src,}/**/arc.py : For BIDS to HuggingFace conversion, use `build_arc_file_table()` to create a pandas DataFrame with one row per SESSION (not per subject), then pass to `build_hf_dataset()` to create a datasets.Dataset with Nifti() features, finally push to Hub with explicit `num_shards=len(file_table)`

Applied to files:

  • src/arc_bids/core.py
🔇 Additional comments (3)
src/arc_bids/core.py (3)

173-176: Docstring now correctly documents the remaining datasets#7894 workaround

The updated docstring accurately describes the sharded-upload behavior and clearly ties the custom path to the validated datasets#7894 SIGKILL issue, with a direct link to the upstream tracker. This removes confusion around the earlier OOM-focused description and aligns the function’s documented purpose with its actual implementation.


224-229: Workaround comment is precise and well-scoped to datasets#7894

The new comment block clearly explains:

  • That the issue is triggered by `ds.shard()` plus `Sequence(Nifti())` columns.
  • Why the pandas round-trip (`to_pandas` → `from_pandas` → `cast`) breaks the problematic slice references.
  • When it’s safe to remove the workaround (after upstream PR #7896 lands).

This is a good balance of detail and future-maintenance guidance without affecting runtime behavior.


266-267: Cleanup comment matches removal of the unnecessary gc workaround

Replacing the explicit gc.collect() call with a clearer comment about deleting shard to allow normal garbage collection keeps the intent obvious while dropping the now-known-to-be-invalid #7893-related workaround. Given the rest of the loop already aggressively releases intermediates and streams via temporary files, this change is reasonable and keeps the function focused on the one valid workaround (datasets#7894).
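
The "delete and let normal garbage collection run" pattern relies on CPython's reference counting: dropping the last strong reference reclaims the object immediately, with no explicit `gc.collect()` needed. A tiny stand-in demonstrates this (the `Shard` class is illustrative, not the real dataset shard):

```python
import weakref


class Shard:
    """Stand-in for a per-shard object held inside an upload loop."""


shard = Shard()
alive = weakref.ref(shard)  # weak reference: does not keep the object alive

del shard  # last strong reference dropped; CPython frees it right away

print(alive() is None)  # True
```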



@The-Obstacle-Is-The-Way The-Obstacle-Is-The-Way merged commit 5ecbc8a into main Dec 6, 2025
1 check passed
@The-Obstacle-Is-The-Way The-Obstacle-Is-The-Way deleted the fix/remove-invalid-7893-workaround branch December 6, 2025 01:58
