
Conversation

@The-Obstacle-Is-The-Way The-Obstacle-Is-The-Way commented Dec 6, 2025

Summary

  • Remove `import gc` and the `gc.collect()` calls that were added for the invalid issue #7893
  • Update the docstring to reference the actual bug #7894 instead of OOM
  • Clarify the workaround comment with a link to the upstream issue and PR #7896

Background

We filed two bug reports against huggingface/datasets:

  • #7893 (OOM) - Investigation proved this was INVALID. We closed it. The `free_memory=True` default prevents memory accumulation.
  • #7894 (SIGKILL) - This is VALID. The crash occurs in `embed_table_storage` with `Sequence()` nested types after `ds.shard()`.

Our codebase had workarounds for both. This PR removes the unnecessary #7893 workaround while keeping the essential #7894 workaround.

Changes

| Removed (invalid #7893) | Kept (valid #7894) |
| --- | --- |
| `import gc` | Pandas round-trip workaround |
| `gc.collect()` call | `combine_chunks()` call |
| OOM references in docstring | Link to upstream #7894 |

Test Plan

  • `uv run pytest` - 45 tests pass
  • `uv run ruff check .` - all checks pass
  • `uv run mypy src tests` - no issues
  • Pre-commit hooks pass

References

Summary by CodeRabbit

  • Chores
    • Updated internal documentation for dataset processing.
    • Removed unused code dependencies to improve maintainability.


- Remove `import gc` and `gc.collect()` calls (was for invalid #7893)
- Update docstring to reference actual bug #7894 instead of OOM
- Clarify workaround comment with link to upstream issue and PR

The `gc.collect()` call was added in the belief that we had memory accumulation (#7893).
Investigation proved #7893 was invalid: the `free_memory=True` default handles it.
The actual bug is #7894 (SIGKILL in `embed_table_storage` with `Sequence()` types).

The pandas round-trip workaround remains until upstream PR #7896 is merged.
See: huggingface/datasets#7894

coderabbitai bot commented Dec 6, 2025

Walkthrough

Documentation and comments in the push_dataset_to_hub function were updated to clarify a workaround for an upstream huggingface/datasets issue. The module docstring was expanded, "CRITICAL FIX" commentary was replaced with detailed explanation, and an unused gc import was removed. No functional behavior changed.

Changes

Cohort / File(s) Summary
Documentation and import cleanup
src/arc_bids/core.py
Expanded module docstring for push_dataset_to_hub to reference huggingface/datasets upstream issue #7894; replaced "CRITICAL FIX" commentary with "WORKAROUND" explanation; updated post-shard cleanup comments; removed unused gc import and explicit gc.collect() call

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

  • Documentation updates are straightforward; verify the referenced upstream issue number and link are accurate
  • Confirm the removal of gc.collect() has no unintended memory management implications in practice


Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and accurately summarizes the main changes: removing the invalid #7893 workaround and clarifying the #7894 workaround, which aligns with the PR objectives and file modifications. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, above the required 80.00% threshold. |

📜 Recent review details


📥 Commits

Reviewing files that changed from the base of the PR and between f6e7e80 and 6fb23ad.

📒 Files selected for processing (1)
  • src/arc_bids/core.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Use absolute file paths when calling embed_external_files=True in datasets operations
Always explicitly specify num_shards=len(file_table) when pushing datasets to HuggingFace Hub to avoid out-of-memory crashes from incorrect automatic shard estimation

Files:

  • src/arc_bids/core.py
{src,}/**/core.py

📄 CodeRabbit inference engine (CLAUDE.md)

In core.py, implement generic BIDS→HuggingFace conversion logic that is reusable for any dataset, not ARC-specific

Files:

  • src/arc_bids/core.py
src/**/*.py

📄 CodeRabbit inference engine (GEMINI.md)

src/**/*.py: Maintain strict mypy type checking compliance and use from __future__ import annotations in all Python files
Use ruff as the authority for code formatting and linting across the codebase

Files:

  • src/arc_bids/core.py
🧠 Learnings
📓 Common learnings
Learnt from: CR
Repo: The-Obstacle-Is-The-Way/arc-aphasia-bids PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-03T00:02:34.352Z
Learning: Applies to **/test_*.py : In tests, mock `push_dataset_to_hub` calls for dry-run tests to avoid actual Hub uploads
Learnt from: CR
Repo: The-Obstacle-Is-The-Way/arc-aphasia-bids PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-03T00:02:34.352Z
Learning: Applies to **/*.py : Always explicitly specify `num_shards=len(file_table)` when pushing datasets to HuggingFace Hub to avoid out-of-memory crashes from incorrect automatic shard estimation
Learnt from: CR
Repo: The-Obstacle-Is-The-Way/arc-aphasia-bids PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-03T00:02:34.352Z
Learning: Applies to pyproject.toml : Use git-based `datasets` dependency from PyPI main (not stable) to avoid silent empty file creation bug in `Nifti.embed_storage`
Learnt from: CR
Repo: The-Obstacle-Is-The-Way/arc-aphasia-bids PR: 0
File: GEMINI.md:0-0
Timestamp: 2025-12-03T00:02:47.028Z
Learning: Applies to src/arc_bids/arc.py : Force sharding by passing `num_shards=len(file_table)` to `push_dataset_to_hub` in `src/arc_bids/arc.py` inside `build_and_push_arc` to prevent Out-Of-Memory crashes when loading 278GB of NIfTI data
📚 Learning: 2025-12-03T00:02:34.352Z
Learnt from: CR
Repo: The-Obstacle-Is-The-Way/arc-aphasia-bids PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-03T00:02:34.352Z
Learning: Applies to {src,}/**/arc.py : For BIDS to HuggingFace conversion, use `build_arc_file_table()` to create a pandas DataFrame with one row per SESSION (not per subject), then pass to `build_hf_dataset()` to create a datasets.Dataset with Nifti() features, finally push to Hub with explicit `num_shards=len(file_table)`

Applied to files:

  • src/arc_bids/core.py
🔇 Additional comments (3)
src/arc_bids/core.py (3)

173-176: Docstring now correctly documents the remaining datasets#7894 workaround

The updated docstring accurately describes the sharded-upload behavior and clearly ties the custom path to the validated datasets#7894 SIGKILL issue, with a direct link to the upstream tracker. This removes confusion around the earlier OOM-focused description and aligns the function’s documented purpose with its actual implementation.


224-229: Workaround comment is precise and well-scoped to datasets#7894

The new comment block clearly explains:

  • That the issue is triggered by `ds.shard()` plus `Sequence(Nifti())` columns.
  • Why the pandas round-trip (`to_pandas` → `from_pandas` → `cast`) breaks the problematic slice references.
  • When it’s safe to remove the workaround (after upstream PR #7896 lands).

This is a good balance of detail and future-maintenance guidance without affecting runtime behavior.


266-267: Cleanup comment matches removal of the unnecessary gc workaround

Replacing the explicit gc.collect() call with a clearer comment about deleting shard to allow normal garbage collection keeps the intent obvious while dropping the now-known-to-be-invalid #7893-related workaround. Given the rest of the loop already aggressively releases intermediates and streams via temporary files, this change is reasonable and keeps the function focused on the one valid workaround (datasets#7894).
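
The "delete and let normal garbage collection run" pattern relies on CPython's reference counting: dropping the last strong reference reclaims the object immediately, with no explicit `gc.collect()` needed. A tiny stand-in demonstrates this (the `Shard` class is illustrative, not the real dataset shard):

```python
import weakref


class Shard:
    """Stand-in for a per-shard object held inside an upload loop."""


shard = Shard()
alive = weakref.ref(shard)  # weak reference: does not keep the object alive

del shard  # last strong reference dropped; CPython frees it right away

print(alive() is None)  # True
```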



@The-Obstacle-Is-The-Way The-Obstacle-Is-The-Way merged commit 5ecbc8a into main Dec 6, 2025
1 check passed
@The-Obstacle-Is-The-Way The-Obstacle-Is-The-Way deleted the fix/remove-invalid-7893-workaround branch December 6, 2025 01:58
