-
Notifications
You must be signed in to change notification settings - Fork 0
refactor: remove invalid #7893 workaround, clarify #7894 workaround #13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor: remove invalid #7893 workaround, clarify #7894 workaround #13
Conversation
- Remove `import gc` and `gc.collect()` calls (was for invalid #7893) - Update docstring to reference actual bug #7894 instead of OOM - Clarify workaround comment with link to upstream issue and PR The gc.collect() was added believing we had memory accumulation (#7893). Investigation proved #7893 was invalid - free_memory=True handles it. The actual bug is #7894 (SIGKILL on embed_table_storage with Sequence()). The pandas round-trip workaround remains until upstream PR #7896 is merged. See: huggingface/datasets#7894
WalkthroughDocumentation and comments in the Changes
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~3 minutes
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (1)
🧰 Additional context used📓 Path-based instructions (3)**/*.py📄 CodeRabbit inference engine (CLAUDE.md)
Files:
{src,}/**/core.py📄 CodeRabbit inference engine (CLAUDE.md)
Files:
src/**/*.py📄 CodeRabbit inference engine (GEMINI.md)
Files:
🧠 Learnings (6)📓 Common learnings📚 Learning: 2025-12-03T00:02:47.028ZApplied to files:
📚 Learning: 2025-12-03T00:02:34.352ZApplied to files:
📚 Learning: 2025-12-03T00:02:34.352ZApplied to files:
📚 Learning: 2025-12-03T00:02:34.352ZApplied to files:
📚 Learning: 2025-12-03T00:02:34.352ZApplied to files:
🔇 Additional comments (3)
Comment |
Summary
import gcandgc.collect()calls that were added for invalid issue #7893Background
We filed two bug reports against
huggingface/datasets:free_memory=Truedefault prevents memory accumulation.embed_table_storagewithSequence()nested types afterds.shard().Our codebase had workarounds for both. This PR removes the unnecessary #7893 workaround while keeping the essential #7894 workaround.
Changes
import gcgc.collect()callcombine_chunks()callTest Plan
uv run pytest- 45 tests passuv run ruff check .- All checks passuv run mypy src tests- No issuesReferences
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.