
Conversation

@The-Obstacle-Is-The-Way

Summary

Fixes SIGKILL crash in embed_array_storage when processing sliced/sharded datasets with nested types like Sequence(Nifti()) or Sequence(Image()).

Root cause: When ds.shard() or ds.select() creates a sliced view, array.values on a sliced ListArray returns values with internal offset references. For nested types, PyArrow's C++ layer can crash (SIGKILL, exit code 137) when materializing these sliced nested structs.

Fix: Force a contiguous copy via pa.concat_arrays([array]) when the array has a non-zero offset before processing list/large_list arrays.

Changes

  • Add offset check in embed_array_storage for list/large_list arrays
  • Force contiguous copy when array.offset > 0 to break internal references
  • Add regression tests for sliced arrays with Image, Nifti, and LargeList types

Test plan

  • Added tests/features/test_embed_storage_sliced.py with 3 tests:
    • test_embed_array_storage_sliced_list_image
    • test_embed_array_storage_sliced_list_nifti
    • test_embed_array_storage_sliced_large_list
  • All tests verify embedded.offset == 0 (contiguous result)
  • All tests pass locally
  • ruff check passes

Context

This was discovered while uploading a 270GB neuroimaging dataset (ARC) with Sequence(Nifti()) columns. The process crashed with SIGKILL (no Python traceback) when embed_table_storage was called on sharded data.

Workaround that confirmed the diagnosis: a pandas round-trip (shard.to_pandas() → Dataset.from_pandas()), which forces a contiguous copy.

Fixes #7894
Related: #6686, #7852, #6790

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

When ds.shard() or ds.select() creates a sliced view, array.values returns
values with internal offset references that can cause PyArrow's C++ layer
to crash with SIGKILL when processing nested types like Sequence(Nifti()).

The fix forces a contiguous copy via pa.concat_arrays([array]) when the
array has a non-zero offset, breaking the internal references before
further processing.

Fixes: #6

1. Remove fork-specific issue URL placeholders (upstream-ready)
2. Add consistency assertions to LargeList test:
   - offset == 0 check
   - content verification (bytes embedded)
3. Add offset == 0 check to Nifti test for consistency


Development

Successfully merging this pull request may close these issues.

embed_table_storage crashes (SIGKILL) on sharded datasets with Sequence() nested types
