fix: force contiguous copy for sliced list arrays in embed_array_storage #7896
Summary
Fixes SIGKILL crash in `embed_array_storage` when processing sliced/sharded datasets with nested types like `Sequence(Nifti())` or `Sequence(Image())`.

Root cause: When `ds.shard()` or `ds.select()` creates a sliced view, `array.values` on a sliced `ListArray` returns values with internal offset references. For nested types, PyArrow's C++ layer can crash (SIGKILL, exit code 137) when materializing these sliced nested structs.

Fix: Force a contiguous copy via `pa.concat_arrays([array])` when the array has a non-zero offset, before processing list/large_list arrays.
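A minimal sketch of the guard (not the exact diff; the trailing comment stands in for the existing list-handling code in `embed_array_storage`):

```python
import pyarrow as pa

def force_contiguous(array: pa.Array) -> pa.Array:
    # Sliced views created by ds.shard()/ds.select() keep a non-zero
    # offset into the parent buffers; concatenating the single chunk
    # materializes a fresh, zero-offset copy.
    if array.offset > 0:
        return pa.concat_arrays([array])
    return array

# Inside the list/large_list branch of embed_array_storage (sketch):
#     array = force_contiguous(array)
#     values = array.values  # now safe: no internal offset references
```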
Changes

- Updated `embed_array_storage` for list/large_list arrays
- Force a contiguous copy when `array.offset > 0` to break internal references

Test plan
Added `tests/features/test_embed_storage_sliced.py` with 3 tests:

- `test_embed_array_storage_sliced_list_image`
- `test_embed_array_storage_sliced_list_nifti`
- `test_embed_array_storage_sliced_large_list`

Each test asserts `embedded.offset == 0` (contiguous result).
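Not the exact test code, but a self-contained illustration of the property the tests assert, written against plain PyArrow (the function name is illustrative):

```python
import pyarrow as pa

def test_sliced_large_list_becomes_contiguous():
    arr = pa.array(
        [[1, 2], [3, 4], [5, 6], [7, 8]],
        type=pa.large_list(pa.int64()),
    )
    # Mimic ds.shard()/ds.select(): the slice shares the parent's buffers.
    sliced = arr.slice(2)
    assert sliced.offset == 2

    # The fix's contiguous copy drops the internal offset.
    embedded = pa.concat_arrays([sliced])
    assert embedded.offset == 0
    assert embedded.to_pylist() == [[5, 6], [7, 8]]
```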
Context

This was discovered while uploading a 270GB neuroimaging dataset (ARC) with `Sequence(Nifti())` columns. The process crashed with SIGKILL (no Python traceback) when `embed_table_storage` was called on sharded data.

Workaround that confirmed the fix: a pandas round-trip (`shard.to_pandas()` → `Dataset.from_pandas()`), which forces a contiguous copy.
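A sketch of that workaround, assuming `shard` is a `datasets.Dataset` (it confirmed the diagnosis; it is not the merged fix):

```python
from datasets import Dataset

def contiguous_roundtrip(shard: Dataset) -> Dataset:
    # to_pandas() materializes plain Python/NumPy objects; rebuilding
    # the Dataset re-serializes them into fresh, zero-offset Arrow
    # buffers. Passing features keeps nested types like
    # Sequence(Nifti()) intact.
    return Dataset.from_pandas(
        shard.to_pandas(),
        features=shard.features,
        preserve_index=False,
    )
```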
Fixes #7894

Related: #6686, #7852, #6790