Date: June 24, 2026 03:42 UTC
Test: RetinaNet Training - 4 Client, CLOSED Division
Summary
The MLPerf Storage RetinaNet training benchmark (CLOSED division) failed during the warmup phase with RuntimeError: dispatch failure, raised from the s3dlio read path. The run passed environment validation and completed file listing (20m 22s), then crashed within a second of the first data read in the warmup iteration.
Test Configuration (Redacted)
Parameter     Value
Benchmark     mlpstorage 3.0.16
Division      CLOSED
Workload      RetinaNet Training
Model         RetinaNet B200
Clients       4 nodes (HOST1, HOST2, HOST3, HOST4)
Processes     16 (4 per client)
Accelerators  16 B200
Memory/Host   754 GB
Dataset       50,203,282 files (full dataset)
Epochs        8
Batch Size    24
Storage Configuration:
Library: s3dlio
Endpoints: 4 S3 endpoints (REDACTED)
Region: us-east-1
Load Balancing: round_robin
Concurrency: 64 in-flight per rank
Workers: 8 per rank
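To put the storage settings in context, the per-rank figures above can be multiplied out to an aggregate load. This is a back-of-the-envelope sketch only: it assumes the 64 in-flight limit applies per rank as listed, and that round_robin spreads requests evenly across the four endpoints. Neither assumption has been confirmed against s3dlio internals.

```python
# Figures taken from the configuration listed above.
ranks = 16
in_flight_per_rank = 64
workers_per_rank = 8
endpoints = 4

# Upper bound on concurrent requests across the whole job.
total_in_flight = ranks * in_flight_per_rank
# DataLoader worker processes across the whole job.
total_workers = ranks * workers_per_rank
# Assumption: round_robin balances requests evenly across endpoints.
in_flight_per_endpoint = total_in_flight // endpoints

print(total_in_flight, total_workers, in_flight_per_endpoint)  # 1024 128 256
```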
❌ Failure Details
Error Message:
RuntimeError: dispatch failure
Error Location:
File: dlio_benchmark/reader/_s3_iterable_mixin.py, line 424
Function: _s3_stream_s3dlio()
Code: while batch := item_iter.collect_batch(collect_n):
Full Stack Trace:
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "torch/utils/data/_utils/worker.py", line 374, in _worker_loop
    data = fetcher.fetch(index)
  File "torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = next(self.dataset_iter)
  File "dlio_benchmark/data_loader/torch_data_loader.py", line 403, in __iter__
    for _batch in self.reader.next():
  File "dlio_benchmark/reader/image_reader_s3_iterable.py", line 83, in next
    yield from self._s3_stream_next()
  File "dlio_benchmark/reader/_s3_iterable_mixin.py", line 484, in _s3_stream_next
    yield from self._s3_stream_s3dlio(obj_keys)
  File "dlio_benchmark/reader/_s3_iterable_mixin.py", line 424, in _s3_stream_s3dlio
    while batch := item_iter.collect_batch(collect_n):
RuntimeError: dispatch failure
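The trace shows only the bare message, with no indication of which worker, batch, or object key was involved. One way to make a recurrence easier to triage is to wrap the collect_batch call and re-raise with context. The sketch below is illustrative and is not the dlio_benchmark or s3dlio implementation: FakeItemIter is a hypothetical stand-in for the real iterator, and stream_with_context, worker_id, and first_key are names invented here. Only the collect_batch(collect_n) call shape comes from the trace.

```python
class FakeItemIter:
    """Hypothetical stand-in for the s3dlio iterator; not the real API."""

    def __init__(self, batches, fail_at=None):
        self._batches = list(batches)
        self._calls = 0
        self._fail_at = fail_at  # 1-based call number that should fail

    def collect_batch(self, n):
        self._calls += 1
        if self._fail_at == self._calls:
            raise RuntimeError("dispatch failure")
        # An empty list signals exhaustion, which ends the caller's loop.
        return self._batches.pop(0)[:n] if self._batches else []


def stream_with_context(item_iter, collect_n, worker_id, first_key):
    """Yield batches; on failure, re-raise with worker and key context."""
    batch_index = 0
    while True:
        try:
            batch = item_iter.collect_batch(collect_n)
        except RuntimeError as exc:
            raise RuntimeError(
                f"worker={worker_id} batch={batch_index} "
                f"first_key={first_key!r}: {exc}"
            ) from exc
        if not batch:
            return
        yield batch
        batch_index += 1


if __name__ == "__main__":
    failing = FakeItemIter([[b"a"], [b"b"]], fail_at=1)
    try:
        list(stream_with_context(failing, 2, worker_id=0, first_key="obj-0"))
    except RuntimeError as err:
        print(err)
```

Because the original exception is chained with `from exc`, the underlying traceback is preserved alongside the added context.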
⏱️ Timeline
Phase                   Time               Duration  Status
Start                   03:42:17           -         ✅
Environment Validation  03:42:17-03:42:21  4s        ✅
File Listing            03:42:25-04:02:47  20m 22s   ✅
Reshard for Epoch 1     04:02:48-04:02:50  2.29s     ✅
Warmup Iteration        04:02:50+          <1s       ❌ FAILED
Total Runtime           03:42:17-04:03:06  20m 49s   ❌
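The durations in the timeline can be cross-checked from the timestamps, which also gives an approximate file listing rate. This is a small sketch using only the figures reported above. The helper name elapsed_seconds is introduced here for illustration, and it assumes both timestamps fall on the same day.

```python
from datetime import datetime


def elapsed_seconds(start: str, end: str) -> int:
    """Seconds between two HH:MM:SS timestamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


listing_s = elapsed_seconds("03:42:25", "04:02:47")  # 1222 s = 20m 22s
total_s = elapsed_seconds("03:42:17", "04:03:06")    # 1249 s = 20m 49s

files = 50_203_282
listing_rate = files / listing_s  # roughly 41,000 files listed per second

print(listing_s, total_s, round(listing_rate))
```

File listing accounts for 1222 of the 1249 seconds, so nearly all of the runtime was spent before the first read was attempted.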
✅ What Worked
✅ Environment validation passed
✅ CLOSED division qualification passed
✅ MPI collection across 4 hosts successful
✅ File listing completed (50.2M files in 20m 22s)
✅ File sharding completed (3,137,705 files per rank)
✅ Epoch reshard completed (2.29s)
✅ DataLoader initialization successful
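The shard sizes reported above are consistent with a simple even split. The following sketch reproduces the arithmetic. It assumes floor division across ranks and that remainder files go to the lowest-numbered workers. That matches the reported counts, but it has not been confirmed as the actual dlio_benchmark sharding logic, and split_evenly is a helper name introduced here.

```python
def split_evenly(total: int, parts: int) -> list[int]:
    """Split total into parts; the first (total % parts) parts get one extra."""
    base, extra = divmod(total, parts)
    return [base + 1 if i < extra else base for i in range(parts)]


total_files = 50_203_282
ranks = 16
workers_per_rank = 8

# Floor division leaves 2 files over at the rank level (assumption: unassigned).
files_per_rank = total_files // ranks
per_worker = split_evenly(files_per_rank, workers_per_rank)

print(files_per_rank, per_worker[0], per_worker[1])  # 3137705 392214 392213
```

Worker 0 receiving 392,214 files while the other workers receive 392,213 is therefore expected and does not indicate a sharding problem.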
Log Output:
[ITER_SIMPLE] worker=2 reader=ImageReaderS3Iterable files_this_worker=392213 total_files=3137705
[ITER_SIMPLE] worker=6 reader=ImageReaderS3Iterable files_this_worker=392213 total_files=3137705
[ITER_SIMPLE] worker=5 reader=ImageReaderS3Iterable files_this_worker=392213 total_files=3137705
[ITER_SIMPLE] worker=3 reader=ImageReaderS3Iterable files_this_worker=392213 total_files=3137705
[ITER_SIMPLE] worker=1 reader=ImageReaderS3Iterable files_this_worker=392213 total_files=3137705
[ITER_SIMPLE] worker=0 reader=ImageReaderS3Iterable files_this_worker=392214 total_files=3137705
[ITER_SIMPLE] worker=7 reader=ImageReaderS3Iterable files_this_worker=392213 total_files=3137705
[ITER_SIMPLE] worker=7 reader=ImageReaderS3Iterable files_this_worker=392213 total_files=3137705
Error executing job with overrides: ['workload=retinanet_b200', '++workload.storage.storage_type=s3', '++workload.storage.storage_root=retinanet3', '++workload.dataset.skip_listing=true', '++workload.dataset.listing_validation_interval=10000', '++workload.dataset.num_files_train=50203282', '++workload.storage.storage_options.storage_library=s3dlio', '++workload.storage.storage_options.uri_scheme=s3', '++workload.storage.s3_force_path_style=true', '++workload.dataset.data_folder=retinanet_64p/retinanet']
Traceback (most recent call last):
File "/root/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 781, in run_benchmark
benchmark.run()
File "/root/storage/.venv/lib/python3.12/site-packages/dftracer/python/ai_common.py", line 170, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/root/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 666, in run
next(warmup_iter)
File "/root/storage/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 718, in __next__
data = self._next_data()
^^^^^^^^^^^^^^^^^
File "/root/storage/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1525, in _next_data
return self._process_data(data, worker_id)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/storage/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1563, in _process_data
data.reraise()
File "/root/storage/.venv/lib/python3.12/site-packages/torch/_utils.py", line 774, in reraise
raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/root/storage/.venv/lib/python3.12/site-packages/torch/utils/data/_utils/worker.py", line 374, in _worker_loop
data = fetcher.fetch(index) # type: ignore[possibly-undefined]
^^^^^^^^^^^^^^^^^^^^
File "/root/storage/.venv/lib/python3.12/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = next(self.dataset_iter)
^^^^^^^^^^^^^^^^^^^^^^^
File "/root/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/data_loader/torch_data_loader.py", line 403, in __iter__
for _batch in self.reader.next():
^^^^^^^^^^^^^^^^^^
File "/root/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/reader/image_reader_s3_iterable.py", line 83, in next
yield from self._s3_stream_next()
File "/root/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/reader/_s3_iterable_mixin.py", line 484, in _s3_stream_next
yield from self._s3_stream_s3dlio(obj_keys)
File "/root/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/reader/_s3_iterable_mixin.py", line 424, in _s3_stream_s3dlio
while batch := item_iter.collect_batch(collect_n):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: dispatch failure
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[ITER_SIMPLE] worker=4 reader=ImageReaderS3Iterable files_this_worker=392213 total_files=3137705
[ITER_SIMPLE] worker=7 reader=ImageReaderS3Iterable files_this_worker=392213 total_files=3137705
[ITER_SIMPLE] worker=7 reader=ImageReaderS3Iterable files_this_worker=392213 total_files=3137705
[ITER_SIMPLE] worker=7 reader=ImageReaderS3Iterable files_this_worker=392213 total_files=3137705
[ITER_SIMPLE] worker=7 reader=ImageReaderS3Iterable files_this_worker=392213 total_files=3137705
[ITER_SIMPLE] worker=7 reader=ImageReaderS3Iterable files_this_worker=392213 total_files=3137705
[ITER_SIMPLE] worker=7 reader=ImageReaderS3Iterable files_this_worker=392213 total_files=3137705
[ITER_SIMPLE] worker=7 reader=ImageReaderS3Iterable files_this_worker=392213 total_files=3137705
[ITER_SIMPLE] worker=7 reader=ImageReaderS3Iterable files_this_worker=392213 total_files=3137705
[ITER_SIMPLE] worker=7 reader=ImageReaderS3Iterable files_this_worker=392213 total_files=3137705
Error executing job with overrides: ['workload=retinanet_b200', '++workload.storage.storage_type=s3', '++workload.storage.storage_root=retinanet3', '++workload.dataset.skip_listing=true', '++workload.dataset.listing_validation_interval=10000', '++workload.dataset.num_files_train=50203282', '++workload.storage.storage_options.storage_library=s3dlio', '++workload.storage.storage_options.uri_scheme=s3', '++workload.storage.s3_force_path_style=true', '++workload.dataset.data_folder=retinanet_64p/retinanet']
Traceback (most recent call last):
File "/root/storage/.venv/lib/python3.12/site-packages/dlio_benchmark/main.py", line 781, in run_benchmark
benchmark.run()
File "/root/storage/.venv/lib/python3.12/site-packages/dftracer/python/ai_common.py", line 170, in wrapper
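The log interleaves output from several ranks, so the same worker id appears many times and it is hard to tell by eye how many workers actually started. A short parser can count the [ITER_SIMPLE] lines per worker id. This is a sketch based only on the line format visible in this report. The name summarize_workers is introduced here, and the reading that each rank prints one line per worker is an assumption.

```python
import re
from collections import Counter

# Pattern derived from the [ITER_SIMPLE] lines shown in this report.
ITER_LINE = re.compile(
    r"\[ITER_SIMPLE\] worker=(\d+) reader=(\S+) "
    r"files_this_worker=(\d+) total_files=(\d+)"
)


def summarize_workers(log_text: str) -> Counter:
    """Count how many [ITER_SIMPLE] lines were logged for each worker id."""
    counts = Counter()
    for match in ITER_LINE.finditer(log_text):
        counts[int(match.group(1))] += 1
    return counts


sample = """\
[ITER_SIMPLE] worker=0 reader=ImageReaderS3Iterable files_this_worker=392214 total_files=3137705
[ITER_SIMPLE] worker=7 reader=ImageReaderS3Iterable files_this_worker=392213 total_files=3137705
[ITER_SIMPLE] worker=7 reader=ImageReaderS3Iterable files_this_worker=392213 total_files=3137705
RuntimeError: dispatch failure
"""

for worker_id, seen in sorted(summarize_workers(sample).items()):
    print(f"worker={worker_id} lines={seen}")
```

Running this over the complete log from each host would show whether every rank reached DataLoader initialization before the failure.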