Bug
get_device_context() builds a new torch.tensor from self.heap_bases.tolist() on every call (see #466). Once #466 is fixed by precomputing the tensor in __init__, the context tensor will hold a snapshot of heap_bases at construction time.
If heap_bases were to change after init (e.g., via refresh_peer_access() after a new shmem.allocate() or as_symmetric() call with a future allocator), the precomputed context tensor would contain stale base addresses. Kernels using DeviceContext would translate pointers using wrong bases, causing silent data corruption or hangs.
Today this is not a bug — both the torch and vmem allocators produce stable heap_bases after the first refresh_peer_access(). But it will become one if an allocator ever remaps peer VA ranges.
Fix
After precomputing self._device_context in __init__, add an in-place update in refresh_peer_access():
self._device_context[2:2+self.num_ranks] = self.heap_bases
No allocation, CUDAGraph safe, one line.
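The fix above can be sketched as follows. This is a minimal stand-in class, not the real Iris implementation; the layout assumption (slot 0 = current rank, slot 1 = rank count, bases at offsets [2, 2 + num_ranks)) is hypothetical and only mirrors the slice used in the one-line fix:

```python
import torch

class HeapSketch:
    """Hypothetical sketch of the proposed fix: build the context tensor
    once in __init__, then patch it in place in refresh_peer_access()."""

    def __init__(self, cur_rank: int, num_ranks: int, heap_bases: torch.Tensor):
        self.cur_rank = cur_rank
        self.num_ranks = num_ranks
        self.heap_bases = heap_bases  # int64 tensor of peer heap base addresses
        # Precompute once (the #466 fix). Assumed layout:
        # [cur_rank, num_ranks, base_0, ..., base_{num_ranks-1}]
        self._device_context = torch.empty(2 + num_ranks, dtype=torch.int64)
        self._device_context[0] = cur_rank
        self._device_context[1] = num_ranks
        self._device_context[2:2 + num_ranks] = heap_bases

    def refresh_peer_access(self):
        # The proposed one-liner: an in-place slice copy. No new tensor is
        # allocated, so the storage address a captured CUDA graph (or any
        # kernel already holding the tensor) sees stays valid.
        self._device_context[2:2 + self.num_ranks] = self.heap_bases

    def get_device_context(self) -> torch.Tensor:
        return self._device_context
```

Because the update is in place, a caller that fetched the context tensor before the refresh observes the new bases through the same storage, which is what makes the fix CUDAGraph safe.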
Component
iris/iris.py, iris/symmetric_heap.py