Skip to content

Metal: protect tensor alloc/free byte counters with a mutex#420

Open
aledesogusbusiness-hue wants to merge 1 commit into
antirez:mainfrom
aledesogusbusiness-hue:fix/metal-tensor-counter-race
Open

Metal: protect tensor alloc/free byte counters with a mutex#420
aledesogusbusiness-hue wants to merge 1 commit into
antirez:mainfrom
aledesogusbusiness-hue:fix/metal-tensor-counter-race

Conversation

@aledesogusbusiness-hue

Copy link
Copy Markdown

Problem

g_tensor_alloc_live_bytes and g_tensor_alloc_peak_bytes in ds4_metal.m are plain uint64_t globals updated by ds4_gpu_tensor_alloc() and ds4_gpu_tensor_free() without any synchronisation.

When multiple worker threads concurrently allocate or free Metal tensors (concurrent session cleanup racing with new task startup, or back-to-back requests on the same server), the unguarded read-modify-write is a data race. On ARM64 the 64-bit load/store is not guaranteed atomic, and the torn write can corrupt the counter or — when realloc is called shortly after — trigger the malloc freelist corruption that shows up as EXC_BREAKPOINT / BUG IN CLIENT OF LIBMALLOC in ds4_gpu_tensor_free.

Fix

Add g_tensor_alloc_mu (PTHREAD_MUTEX_INITIALIZER) and hold it around every update to the two counters and around the diagnostic snapshot read. The fprintf is kept outside the lock (it uses a local snapshot) so no I/O is ever done under the mutex.

Test plan

  • No observable behaviour change under single-threaded usage.
  • Under concurrent load the counters and the trace log remain consistent and the malloc corruption no longer occurs.

Fixes #404

🤖 Generated with Claude Code

g_tensor_alloc_live_bytes and g_tensor_alloc_peak_bytes were plain
uint64_t globals updated by ds4_gpu_tensor_alloc() and ds4_gpu_tensor_free()
without any synchronisation.  When multiple worker threads concurrently
allocate or free Metal tensors (e.g. during concurrent session cleanup
and new task startup) the unguarded read-modify-write can corrupt the
counters and, on ARM64, the non-atomic 64-bit access can race.

Add g_tensor_alloc_mu (PTHREAD_MUTEX_INITIALIZER) and take it around
every update and the diagnostic snapshot read.  The fprintf stays
outside the lock; it uses a local snapshot so the lock is never held
across an I/O call.

Fixes: antirez#404

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ds4-server crash: BUG IN CLIENT OF LIBMALLOC — memory corruption of free block in Metal tensor free path (EXC_BREAKPOINT SIGTRAP)

1 participant