Skip to content

Skip CWeapon piece-transform reads when generation matches #21

Open
bruno-dasilva wants to merge 10 commits into
masterfrom
bruno/try-optimize-weapons-2
Open

Skip CWeapon piece-transform reads when generation matches #21
bruno-dasilva wants to merge 10 commits into
masterfrom
bruno/try-optimize-weapons-2

Conversation

@bruno-dasilva

Copy link
Copy Markdown
Owner

No description provided.

lhog and others added 10 commits April 22, 2026 05:30
* Optimize LocalModelPiece. Move some cold data to pointers
* Apply optimized loop over dirty pieces (only recalculate relative transforms where changed)
UpdateModelSpaceTransform was writing a full 64-byte CMatrix44f per piece per
tick to a field only consumed by render/collision/Lua read paths. At ~10k
mostly-moving units this was ~6.4 MB/frame of pure store traffic on the sim
hot path, contributing materially to L3 misses.

modelSpaceMat is fully derivable from modelSpaceTra. Track a matDirty flag
and refill the matrix lazily inside GetModelSpaceMatrix() the first time it
is read after modelSpaceTra changes. All four mutation sites
(UpdateModelSpaceTransform x2, UpdateChildTransformRec, UpdateParentMatricesRec)
now flip matDirty instead of computing the matrix.

Public API is unchanged; all external consumers already go through the
getter. creg keeps modelSpaceMat as CR_MEMBER for save-format compatibility,
and PostLoad sets matDirty=true so the first read after load recomputes
from the (also-serialised) modelSpaceTra. Sync semantics are preserved -
modelSpaceMat is a derived cache, never ground-truth state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the deque-driven BFS with a linear iteration over the contiguous
LocalModel::pieces vector. Pieces are emplace_back'd in pre-order DFS by
LocalModel::CreateLocalModelPieces, so a parent's modelSpaceTra is always
up-to-date by the time its child reads it via the pointer overload of
UpdateModelSpaceTransform - no need to thread the parent transform through
a deque.

Wins:
- HW prefetcher streams pieces[] (was: deque-pop killed prefetch).
- Drops the 40-byte (LocalModelPiece*, Transform) deque element copy per
  visit and the thread_local std::deque allocator traffic.
- children[] is no longer touched in the hot path, freeing a cacheline of
  pressure per piece.

Lock the parent-index < child-index invariant in LocalModelPiece::AddChild
with an assert so future construction-time changes can't silently break the
ordering the linear sweep depends on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reshuffle declaration order so the fields touched by the per-tick animation
sweep pack into the first two cachelines (offsets 0-128). Cold fields
(modelSpaceMat lazily-filled per the prior commit, prevModelSpaceTra, dir,
script/Lua-piece indices, localModel, colvol, children, lodDispLists) move
past offset 128.

Per-visit hot footprint drops from ~5 spread cachelines to 2 contiguous
ones. At ~10k-units-moving × ~10 pieces, the streaming working set goes
from roughly 32 MB (over typical L3) to ~13 MB (within L3 of most modern
desktops).

No struct split, no API change, no creg metadata change - creg looks up
fields via offsetof at registration, so reorder is transparent to the save
format. Multiple private:/public: blocks keep external code source-
compatible regardless of declaration order.

Constructor init lists in both ctors are reordered to match the new
declaration order, also clearing the pre-existing -Wreorder-ctor warnings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CSolidObject::UpdatePrevFrameTransform was constructing a full CMatrix44f
via GetTransformMatrix(true) only for CQuaternion::MakeFrom to read the
3x3 rotation part. Add CQuaternion::FromAxes(x, y, z) that runs the same
trace-branch math on the basis columns directly, and route MakeFrom on a
matrix through it. Skips the matrix round-trip for every active unit and
feature each sim frame.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CModelDrawerDataBase::UpdateObjectTrasform was building tmCurr via
Transform::FromMatrix(GetTransformMatrix(true)), which constructs a full
CMatrix44f and then runs DecomposeIntoTRS (Det3 + 3 column lengths +
3 column divisions + MakeFrom). Both CUnit and CFeature paths produce a
matrix shaped like ComposeMatrix(pos), so we can build the same
Transform directly from (-rightdir, updir, frontdir, pos) via
CQuaternion::FromAxes. Runs once per visible model per render frame.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CSolidObject::UpdatePrevFrameTransform ran the per-piece
prevModelSpaceTra copy and rebuilt preFrameTra unconditionally for every
active unit and feature each sim frame, even when the object hadn't
moved or animated. That's the common case in an RTS battle (parked
vehicles, finished wreckage, idle gathered units, all buildings).

Add a per-object preFrameDirty flag, set at the choke points that mutate
pos/dirs (Move, ForcedSpin, UpdateDirVectors, SetDirVectors) and
propagated from per-piece script setters (LocalModelPiece::SetFloat3 /
SetFloat) via a new LocalModel→CSolidObject back-pointer. Cleared at
the end of UpdatePrevFrameTransform; when clean we early-out before
touching either the piece loop or the FromAxes rebuild.

Measured ~20% faster Sim::Unit::UpdatePreFrame and ~50% faster
Sim::Features::UpdatePreFrame on a representative scene.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
UpdateWeaponVectors runs MT on every active unit each sim frame and
re-resolves aimFromPiece/muzzlePiece script-piece indices through
SafeGetPiece on every call. Both the indices and the script's pieces[]
array only change at script init/state changes, so the indirection is
pure overhead.

Cache the resolved LocalModelPiece* on the weapon and rebind from the
existing piece-update sites: CWeapon::UpdateWeaponPieces (collapsed to
a single trailing rebind), and CCobInstance::MapScriptToModelPieces /
CLuaUnitScript ctor + PostLoad / CNullUnitScript::PostLoad via a new
CUnitScript::RebindWeaponPieceCaches helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After plan 1, UpdateWeaponVectors still calls GetAbsolutePos and
GetEmitDirPos on the cached LocalModelPieces every frame for every
weapon. For most weapons most frames the piece's modelSpaceTra hasn't
changed (the unit may move in world space, but the muzzle/aim piece
is static relative to the unit body), so the quat-rotate arithmetic
inside GetEmitDirPos is wasted work.

Add a uint32_t modelSpaceTraGen counter on LocalModelPiece, bump it at
every modelSpaceTra write, and have CWeapon track the last-seen gen
per cached piece. When the gen matches, skip the script-piece read
and reuse the cached relAimFromPos / relWeaponMuzzlePos / relWeaponDir.
The world-space conversion (GetObjectSpacePos / GetObjectSpaceVec) and
popup ground check still run unconditionally, so moving units stay
correct.

relWeaponDir is split out from weaponDir (which used to be aliased as
both the script-emit-dir output and the world-space output) and added
to creg, the constructor init list, and SaveWeaponVectors /
LoadWeaponVectors so save/load and TestTarget round-trip cleanly.
ReBindLocalModelPieces resets both gen sentinels to ~0u so script
reloads force a re-read.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bruno-dasilva

Copy link
Copy Markdown
Owner Author
Screenshot 2026-04-27 at 5 11 13 AM Screenshot 2026-04-27 at 5 11 55 AM

@github-actions

Copy link
Copy Markdown

bar-benchmark — PR #21

candidate 4cdea21 vs baseline eb1c69f

sim trimmed mean (ms) with 95% CI on the relative delta

scenario candidate baseline Δ (95% CI) n cand n base
fightertest-bots 23.64 ms 23.94 ms ♻️ $\color{green}{-1.44\%} \text{ to } \color{green}{-1.06\%}$ 50 50
fightertest-aircraft 19.21 ms 19.18 ms ♻️ $\color{red}{+0.08\%} \text{ to } \color{red}{+0.22\%}$ 50 50
fightertest-tanks 24.91 ms 24.90 ms ♻️ $\color{green}{-0.30\%} \text{ to } \color{red}{+0.38\%}$ 50 50
fightertest-pathfinding 21.69 ms 21.82 ms ♻️ $\color{green}{-0.80\%} \text{ to } \color{green}{-0.34\%}$ 50 95
lategame1 21.60 ms 23.14 ms ♻️ $\color{green}{-7.88\%} \text{ to } \color{green}{-5.46\%}$ 20 50
Per-VM distribution box plots (5)

fightertest-bots

fightertest-aircraft

fightertest-tanks

fightertest-pathfinding

lategame1

💰 compute cost: $1.35 · 5 fresh legs · 5 cached at $0 last updated: 2026-04-27T12:58:25.156Z · [workflow run](https://github.com/bruno-dasilva/RecoilEngine/actions/runs/24995317261)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants