Cache resolved LocalModelPiece* on CWeapon #20

Open
bruno-dasilva wants to merge 9 commits into master from bruno/try-optimize-weapons
@bruno-dasilva

Owner

No description provided.

lhog and others added 9 commits April 22, 2026 05:30
* Optimize LocalModelPiece. Move some cold data to pointers
* Apply optimized loop over dirty pieces (only recalculate relative transforms where changed)
UpdateModelSpaceTransform was writing a full 64-byte CMatrix44f per piece per
tick to a field only consumed by render/collision/Lua read paths. At ~10k
mostly-moving units with ~10 pieces each, this was ~6.4 MB/frame of pure
store traffic on the sim hot path, contributing materially to L3 misses.

modelSpaceMat is fully derivable from modelSpaceTra. Track a matDirty flag
and refill the matrix lazily inside GetModelSpaceMatrix() the first time it
is read after modelSpaceTra changes. All four mutation sites
(UpdateModelSpaceTransform x2, UpdateChildTransformRec, UpdateParentMatricesRec)
now flip matDirty instead of computing the matrix.

Public API is unchanged; all external consumers already go through the
getter. creg keeps modelSpaceMat as CR_MEMBER for save-format compatibility,
and PostLoad sets matDirty=true so the first read after load recomputes
from the (also-serialised) modelSpaceTra. Sync semantics are preserved -
modelSpaceMat is a derived cache, never ground-truth state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
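
The lazy-refill scheme described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the engine's code: `Transform`, `Matrix44`, `Piece`, `SetModelSpaceTransform`, and `RebuildCount` are hypothetical stand-ins, and the transform is reduced to a translation so the example stays short.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-ins for the engine's Transform and CMatrix44f.
struct Transform { float tx, ty, tz; };      // translation only, for brevity
using Matrix44 = std::array<float, 16>;      // 64 bytes, column-major

class Piece {
public:
    // Hot path: update the source-of-truth transform and invalidate the cache.
    void SetModelSpaceTransform(const Transform& t) {
        modelSpaceTra = t;
        matDirty = true;   // a one-byte store instead of a 64-byte matrix store
    }

    // Cold path: rebuild the derived matrix on the first read after a change.
    const Matrix44& GetModelSpaceMatrix() const {
        if (matDirty) {
            modelSpaceMat = Matrix44{
                1.0f, 0.0f, 0.0f, 0.0f,
                0.0f, 1.0f, 0.0f, 0.0f,
                0.0f, 0.0f, 1.0f, 0.0f,
                modelSpaceTra.tx, modelSpaceTra.ty, modelSpaceTra.tz, 1.0f};
            matDirty = false;
            ++rebuildCount;
        }
        return modelSpaceMat;
    }

    // Instrumentation for this sketch only.
    std::size_t RebuildCount() const { return rebuildCount; }

private:
    Transform modelSpaceTra{0.0f, 0.0f, 0.0f};
    mutable Matrix44 modelSpaceMat{};        // derived cache, never ground truth
    mutable bool matDirty = true;            // mirrors "PostLoad sets matDirty=true"
    mutable std::size_t rebuildCount = 0;
};
```

Two writes followed by two reads trigger a single rebuild. Note that the getter mutates cached state behind a `const` interface, so in this sketch concurrent readers of the same piece would need external synchronisation; how the engine handles that is not stated in the commit message.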
Replace the deque-driven BFS with a linear iteration over the contiguous
LocalModel::pieces vector. Pieces are emplace_back'd in pre-order DFS by
LocalModel::CreateLocalModelPieces, so a parent's modelSpaceTra is always
up-to-date by the time its child reads it via the pointer overload of
UpdateModelSpaceTransform - no need to thread the parent transform through
a deque.

Wins:
- HW prefetcher streams pieces[] (was: deque-pop killed prefetch).
- Drops the 40-byte (LocalModelPiece*, Transform) deque element copy per
  visit and the thread_local std::deque allocator traffic.
- children[] is no longer touched in the hot path, freeing a cacheline of
  pressure per piece.

Lock the parent-index < child-index invariant in LocalModelPiece::AddChild
with an assert so future construction-time changes can't silently break the
ordering the linear sweep depends on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
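
To illustrate why pre-order storage makes a plain linear sweep sufficient, here is a small sketch under simplifying assumptions: a single float offset stands in for the full transform, and `Piece`, `Model`, `AddPiece`, and `UpdateModelSpace` are hypothetical names rather than the engine's types.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical simplified piece: one float stands in for a full transform.
struct Piece {
    int parent;          // index into Model::pieces, -1 for the root
    float localOffset;   // piece-space value written by animation
    float modelOffset;   // derived: parent's modelOffset + localOffset
};

struct Model {
    std::vector<Piece> pieces;   // stored in pre-order DFS

    // Appends a piece and returns its index. The assert locks in the
    // "parent index < child index" invariant the sweep below relies on.
    int AddPiece(int parent, float localOffset) {
        const int index = static_cast<int>(pieces.size());
        assert(parent >= -1 && parent < index);
        pieces.push_back({parent, localOffset, 0.0f});
        return index;
    }

    // Linear sweep: a parent always precedes its children, so its
    // modelOffset is already current when a child reads it.
    void UpdateModelSpace() {
        for (Piece& p : pieces) {
            const float parentOffset =
                (p.parent >= 0)
                    ? pieces[static_cast<std::size_t>(p.parent)].modelOffset
                    : 0.0f;
            p.modelOffset = parentOffset + p.localOffset;
        }
    }
};
```

The loop touches memory strictly front to back and never consults a child list, which is the access pattern the commit credits for the prefetcher-friendly behaviour.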
Reshuffle declaration order so the fields touched by the per-tick animation
sweep pack into the first two cachelines (offsets 0-128). Cold fields
(modelSpaceMat lazily-filled per the prior commit, prevModelSpaceTra, dir,
script/Lua-piece indices, localModel, colvol, children, lodDispLists) move
past offset 128.

Per-visit hot footprint drops from ~5 spread cachelines to 2 contiguous
ones. At ~10k-units-moving × ~10 pieces, the streaming working set goes
from roughly 32 MB (over typical L3) to ~13 MB (within L3 of most modern
desktops).

No struct split, no API change, no creg metadata change - creg looks up
fields via offsetof at registration, so reorder is transparent to the save
format. Multiple private:/public: blocks keep external code source-
compatible regardless of declaration order.

Constructor init lists in both ctors are reordered to match the new
declaration order, also clearing the pre-existing -Wreorder-ctor warnings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
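
The working-set figures quoted above follow from simple arithmetic, which can be checked at compile time. The sketch assumes 64-byte cache lines (consistent with "two cachelines (offsets 0-128)") and the round numbers from the commit message; the constant and function names are illustrative only.

```cpp
#include <cstddef>

constexpr std::size_t kCacheLine     = 64;       // bytes per cache line (assumed)
constexpr std::size_t kUnits         = 10'000;   // "~10k units moving"
constexpr std::size_t kPiecesPerUnit = 10;       // "~10 pieces"
constexpr std::size_t kPieces        = kUnits * kPiecesPerUnit;

// Bytes streamed per sweep when each piece visit touches `linesPerPiece` lines.
constexpr std::size_t WorkingSetBytes(std::size_t linesPerPiece) {
    return kPieces * linesPerPiece * kCacheLine;
}

static_assert(WorkingSetBytes(5) == 32'000'000, "before: roughly 32 MB");
static_assert(WorkingSetBytes(2) == 12'800'000, "after: roughly 13 MB");

// The same piece count reproduces the ~6.4 MB/frame of matrix stores
// mentioned in the earlier lazy-matrix commit (one 64-byte matrix per piece).
static_assert(kPieces * 64 == 6'400'000, "one CMatrix44f store per piece");
```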
CSolidObject::UpdatePrevFrameTransform was constructing a full CMatrix44f
via GetTransformMatrix(true) only for CQuaternion::MakeFrom to read the
3x3 rotation part. Add CQuaternion::FromAxes(x, y, z) that runs the same
trace-branch math on the basis columns directly, and route MakeFrom on a
matrix through it. Skips the matrix round-trip for every active unit and
feature each sim frame.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
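
For readers unfamiliar with the "trace-branch" conversion, the following is the textbook rotation-to-quaternion routine written against three basis columns instead of a matrix. It is a sketch only: `Vec3`, `Quat`, and `QuatFromAxes` are hypothetical names, the axes are assumed orthonormal and right-handed, and the PR does not show whether `CQuaternion::FromAxes` uses exactly this form.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };
struct Quat { float x, y, z, w; };

// Converts a rotation given by its basis columns (x, y, z) to a quaternion.
// Matrix element m[row][col] maps to: m00 = x.x, m10 = x.y, m20 = x.z,
// m01 = y.x, m11 = y.y, m21 = y.z, m02 = z.x, m12 = z.y, m22 = z.z.
inline Quat QuatFromAxes(const Vec3& x, const Vec3& y, const Vec3& z) {
    const float trace = x.x + y.y + z.z;
    if (trace > 0.0f) {
        const float s = 2.0f * std::sqrt(trace + 1.0f);            // s = 4w
        return {(y.z - z.y) / s, (z.x - x.z) / s, (x.y - y.x) / s, 0.25f * s};
    }
    if (x.x > y.y && x.x > z.z) {
        const float s = 2.0f * std::sqrt(1.0f + x.x - y.y - z.z);  // s = 4qx
        return {0.25f * s, (y.x + x.y) / s, (z.x + x.z) / s, (y.z - z.y) / s};
    }
    if (y.y > z.z) {
        const float s = 2.0f * std::sqrt(1.0f + y.y - x.x - z.z);  // s = 4qy
        return {(y.x + x.y) / s, 0.25f * s, (z.y + y.z) / s, (z.x - x.z) / s};
    }
    const float s = 2.0f * std::sqrt(1.0f + z.z - x.x - y.y);      // s = 4qz
    return {(z.x + x.z) / s, (z.y + y.z) / s, 0.25f * s, (x.y - y.x) / s};
}
```

Branching on the largest diagonal term keeps the divisor `s` away from zero, which is why the conversion stays numerically stable for rotations near 180 degrees.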
CModelDrawerDataBase::UpdateObjectTrasform was building tmCurr via
Transform::FromMatrix(GetTransformMatrix(true)), which constructs a full
CMatrix44f and then runs DecomposeIntoTRS (Det3 + 3 column lengths +
3 column divisions + MakeFrom). Both CUnit and CFeature paths produce a
matrix shaped like ComposeMatrix(pos), so we can build the same
Transform directly from (-rightdir, updir, frontdir, pos) via
CQuaternion::FromAxes. Runs once per visible model per render frame.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CSolidObject::UpdatePrevFrameTransform ran the per-piece
prevModelSpaceTra copy and rebuilt preFrameTra unconditionally for every
active unit and feature each sim frame, even when the object hadn't
moved or animated. That's the common case in an RTS battle (parked
vehicles, finished wreckage, idle gathered units, all buildings).

Add a per-object preFrameDirty flag, set at the choke points that mutate
pos/dirs (Move, ForcedSpin, UpdateDirVectors, SetDirVectors) and
propagated from per-piece script setters (LocalModelPiece::SetFloat3 /
SetFloat) via a new LocalModel→CSolidObject back-pointer. Cleared at
the end of UpdatePrevFrameTransform; when clean we early-out before
touching either the piece loop or the FromAxes rebuild.

Measured ~20% faster Sim::Unit::UpdatePreFrame and ~50% faster
Sim::Features::UpdatePreFrame on a representative scene.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
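
The dirty-flag early-out with a piece-to-owner back-pointer might look something like the sketch below. All names (`Object`, `Piece`, `SetPos`, `snapshots`) are hypothetical simplifications; positions are single floats, and the real mutation sites listed in the commit (Move, ForcedSpin, UpdateDirVectors, SetDirVectors, SetFloat3/SetFloat) are represented by just two setters.

```cpp
#include <cstddef>
#include <vector>

struct Object;

struct Piece {
    Object* owner = nullptr;   // back-pointer used to propagate dirtiness
    float pos = 0.0f;
    float prevPos = 0.0f;
    void SetPos(float v);      // stands in for the per-piece script setters
};

struct Object {
    std::vector<Piece> pieces;
    float pos = 0.0f;
    float prevPos = 0.0f;
    bool preFrameDirty = true;   // start dirty so the first frame snapshots
    std::size_t snapshots = 0;   // instrumentation for this sketch only

    explicit Object(std::size_t pieceCount) : pieces(pieceCount) {
        for (Piece& p : pieces) p.owner = this;
    }
    // Copying would leave the back-pointers aimed at the source object.
    Object(const Object&) = delete;
    Object& operator=(const Object&) = delete;

    // Choke point for object-level motion.
    void Move(float delta) { pos += delta; preFrameDirty = true; }

    void UpdatePrevFrameTransform() {
        if (!preFrameDirty) return;   // nothing moved: skip loop and rebuild
        for (Piece& p : pieces) p.prevPos = p.pos;
        prevPos = pos;
        ++snapshots;
        preFrameDirty = false;        // cleared at the end, as in the commit
    }
};

inline void Piece::SetPos(float v) {
    pos = v;
    owner->preFrameDirty = true;
}
```

Skipping is safe here because after one clean snapshot the previous-frame values already equal the current ones, so an idle object has nothing new to copy.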
UpdateWeaponVectors runs multi-threaded on every active unit each sim
frame and re-resolves the aimFromPiece/muzzlePiece script-piece indices
through SafeGetPiece on every call. Both the indices and the script's
pieces[] array only change at script init or on state changes, so the
indirection is pure overhead.

Cache the resolved LocalModelPiece* on the weapon and rebind from the
existing piece-update sites: CWeapon::UpdateWeaponPieces (collapsed to
a single trailing rebind), and CCobInstance::MapScriptToModelPieces /
CLuaUnitScript ctor + PostLoad / CNullUnitScript::PostLoad via a new
CUnitScript::RebindWeaponPieceCaches helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
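
A rough sketch of the cache-and-rebind pattern follows. The types are hypothetical reductions of the engine's classes: only `SafeGetPiece` is a name taken from the commit message, its signature here is assumed, and `RebindPieceCache` and `MuzzleX` are invented for illustration.

```cpp
#include <cstddef>
#include <vector>

struct Piece { float x = 0.0f; };

struct Script {
    std::vector<Piece*> pieces;   // script-piece index -> model piece

    // Bounds-checked lookup (signature assumed for this sketch).
    Piece* SafeGetPiece(int idx) const {
        if (idx < 0 || static_cast<std::size_t>(idx) >= pieces.size())
            return nullptr;
        return pieces[static_cast<std::size_t>(idx)];
    }
};

struct Weapon {
    int muzzlePieceIdx = -1;
    Piece* muzzlePiece = nullptr;   // cached result of the lookup

    // Call from every site that can change the index or the script's
    // pieces[] mapping; the hot path never resolves the index itself.
    void RebindPieceCache(const Script& script) {
        muzzlePiece = script.SafeGetPiece(muzzlePieceIdx);
    }

    // Hot path: one pointer dereference, no index resolution.
    float MuzzleX() const { return muzzlePiece ? muzzlePiece->x : 0.0f; }
};
```

The trade-off is the usual one for caches: correctness now depends on every mutation site calling the rebind, which is presumably why the commit routes them through a single helper.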
@bruno-dasilva

Owner Author
[Two screenshots attached: "Screenshot 2026-04-27 at 4 50 47 AM" and "Screenshot 2026-04-27 at 4 51 18 AM"]

@github-actions


bar-benchmark — PR #20

candidate 84525ac vs baseline eb1c69f

sim trimmed mean (ms) with 95% CI on the relative delta

| scenario | candidate | baseline | Δ (95% CI) | n cand | n base |
|---|---|---|---|---|---|
| fightertest-bots | 23.61 ms | 23.94 ms | ♻️ -1.54% to -1.14% | 50 | 50 |
| fightertest-aircraft | 19.25 ms | 19.18 ms | ♻️ +0.27% to +0.48% | 50 | 50 |
| fightertest-tanks | 24.93 ms | 24.90 ms | ♻️ -0.20% to +0.44% | 50 | 50 |
| fightertest-pathfinding | 21.66 ms | 21.82 ms | ♻️ -0.96% to -0.54% | 50 | 95 |
| lategame1 | 21.89 ms | 23.14 ms | ♻️ -6.99% to -3.97% | 20 | 50 |

Per-VM distribution box plots (5): fightertest-bots, fightertest-aircraft, fightertest-tanks, fightertest-pathfinding, lategame1 [images not reproduced]

💰 compute cost: $1.35 · 5 fresh legs · 5 cached at $0 · last updated: 2026-04-27T12:37:28.102Z · [workflow run](https://github.com/bruno-dasilva/RecoilEngine/actions/runs/24994370850)
