Skip CWeapon piece-transform reads when generation matches #21
Open
bruno-dasilva wants to merge 10 commits into
Open
Skip CWeapon piece-transform reads when generation matches #21bruno-dasilva wants to merge 10 commits into
bruno-dasilva wants to merge 10 commits into
Conversation
* Optimize LocalModelPiece. Move some cold data to pointers * Apply optimized loop over dirty pieces (only recalculate relative transforms where changed)
UpdateModelSpaceTransform was writing a full 64-byte CMatrix44f per piece per tick to a field only consumed by render/collision/Lua read paths. At ~10k mostly-moving units this was ~6.4 MB/frame of pure store traffic on the sim hot path, contributing materially to L3 misses. modelSpaceMat is fully derivable from modelSpaceTra. Track a matDirty flag and refill the matrix lazily inside GetModelSpaceMatrix() the first time it is read after modelSpaceTra changes. All four mutation sites (UpdateModelSpaceTransform x2, UpdateChildTransformRec, UpdateParentMatricesRec) now flip matDirty instead of computing the matrix. Public API is unchanged; all external consumers already go through the getter. creg keeps modelSpaceMat as CR_MEMBER for save-format compatibility, and PostLoad sets matDirty=true so the first read after load recomputes from the (also-serialised) modelSpaceTra. Sync semantics are preserved - modelSpaceMat is a derived cache, never ground-truth state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the deque-driven BFS with a linear iteration over the contiguous LocalModel::pieces vector. Pieces are emplace_back'd in pre-order DFS by LocalModel::CreateLocalModelPieces, so a parent's modelSpaceTra is always up-to-date by the time its child reads it via the pointer overload of UpdateModelSpaceTransform - no need to thread the parent transform through a deque. Wins: - HW prefetcher streams pieces[] (was: deque-pop killed prefetch). - Drops the 40-byte (LocalModelPiece*, Transform) deque element copy per visit and the thread_local std::deque allocator traffic. - children[] is no longer touched in the hot path, freeing a cacheline of pressure per piece. Lock the parent-index < child-index invariant in LocalModelPiece::AddChild with an assert so future construction-time changes can't silently break the ordering the linear sweep depends on. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reshuffle declaration order so the fields touched by the per-tick animation sweep pack into the first two cachelines (offsets 0-128). Cold fields (modelSpaceMat lazily-filled per the prior commit, prevModelSpaceTra, dir, script/Lua-piece indices, localModel, colvol, children, lodDispLists) move past offset 128. Per-visit hot footprint drops from ~5 spread cachelines to 2 contiguous ones. At ~10k-units-moving × ~10 pieces, the streaming working set goes from roughly 32 MB (over typical L3) to ~13 MB (within L3 of most modern desktops). No struct split, no API change, no creg metadata change - creg looks up fields via offsetof at registration, so reorder is transparent to the save format. Multiple private:/public: blocks keep external code source- compatible regardless of declaration order. Constructor init lists in both ctors are reordered to match the new declaration order, also clearing the pre-existing -Wreorder-ctor warnings. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CSolidObject::UpdatePrevFrameTransform was constructing a full CMatrix44f via GetTransformMatrix(true) only for CQuaternion::MakeFrom to read the 3x3 rotation part. Add CQuaternion::FromAxes(x, y, z) that runs the same trace-branch math on the basis columns directly, and route MakeFrom on a matrix through it. Skips the matrix round-trip for every active unit and feature each sim frame. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CModelDrawerDataBase::UpdateObjectTrasform was building tmCurr via Transform::FromMatrix(GetTransformMatrix(true)), which constructs a full CMatrix44f and then runs DecomposeIntoTRS (Det3 + 3 column lengths + 3 column divisions + MakeFrom). Both CUnit and CFeature paths produce a matrix shaped like ComposeMatrix(pos), so we can build the same Transform directly from (-rightdir, updir, frontdir, pos) via CQuaternion::FromAxes. Runs once per visible model per render frame. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CSolidObject::UpdatePrevFrameTransform ran the per-piece prevModelSpaceTra copy and rebuilt preFrameTra unconditionally for every active unit and feature each sim frame, even when the object hadn't moved or animated. That's the common case in an RTS battle (parked vehicles, finished wreckage, idle gathered units, all buildings). Add a per-object preFrameDirty flag, set at the choke points that mutate pos/dirs (Move, ForcedSpin, UpdateDirVectors, SetDirVectors) and propagated from per-piece script setters (LocalModelPiece::SetFloat3 / SetFloat) via a new LocalModel→CSolidObject back-pointer. Cleared at the end of UpdatePrevFrameTransform; when clean we early-out before touching either the piece loop or the FromAxes rebuild. Measured ~20% faster Sim::Unit::UpdatePreFrame and ~50% faster Sim::Features::UpdatePreFrame on a representative scene. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
UpdateWeaponVectors runs MT on every active unit each sim frame and re-resolves aimFromPiece/muzzlePiece script-piece indices through SafeGetPiece on every call. Both the indices and the script's pieces[] array only change at script init/state changes, so the indirection is pure overhead. Cache the resolved LocalModelPiece* on the weapon and rebind from the existing piece-update sites: CWeapon::UpdateWeaponPieces (collapsed to a single trailing rebind), and CCobInstance::MapScriptToModelPieces / CLuaUnitScript ctor + PostLoad / CNullUnitScript::PostLoad via a new CUnitScript::RebindWeaponPieceCaches helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After plan 1, UpdateWeaponVectors still calls GetAbsolutePos and GetEmitDirPos on the cached LocalModelPieces every frame for every weapon. For most weapons most frames the piece's modelSpaceTra hasn't changed (the unit may move in world space, but the muzzle/aim piece is static relative to the unit body), so the quat-rotate arithmetic inside GetEmitDirPos is wasted work. Add a uint32_t modelSpaceTraGen counter on LocalModelPiece, bump it at every modelSpaceTra write, and have CWeapon track the last-seen gen per cached piece. When the gen matches, skip the script-piece read and reuse the cached relAimFromPos / relWeaponMuzzlePos / relWeaponDir. The world-space conversion (GetObjectSpacePos / GetObjectSpaceVec) and popup ground check still run unconditionally, so moving units stay correct. relWeaponDir is split out from weaponDir (which used to be aliased as both the script-emit-dir output and the world-space output) and added to creg, the constructor init list, and SaveWeaponVectors / LoadWeaponVectors so save/load and TestTarget round-trip cleanly. ReBindLocalModelPieces resets both gen sentinels to ~0u so script reloads force a re-read. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Owner
Author
bar-benchmark — PR #21candidate sim trimmed mean (ms) with 95% CI on the relative delta
|
6ae3d38 to
eff5a00
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.







No description provided.