Jeff/fix/rosnav8#1791
Conversation
…nto jeff/fix/rosnav8
…nto jeff/fix/rosnav8
…dimos into jeff/fix/rosnav8
Twelve fixes from Paul-style code review of #1780: native_module.py: - _maybe_build: use did_change(update=False) + post-success update_cache so failed builds never poison the rebuild cache - delete stray "Source files changed" log line that fired on every call - stop(): per-instance _stop_lock; capture proc/watchdog refs under the lock, do signal/wait/join outside the lock to avoid the watchdog-self- join deadlock; second caller no-ops via _stopping - _read_log_stream: receive pid as a parameter instead of reaching into self._process.pid (TOCTOU with stop()) - _child_preexec_linux + ctypes hoisted to module scope under a sys.platform guard; no more re-import per start() - _clear_nix_executable: gracefully handle cwd=None (skip the walk entirely), use .resolve() comparison so a symlinked cwd still terminates the walk, refuse to walk past cwd's tree change_detect.py: - fold max_file_size into _hash_files digest so different thresholds against the same cache_name can't corrupt each other - new _locked_cache_file context manager — flock the .hash file directly in "a+" mode; no more orphan .lock sidecars accumulating in the cache dir; did_change/update_cache/clear_cache all share the helper Tests: - new test_should_rebuild_true_bypasses_change_check for the explicit "always rebuild" path - new test_failed_build_does_not_mark_cache_clean for the update=False retry semantics Build system: - tiledb darwin overlay: filter patches by name instead of dropping all of them, so a future security patch from nixpkgs survives - livox_sdk_config.hpp: honor $TMPDIR on Darwin instead of hardcoding /tmp Perf script: - compute cache_name inline (using the same inspect-based source_file pattern) instead of calling the inlined _build_cache_name method All 26 tests across test_change_detect.py + test_native_module.py + test_native_rebuild.py pass. ruff + mypy clean. mid360_native and fastlio2_native still build end-to-end on aarch64-darwin. Linux drvPaths for libpqxx/tiledb/pcl/vtk verified unchanged by the overlay.
The class name already says "config for extended terrain map" — the docstring added no information beyond the name.
Both removed strings restated the identifier without adding info: - TUIControlConfig: "Configuration for the TUI controller." - TUIControlModule._handle_key: "Process a single keypress."
Per critic audit (commit 0171415 claimed reverts that didn't actually land). Real reverts now applied — verified per file by checking whether any nav_stack code imports/uses the changed API. Reverted to origin/dev: - dimos/msgs/geometry_msgs/PoseWithCovariance.py — only test_Odometry.py uses it (not nav_stack); copy-constructor consolidation isn't needed. - dimos/robot/unitree/go2/blueprints/smart/unitree_go2.py — wrong robot. - dimos/visualization/rerun/constants.py — RERUN_WEB_PORT change is unrelated to nav_stack. - pyproject.toml: dropped xxhash>=3.0.0 (unused) and the duplicate open3d-unofficial-arm without platform marker. Kept the navigation extra (gtsam-extended for PGO), the pytest override-dep, and the ruff main.cpp exclude patterns. - uv.lock regenerated against pruned pyproject.toml. Re-deleted (had been restored by my earlier blanket checkout): - dimos/robot/position_stream.py - dimos/robot/ros_command_queue.py - dimos/robot/unitree/rosnav.py Kept (verified in-scope): - dimos/core/global_config.py — `build_native` field is consumed by native_module.py which all nav_stack native modules inherit from. - dimos/core/native_module.py — used by terrain_analysis, far_planner, local_planner, path_follower, tare_planner, fastlio2. - dimos/msgs/nav_msgs/{ContourPolygons3D,GraphNodes3D,LineSegments3D}.py — imported by far_planner. - dimos/robot/config.py — G1 config (used by nav_stack's blueprint test) requires the new optional `end_effector_link` field. Reverting this broke test_g1_nav_sim_blueprint_importable. - dimos/simulation/unity/module.py — used by test_unity_sim, test_sim_pipeline, test_cross_wall_planning{,_simple}. test_sim_pipeline.py: tui_control was deleted from the branch upstream (commit 08c908e). Removed the dangling `test_tui_publishes_twist_via_transport` test that imported the now- gone module.
Just restated the function name and return type annotation.
Just restated the file's purpose ("Unit tests for Costmap + A* used by
SimplePlanner") — already obvious from the path and what's tested.
Same description already lives in the module-level docstring above.
Just paraphrased the file's name and the MovementManager's own docstring.
Per project python-style rule (no magic numbers). Audited ``dimos/navigation/nav_stack/`` production code (excluded tests + auto- generated lcm files); pydantic field defaults like ``sensor_range: float = 20.0`` were left alone since the field name is the explanation. Changes: - ``modules/movement_manager/movement_manager.py``: extract ``MAX_CLICK_HORIZONTAL_M = 500.0`` and ``MAX_CLICK_VERTICAL_M = 50.0`` module-level constants for the click-bounds sanity check that was ``if abs(msg.x) > 500 or abs(msg.y) > 500 or abs(msg.z) > 50:``. - ``modules/pgo/pgo.py``: extract ``_MIN_ICP_INLIERS = 10`` (used by ``_icp`` to bail when the correspondence count is too low to constrain the alignment) and ``_MIN_KEYFRAMES_FOR_LOOP_SEARCH = 10`` (used by ``_SimplePGO.search_for_loops``). - ``main.py::_floor_quad`` (the ground-plane mesh helper): renamed ``s`` / ``z`` locals to ``half_size`` / ``z_below_ground``, and pulled the ``[40, 40, 40, 120]`` RGBA value out into a named ``floor_color_rgba`` local. Color-ramp values in ``_global_map_override`` and ``_terrain_map_override`` (``30 + z_norm * 30`` etc.) left as-is — they're already documented by the explicit ``Low z = (30, 80, 200) → High z = (60, 220, 100)`` comments above each block. Config-dict values being forwarded to sub-modules (e.g. ``"voxel_point_update_threshold": 100``) left as-is — the dict key is the explanation, same as a pydantic field default.
Class docstrings like "Test ICP matching functionality" and method docstrings like "PGO module should declare the right input/output ports" just paraphrased the identifiers above them. Removed: - TestICP class docstring - TestEdgeCases class docstring - TestPGOWrapper class docstring - test_pgo_module_has_correct_ports docstring - test_pgo_config_defaults docstring
Removed class and method docstrings that just paraphrased the identifiers (e.g. "Test that PGO produces a valid global map from keyframes" on TestGlobalMapAccumulation). Kept docstrings that add specifics about assertions or inputs.
Co-authored-by: Paul Nechifor <paul@nechifor.net>
…o jeff/fix/rosnav8
Greptile SummaryThis PR ports the full CMU SmartNav stack into the dimos framework: a composable
Confidence Score: 3/5P1 in stand_up() could issue an unsolicited AI-mode command to the physical robot when DDS state query fails; safe to merge for sim, needs fix before onboard use. One P1 (unconditional SetFsmId on None FSM ID in physical robot control) pulls the ceiling to 4; combined with the scope of the change (7000+ line net addition, new nav stack, new hardware SDK, new sim auto-download) a score of 3 is appropriate. dimos/robot/unitree/g1/effectors/high_level/dds_sdk.py — stand_up() null-fsm guard Important Files Changed
Sequence DiagramsequenceDiagram
participant UI as dimos-viewer / Foxglove
participant MM as MovementManager
participant SP as SimplePlanner / FarPlanner
participant LP as LocalPlanner (C++)
participant PF as PathFollower (C++)
participant PGO as PGO (GTSAM)
participant TA as TerrainAnalysis (C++)
participant HW as G1HighLevelDdsSdk / Unity
UI->>MM: clicked_point
MM->>SP: goal (corrected_odometry frame)
PGO-->>MM: corrected_odometry
PGO-->>TA: corrected_odometry
PGO-->>SP: corrected_odometry
SP->>LP: way_point (map frame)
TA->>LP: terrain_map (obstacles)
LP->>PF: path (raw odom frame)
PF->>MM: nav_cmd_vel
MM->>HW: cmd_vel
HW-->>PGO: odometry (raw SLAM)
HW-->>LP: registered_scan
PGO->>PGO: loop closure (ICP + iSAM2)
PGO-->>UI: corrected_odometry + global_map
Reviews (1): Last reviewed commit: "test: hoist remaining inline ros1 import..." | Re-trigger Greptile |
| fsm_id = self._get_fsm_id() | ||
| if fsm_id == FsmState.ZERO_TORQUE: | ||
| logger.info("Robot in zero torque, enabling damp mode...") | ||
| self.loco_client.SetFsmId(FsmState.DAMP) | ||
| time.sleep(self._standup_step_delay / 3) | ||
| if fsm_id != FsmState.AI_MODE: | ||
| logger.info("Starting AI mode...") | ||
| self.loco_client.SetFsmId(FsmState.AI_MODE) |
There was a problem hiding this comment.
Stale
fsm_id causes unconditional AI-mode command on unknown state
fsm_id is fetched once at line 269. If _get_fsm_id() returns None (DDS query failure), the comparison None != FsmState.AI_MODE evaluates to True, so SetFsmId(FsmState.AI_MODE) is issued to the robot even when its current state is completely unknown — bypassing the ZERO_TORQUE → DAMP guard. Additionally, when the state is ZERO_TORQUE, after setting DAMP the stale fsm_id (still ZERO_TORQUE) causes the code to immediately issue another SetFsmId(AI_MODE) without re-querying, leaving no opportunity to confirm the transition succeeded.
| fsm_id = self._get_fsm_id() | |
| if fsm_id == FsmState.ZERO_TORQUE: | |
| logger.info("Robot in zero torque, enabling damp mode...") | |
| self.loco_client.SetFsmId(FsmState.DAMP) | |
| time.sleep(self._standup_step_delay / 3) | |
| if fsm_id != FsmState.AI_MODE: | |
| logger.info("Starting AI mode...") | |
| self.loco_client.SetFsmId(FsmState.AI_MODE) | |
| fsm_id = self._get_fsm_id() | |
| if fsm_id is None: | |
| logger.warning("Could not read FSM ID; aborting stand_up to avoid unsafe state transition") | |
| return False | |
| if fsm_id == FsmState.ZERO_TORQUE: | |
| logger.info("Robot in zero torque, enabling damp mode...") | |
| self.loco_client.SetFsmId(FsmState.DAMP) | |
| time.sleep(self._standup_step_delay / 3) | |
| fsm_id = self._get_fsm_id() or FsmState.DAMP | |
| if fsm_id != FsmState.AI_MODE: |
| mask = heights > 0.0 | ||
| if not np.any(mask): | ||
| return | ||
| xs = points[mask, 0] | ||
| ys = points[mask, 1] | ||
| hs = heights[mask] | ||
| cell_size = cm.cell_size | ||
| ixs = np.floor(xs / cell_size).astype(np.int64) | ||
| iys = np.floor(ys / cell_size).astype(np.int64) | ||
| # Group by cell and take max height per cell (vectorized). | ||
| keys = np.column_stack((ixs, iys)) | ||
| _, inverse, counts = np.unique(keys, axis=0, return_inverse=True, return_counts=True) | ||
| max_h = np.full(len(counts), float("-inf")) | ||
| np.maximum.at(max_h, inverse, hs) | ||
| unique_keys = keys[np.unique(inverse, return_index=True)[1]] | ||
| dirty = False | ||
| for i in range(len(unique_keys)): | ||
| key = (int(unique_keys[i, 0]), int(unique_keys[i, 1])) | ||
| h = float(max_h[i]) | ||
| prev = cm._heights.get(key, float("-inf")) | ||
| if h > prev: | ||
| cm._heights[key] = h | ||
| dirty = True | ||
| if dirty: | ||
| cm._blocked_dirty = True | ||
|
|
||
| def _fresh_costmap(self) -> Costmap: | ||
| return Costmap( | ||
| cell_size=self.config.cell_size, | ||
| obstacle_height=self.config.obstacle_height_threshold, | ||
| inflation_radius=self.config.inflation_radius, |
There was a problem hiding this comment.
_classify_points bypasses Costmap.update() and accesses private state
_classify_points writes directly to cm._heights and sets cm._blocked_dirty instead of calling the public cm.update(x, y, height) method. This tightly couples the caller to internal Costmap layout — any future change to how Costmap tracks heights or invalidates the blocked set will silently break the vectorized path here. The public API (update) already does the same max-over-history tracking in a single dict lookup.
|
|
||
| if not zip_path.exists(): | ||
| raise FileNotFoundError( | ||
| f"Failed to download scene '{scene}'. " | ||
| f"Check https://drive.google.com/drive/folders/{_GDRIVE_FOLDER_ID}" | ||
| ) | ||
|
|
||
| extract_dir = dest_dir / scene | ||
| if not extract_dir.exists(): | ||
| logger.info(f"Extracting {zip_path}...") | ||
| with zipfile.ZipFile(zip_path, "r") as zf: |
There was a problem hiding this comment.
gdown.download_folder downloads entire folder instead of target scene
gdown.download_folder(id=_GDRIVE_FOLDER_ID, ...) fetches every file in the Google Drive folder, not just the requested scene. If the folder contains multiple large scene zips, this silently downloads them all before locating the one that was actually needed. Consider downloading by individual file ID or listing the folder first and filtering to the matching scene zip.
Pre-requisite PRs
Will be perpetually re-testing hardware after merges/changes to ^
631af1etested for 2 hours on the g1 (entire battery! instantly responsive after 2hrs)c54578a044f892fe7f231b468f2f3512adf4282a(later) hardware tested and sim tested.Problem
Robot blueprints each duplicated viewer backend selection logic (foxglove vs rerun vs rerun-web vs rerun-connect), making it error-prone to add new backends or change behavior. The RerunBridge also lacked a WebSocket server for dimos-viewer keyboard/click events in --connect mode.
Solution
Breaking Changes
How to Test
Contributor License Agreement