Skip to content

Jeff/fix/rosnav8#1791

Draft
jeff-hykin wants to merge 186 commits intodevfrom
jeff/fix/rosnav8
Draft

Jeff/fix/rosnav8#1791
jeff-hykin wants to merge 186 commits intodevfrom
jeff/fix/rosnav8

Conversation

@jeff-hykin
Copy link
Copy Markdown
Member

@jeff-hykin jeff-hykin commented Apr 16, 2026

Pre-requisite PRs

Will be perpetually re-testing hardware after merges/changes to ^

631af1e tested for 2 hours on the g1 (entire battery! instantly responsive after 2hrs)
c54578a044f892fe7f231b468f2f3512adf4282a (later) hardware tested and sim tested.

Problem

Robot blueprints each duplicated viewer backend selection logic (foxglove vs rerun vs rerun-web vs rerun-connect), making it error-prone to add new backends or change behavior. The RerunBridge also lacked a WebSocket server for dimos-viewer keyboard/click events in --connect mode.

Solution

  • port over the whole nav stack

Breaking Changes

  • None

How to Test

git checkout jeff/fix/rosnav8

# last hardware tested commit
git checkout c54578a044f892fe7f231b468f2f3512adf4282a

# simulation
uv run dimos run unitree-g1-nav-sim

# real
uv run dimos run unitree-g1-nav-onboard

Contributor License Agreement

  • I have read and approved the CLA.

Twelve fixes from Paul-style code review of #1780:

native_module.py:
- _maybe_build: use did_change(update=False) + post-success update_cache so
  failed builds never poison the rebuild cache
- delete stray "Source files changed" log line that fired on every call
- stop(): per-instance _stop_lock; capture proc/watchdog refs under the
  lock, do signal/wait/join outside the lock to avoid the watchdog-self-
  join deadlock; second caller no-ops via _stopping
- _read_log_stream: receive pid as a parameter instead of reaching into
  self._process.pid (TOCTOU with stop())
- _child_preexec_linux + ctypes hoisted to module scope under a
  sys.platform guard; no more re-import per start()
- _clear_nix_executable: gracefully handle cwd=None (skip the walk
  entirely), use .resolve() comparison so a symlinked cwd still
  terminates the walk, refuse to walk past cwd's tree

change_detect.py:
- fold max_file_size into _hash_files digest so different thresholds
  against the same cache_name can't corrupt each other
- new _locked_cache_file context manager — flock the .hash file directly
  in "a+" mode; no more orphan .lock sidecars accumulating in the cache
  dir; did_change/update_cache/clear_cache all share the helper

Tests:
- new test_should_rebuild_true_bypasses_change_check for the explicit
  "always rebuild" path
- new test_failed_build_does_not_mark_cache_clean for the update=False
  retry semantics

Build system:
- tiledb darwin overlay: filter patches by name instead of dropping all
  of them, so a future security patch from nixpkgs survives
- livox_sdk_config.hpp: honor $TMPDIR on Darwin instead of hardcoding /tmp

Perf script:
- compute cache_name inline (using the same inspect-based source_file
  pattern) instead of calling the inlined _build_cache_name method

All 26 tests across test_change_detect.py + test_native_module.py +
test_native_rebuild.py pass. ruff + mypy clean. mid360_native and
fastlio2_native still build end-to-end on aarch64-darwin. Linux drvPaths
for libpqxx/tiledb/pcl/vtk verified unchanged by the overlay.
jeff-hykin and others added 24 commits April 27, 2026 22:32
The class name already says "config for extended terrain map" — the
docstring added no information beyond the name.
Both removed strings restated the identifier without adding info:
- TUIControlConfig: "Configuration for the TUI controller."
- TUIControlModule._handle_key: "Process a single keypress."
Per critic audit (commit 0171415 claimed reverts that didn't actually
land). Real reverts now applied — verified per file by checking whether
any nav_stack code imports/uses the changed API.

Reverted to origin/dev:
- dimos/msgs/geometry_msgs/PoseWithCovariance.py — only test_Odometry.py
  uses it (not nav_stack); copy-constructor consolidation isn't needed.
- dimos/robot/unitree/go2/blueprints/smart/unitree_go2.py — wrong robot.
- dimos/visualization/rerun/constants.py — RERUN_WEB_PORT change is
  unrelated to nav_stack.
- pyproject.toml: dropped xxhash>=3.0.0 (unused) and the duplicate
  open3d-unofficial-arm without platform marker. Kept the navigation
  extra (gtsam-extended for PGO), the pytest override-dep, and the
  ruff main.cpp exclude patterns.
- uv.lock regenerated against pruned pyproject.toml.

Re-deleted (had been restored by my earlier blanket checkout):
- dimos/robot/position_stream.py
- dimos/robot/ros_command_queue.py
- dimos/robot/unitree/rosnav.py

Kept (verified in-scope):
- dimos/core/global_config.py — `build_native` field is consumed by
  native_module.py which all nav_stack native modules inherit from.
- dimos/core/native_module.py — used by terrain_analysis, far_planner,
  local_planner, path_follower, tare_planner, fastlio2.
- dimos/msgs/nav_msgs/{ContourPolygons3D,GraphNodes3D,LineSegments3D}.py
  — imported by far_planner.
- dimos/robot/config.py — G1 config (used by nav_stack's blueprint
  test) requires the new optional `end_effector_link` field. Reverting
  this broke test_g1_nav_sim_blueprint_importable.
- dimos/simulation/unity/module.py — used by test_unity_sim,
  test_sim_pipeline, test_cross_wall_planning{,_simple}.

test_sim_pipeline.py: tui_control was deleted from the branch upstream
(commit 08c908e). Removed the dangling
`test_tui_publishes_twist_via_transport` test that imported the now-
gone module.
Just restated the function name and return type annotation.
Just restated the file's purpose ("Unit tests for Costmap + A* used by
SimplePlanner") — already obvious from the path and what's tested.
Same description already lives in the module-level docstring above.
Just paraphrased the file's name and the MovementManager's own docstring.
Per project python-style rule (no magic numbers). Audited
``dimos/navigation/nav_stack/`` production code (excluded tests + auto-
generated lcm files); pydantic field defaults like ``sensor_range:
float = 20.0`` were left alone since the field name is the explanation.

Changes:

- ``modules/movement_manager/movement_manager.py``: extract
  ``MAX_CLICK_HORIZONTAL_M = 500.0`` and ``MAX_CLICK_VERTICAL_M = 50.0``
  module-level constants for the click-bounds sanity check that was
  ``if abs(msg.x) > 500 or abs(msg.y) > 500 or abs(msg.z) > 50:``.

- ``modules/pgo/pgo.py``: extract ``_MIN_ICP_INLIERS = 10`` (used by
  ``_icp`` to bail when the correspondence count is too low to
  constrain the alignment) and ``_MIN_KEYFRAMES_FOR_LOOP_SEARCH = 10``
  (used by ``_SimplePGO.search_for_loops``).

- ``main.py::_floor_quad`` (the ground-plane mesh helper): renamed
  ``s`` / ``z`` locals to ``half_size`` / ``z_below_ground``, and
  pulled the ``[40, 40, 40, 120]`` RGBA value out into a named
  ``floor_color_rgba`` local.

Color-ramp values in ``_global_map_override`` and ``_terrain_map_override``
(``30 + z_norm * 30`` etc.) left as-is — they're already documented by
the explicit ``Low z = (30, 80, 200) → High z = (60, 220, 100)``
comments above each block.

Config-dict values being forwarded to sub-modules (e.g.
``"voxel_point_update_threshold": 100``) left as-is — the dict key is
the explanation, same as a pydantic field default.
Class docstrings like "Test ICP matching functionality" and method
docstrings like "PGO module should declare the right input/output ports"
just paraphrased the identifiers above them.

Removed:
- TestICP class docstring
- TestEdgeCases class docstring
- TestPGOWrapper class docstring
- test_pgo_module_has_correct_ports docstring
- test_pgo_config_defaults docstring
Removed class and method docstrings that just paraphrased the identifiers
(e.g. "Test that PGO produces a valid global map from keyframes" on
TestGlobalMapAccumulation). Kept docstrings that add specifics about
assertions or inputs.
Co-authored-by: Paul Nechifor <paul@nechifor.net>
@jeff-hykin jeff-hykin marked this pull request as ready for review April 28, 2026 06:41
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 28, 2026

Greptile Summary

This PR ports the full CMU SmartNav stack into the dimos framework: a composable create_nav_stack() factory replaces per-blueprint viewer-backend duplication, adding TerrainAnalysis, LocalPlanner, PathFollower, PGO (GTSAM iSAM2 loop closure), SimplePlanner (pure-Python A*), FarPlanner, and MovementManager. Two new G1 blueprints (unitree-g1-nav-onboard, unitree-g1-nav-sim) exercise the stack on real hardware and Unity sim respectively, with correct raw-vs-corrected odometry routing per CMU ICRA 2022 Fig. 11.

  • P1 (dds_sdk.py stand_up): if _get_fsm_id() returns None (DDS query failure), the stale variable makes None != FsmState.AI_MODE evaluate to True, issuing SetFsmId(AI_MODE) to the physical robot in an unknown state with no guard.

Confidence Score: 3/5

P1 in stand_up() could issue an unsolicited AI-mode command to the physical robot when DDS state query fails; safe to merge for sim, needs fix before onboard use.

One P1 (unconditional SetFsmId on None FSM ID in physical robot control) pulls the ceiling to 4; combined with the scope of the change (7000+ line net addition, new nav stack, new hardware SDK, new sim auto-download) a score of 3 is appropriate.

dimos/robot/unitree/g1/effectors/high_level/dds_sdk.py — stand_up() null-fsm guard

Important Files Changed

Filename Overview
dimos/robot/unitree/g1/effectors/high_level/dds_sdk.py New G1 high-level DDS SDK module; stale fsm_id in stand_up() can issue AI-mode command to robot when state is unknown (P1)
dimos/navigation/nav_stack/main.py New composable nav stack factory (create_nav_stack) wiring TerrainAnalysis, LocalPlanner, PathFollower, FAR/Simple planner, PGO, and MovementManager with correct corrected-vs-raw odometry routing
dimos/navigation/nav_stack/modules/simple_planner/simple_planner.py New pure-Python A* global planner with costmap, inflation, stuck detection, and lookahead; directly accesses private Costmap internals in _classify_points (P2)
dimos/navigation/nav_stack/modules/pgo/pgo.py New Python PGO module with GTSAM iSAM2, ICP loop closure, voxel downsampling, and keyframe management; implementation looks correct
dimos/core/native_module.py Refactored NativeModule with improved stop/watchdog locking, prctl(PR_SET_PDEATHSIG) on Linux, auto-build toggle, and cli_name_override for camelCase CLI args
dimos/simulation/unity/module.py Added Google Drive scene auto-download (gdown.download_folder), terrain inclination fitting config, and sensor offsets; downloads entire Drive folder instead of target scene only (P2)
dimos/robot/unitree/g1/blueprints/navigation/unitree_g1_nav_onboard.py New onboard navigation blueprint composing FastLio2, nav stack, G1HighLevelDdsSdk, and vis_module with correct odometry remappings
dimos/robot/unitree/g1/blueprints/navigation/unitree_g1_nav_sim.py New sim navigation blueprint composing UnityBridgeModule, nav stack with SimplePlanner, and vis_module; terrain_map→terrain_map_ext remapping for Unity Z-height is correct
dimos/robot/config.py Made end_effector_link optional with a clear error on manipulation use; added height_clearance, width_clearance, and internal_odom_offsets fields; moved imports to top of file

Sequence Diagram

sequenceDiagram
    participant UI as dimos-viewer / Foxglove
    participant MM as MovementManager
    participant SP as SimplePlanner / FarPlanner
    participant LP as LocalPlanner (C++)
    participant PF as PathFollower (C++)
    participant PGO as PGO (GTSAM)
    participant TA as TerrainAnalysis (C++)
    participant HW as G1HighLevelDdsSdk / Unity

    UI->>MM: clicked_point
    MM->>SP: goal (corrected_odometry frame)
    PGO-->>MM: corrected_odometry
    PGO-->>TA: corrected_odometry
    PGO-->>SP: corrected_odometry
    SP->>LP: way_point (map frame)
    TA->>LP: terrain_map (obstacles)
    LP->>PF: path (raw odom frame)
    PF->>MM: nav_cmd_vel
    MM->>HW: cmd_vel
    HW-->>PGO: odometry (raw SLAM)
    HW-->>LP: registered_scan
    PGO->>PGO: loop closure (ICP + iSAM2)
    PGO-->>UI: corrected_odometry + global_map
Loading

Reviews (1): Last reviewed commit: "test: hoist remaining inline ros1 import..." | Re-trigger Greptile

Comment on lines +269 to +276
fsm_id = self._get_fsm_id()
if fsm_id == FsmState.ZERO_TORQUE:
logger.info("Robot in zero torque, enabling damp mode...")
self.loco_client.SetFsmId(FsmState.DAMP)
time.sleep(self._standup_step_delay / 3)
if fsm_id != FsmState.AI_MODE:
logger.info("Starting AI mode...")
self.loco_client.SetFsmId(FsmState.AI_MODE)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Stale fsm_id causes unconditional AI-mode command on unknown state

fsm_id is fetched once at line 269. If _get_fsm_id() returns None (DDS query failure), the comparison None != FsmState.AI_MODE evaluates to True, so SetFsmId(FsmState.AI_MODE) is issued to the robot even when its current state is completely unknown — bypassing the ZERO_TORQUE → DAMP guard. Additionally, when the state is ZERO_TORQUE, after setting DAMP the stale fsm_id (still ZERO_TORQUE) causes the code to immediately issue another SetFsmId(AI_MODE) without re-querying, leaving no opportunity to confirm the transition succeeded.

Suggested change
fsm_id = self._get_fsm_id()
if fsm_id == FsmState.ZERO_TORQUE:
logger.info("Robot in zero torque, enabling damp mode...")
self.loco_client.SetFsmId(FsmState.DAMP)
time.sleep(self._standup_step_delay / 3)
if fsm_id != FsmState.AI_MODE:
logger.info("Starting AI mode...")
self.loco_client.SetFsmId(FsmState.AI_MODE)
fsm_id = self._get_fsm_id()
if fsm_id is None:
logger.warning("Could not read FSM ID; aborting stand_up to avoid unsafe state transition")
return False
if fsm_id == FsmState.ZERO_TORQUE:
logger.info("Robot in zero torque, enabling damp mode...")
self.loco_client.SetFsmId(FsmState.DAMP)
time.sleep(self._standup_step_delay / 3)
fsm_id = self._get_fsm_id() or FsmState.DAMP
if fsm_id != FsmState.AI_MODE:

Comment on lines +490 to +520
mask = heights > 0.0
if not np.any(mask):
return
xs = points[mask, 0]
ys = points[mask, 1]
hs = heights[mask]
cell_size = cm.cell_size
ixs = np.floor(xs / cell_size).astype(np.int64)
iys = np.floor(ys / cell_size).astype(np.int64)
# Group by cell and take max height per cell (vectorized).
keys = np.column_stack((ixs, iys))
_, inverse, counts = np.unique(keys, axis=0, return_inverse=True, return_counts=True)
max_h = np.full(len(counts), float("-inf"))
np.maximum.at(max_h, inverse, hs)
unique_keys = keys[np.unique(inverse, return_index=True)[1]]
dirty = False
for i in range(len(unique_keys)):
key = (int(unique_keys[i, 0]), int(unique_keys[i, 1]))
h = float(max_h[i])
prev = cm._heights.get(key, float("-inf"))
if h > prev:
cm._heights[key] = h
dirty = True
if dirty:
cm._blocked_dirty = True

def _fresh_costmap(self) -> Costmap:
return Costmap(
cell_size=self.config.cell_size,
obstacle_height=self.config.obstacle_height_threshold,
inflation_radius=self.config.inflation_radius,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 _classify_points bypasses Costmap.update() and accesses private state

_classify_points writes directly to cm._heights and sets cm._blocked_dirty instead of calling the public cm.update(x, y, height) method. This tightly couples the caller to internal Costmap layout — any future change to how Costmap tracks heights or invalidates the blocked set will silently break the vectorized path here. The public API (update) already does the same max-over-history tracking in a single dict lookup.

Comment on lines +183 to +193

if not zip_path.exists():
raise FileNotFoundError(
f"Failed to download scene '{scene}'. "
f"Check https://drive.google.com/drive/folders/{_GDRIVE_FOLDER_ID}"
)

extract_dir = dest_dir / scene
if not extract_dir.exists():
logger.info(f"Extracting {zip_path}...")
with zipfile.ZipFile(zip_path, "r") as zf:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 gdown.download_folder downloads entire folder instead of target scene

gdown.download_folder(id=_GDRIVE_FOLDER_ID, ...) fetches every file in the Google Drive folder, not just the requested scene. If the folder contains multiple large scene zips, this silently downloads them all before locating the one that was actually needed. Consider downloading by individual file ID or listing the folder first and filtering to the matching scene zip.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants