Add teleop replay agent for non-interactive CI runs#5507
Add teleop replay agent for non-interactive CI runs#5507rwiltz wants to merge 1 commit intoisaac-sim:developfrom
Conversation
There was a problem hiding this comment.
🤖 Isaac Lab Review Bot
Summary
This PR adds a dedicated CI script (teleop_replay_agent.py) for replaying captured teleoperation sessions, along with an internal XCR replay driver module. The goal is to replace runtime patches that CI previously applied to the interactive teleop_se3_agent.py. The implementation is structurally sound but has several issues around error handling, missing module initialization, and potential race conditions.
Architecture Impact
- New public script:
scripts/environments/teleoperation/teleop_replay_agent.pybecomes a CI entry point - New internal module:
isaaclab_teleop.automation.xcr_replay- but theautomationsubpackage appears to lack an__init__.py, making imports fail - Dependencies: Relies on
isaaclab.devices.openxr.remove_camera_configsandisaaclab.devices.teleop_device_factory.create_teleop_device- need to verify these exist in the codebase - Cross-module: Uses
isaaclab_tasks.utils.parse_env_cfgwhich is a stable public API
Implementation Verdict
Significant concerns — Missing __init__.py will cause import failures, and several edge cases in the simulation loop need attention.
Test Coverage
The PR description claims tests were added ("I have added tests that prove my fix is effective or that my feature works") but no test files are included in the changeset. For a CI automation script, integration tests or at minimum a smoke test would be valuable. The XCR replay driver has complex async behavior that's difficult to test in isolation.
CI Status
No CI checks available yet — cannot verify the code actually runs in the target environment.
Findings
🔴 Critical: source/isaaclab_teleop/isaaclab_teleop/automation/xcr_replay.py — Missing __init__.py for automation subpackage
The new automation subpackage has no __init__.py. The import at teleop_replay_agent.py:108:
from isaaclab_teleop.automation import XcrReplayConfig, start_xcr_replaywill raise ModuleNotFoundError: No module named 'isaaclab_teleop.automation'. You need to create source/isaaclab_teleop/isaaclab_teleop/automation/__init__.py that exports these symbols.
🔴 Critical: scripts/environments/teleoperation/teleop_replay_agent.py:141-145 — Unhandled exception silently breaks loop without cleanup
except Exception as e:
logger.error(f"Error during simulation step: {e}")
breakThis catches ALL exceptions including KeyboardInterrupt (which inherits from BaseException, so actually won't be caught) and programming errors. The break exits the loop but:
- The async replay driver keeps running and will eventually call
post_quit()on an already-closed env - The exception traceback is lost (only message logged, not the full trace)
- No re-raise means CI will exit 0 on simulation errors
Consider: logger.exception("...") to preserve traceback, and potentially raise after cleanup or set a non-zero exit code.
🟡 Warning: scripts/environments/teleoperation/teleop_replay_agent.py:137-140 — action.repeat() may fail on 1D tensors
action = teleop_interface.advance()
...
actions = action.repeat(env.num_envs, 1)If teleop_interface.advance() returns a 1D tensor of shape (action_dim,), then .repeat(env.num_envs, 1) produces shape (action_dim,) not (num_envs, action_dim). The correct pattern for replicating across envs is:
actions = action.unsqueeze(0).repeat(env.num_envs, 1)
# or
actions = action.expand(env.num_envs, -1)Verify the expected shape from advance() — if it's already 2D with batch dim 1, the current code is fine.
🟡 Warning: source/isaaclab_teleop/isaaclab_teleop/automation/xcr_replay.py:105-107 — Polling internal module state is fragile
while xcr_player._xcr_playback_subscription is not None:
logger.debug("XCR replay: waiting for playback subscription to clear")
await asyncio.sleep(5)Relying on the underscore-prefixed _xcr_playback_subscription internal variable is brittle — it could be renamed or removed in future Kit versions. The comment acknowledges this ("public-ish signal") but there's no timeout. If the subscription never clears (bug in xcr_player, crash, etc.), this loops forever. Add a timeout:
timeout = 3600 # 1 hour max
elapsed = 0
while xcr_player._xcr_playback_subscription is not None and elapsed < timeout:
await asyncio.sleep(5)
elapsed += 5
if elapsed >= timeout:
logger.warning("XCR replay: timed out waiting for playback to complete")🟡 Warning: scripts/environments/teleoperation/teleop_replay_agent.py:108-109 — asyncio.ensure_future without a running loop
asyncio.ensure_future(start_xcr_replay(XcrReplayConfig(...)))This requires an asyncio event loop to already be running. Kit's main loop typically provides this, but if called before Kit's async infrastructure is fully initialized, this will raise RuntimeError: no running event loop. The script structure suggests this is called after AppLauncher setup, so it's likely fine, but adding defensive handling or a comment explaining the assumption would help maintainability.
🔵 Improvement: scripts/environments/teleoperation/teleop_replay_agent.py:134-135 — Missing render call when action is None may cause frame drops
if action is None:
env.sim.render()
continueWhen action is None, you render but skip env.step(). This is correct for not advancing physics, but the simulation clock doesn't advance either. Over a 120-second warmup period, this could lead to thousands of render-only iterations. Consider adding a small sleep to avoid busy-spinning:
if action is None:
env.sim.render()
simulation_app.update() # May already be implicit
continue🔵 Improvement: source/isaaclab_teleop/isaaclab_teleop/automation/xcr_replay.py:91 — Instantiating XCRReplayAPI() without storing reference
XCRReplayAPI()This instantiates an object solely for its side effects (registering the replay backend). While this works, it's unclear to future readers. Add a comment or store in a variable with _ prefix to signal intentional discard:
_ = XCRReplayAPI() # Side effect: registers replay backend
Greptile SummaryThis PR adds a permanent, non-interactive CI entry-point (
Confidence Score: 3/5Not safe to merge as-is — the discarded asyncio Future means any failure in the replay coroutine causes the CI job to hang indefinitely, and the missing init.py may prevent the automation subpackage from being imported at all. Three distinct defects affect the core purpose of this change: the asyncio Future is discarded causing silent hangs on failure, the automation subpackage is missing its init.py unlike every sibling package, and the environment is leaked when device creation raises. Both new files need attention: teleop_replay_agent.py for the Future exception handling and env cleanup, and xcr_replay.py for the missing init.py, the unretained XCRReplayAPI reference, and the private-attribute completion poll. Important Files Changed
Sequence DiagramsequenceDiagram
participant CI as CI Pipeline
participant Script as teleop_replay_agent.py
participant Env as Isaac Lab Env
participant Device as handtracking device
participant Driver as start_xcr_replay (async)
participant Kit as omni.kit / XCRReplayAPI
CI->>Script: python teleop_replay_agent.py --task X --replay_file Y
Script->>Env: gym.make(task, cfg=env_cfg).unwrapped
Script->>Device: create_teleop_device(handtracking, ...)
Script->>Env: env.reset()
Script->>Device: teleop_interface.reset()
Script->>Driver: asyncio.ensure_future(start_xcr_replay(cfg))
Note over Driver: Future exception silently lost if dropped
loop simulation_app.is_running()
Script->>Device: teleop_interface.advance()
Device-->>Script: action or None
alt action is not None
Script->>Env: env.step(actions)
else
Script->>Env: env.sim.render()
end
end
Driver->>Kit: set Carb XCR settings
Driver->>Kit: XCRReplayAPI() result not stored
Driver->>Kit: start_replay_if_enabled()
Kit-->>Driver: replay running
loop xcr_player._xcr_playback_subscription is not None
Driver->>Driver: asyncio.sleep(5)
end
Driver->>Kit: post_quit()
Kit-->>Script: simulation_app.is_running() = False
Script->>Env: env.close()
Reviews (1): Last reviewed commit: "Initial commit of CICD integration for t..." | Re-trigger Greptile |
| @@ -0,0 +1,118 @@ | |||
| # Copyright (c) 2022-2026, The Isaac Lab Project Developers (https://github.com/isaac-sim/IsaacLab/blob/main/CONTRIBUTORS.md). | |||
There was a problem hiding this comment.
Missing
__init__.py for the automation subpackage
Every other subpackage inside isaaclab_teleop (deprecated/, deprecated/openxr/, etc.) ships an explicit __init__.py. The automation/ directory introduced here has none. While Python 3.3+ namespace packages can make the directory importable in some layouts, the parent __init__.py calls lazy_export() which may alter __path__ resolution; the import from isaaclab_teleop.automation import XcrReplayConfig, start_xcr_replay in teleop_replay_agent.py is at risk of failing at runtime. An empty __init__.py is the safe, consistent fix.
| driver as a background task, and runs the standard teleop step loop | ||
| until the application is closed (driver-issued ``post_quit``, Kit |
There was a problem hiding this comment.
Unhandled future exception causes silent CI hang
asyncio.ensure_future schedules the coroutine but the returned Future is discarded. If start_xcr_replay raises before completing (e.g. FileNotFoundError, an import error, or any omni.kit failure), the exception is never retrieved — Python logs a Future exception was never retrieved warning but execution continues. Because post_quit() is never invoked, simulation_app.is_running() stays True and the CI job hangs indefinitely. The future should be stored and checked, or an exception callback should be attached.
|
|
||
|
|
||
| def main() -> None: | ||
| """Replay a captured teleop session against an Isaac Lab environment. | ||
|
|
There was a problem hiding this comment.
Environment leaked when device creation raises
env is created on line 120 (gym.make(...).unwrapped), but _create_replay_teleop_device on the next line raises ValueError or RuntimeError when the task config is missing the expected device. Those exceptions propagate out of main() with env still open and simulation_app never closed — mirroring the defensive pattern in teleop_se3_agent.py where each early-exit path calls env.close() and simulation_app.close() would prevent the resource leak.
| # before the AR profile is enabled. | ||
| await omni.kit.app.get_app().next_update_async() | ||
| await omni.kit.app.get_app().next_update_async() | ||
|
|
||
| logger.info("XCR replay: enabling XR profile %s", cfg.profile_name) | ||
| async with test_utils.EnabledXRProfile(cfg.profile_name, 0): |
There was a problem hiding this comment.
Completion detection via private attribute is fragile
xcr_player._xcr_playback_subscription is a private, non-public attribute. Any refactor or patch of omni.kit.xr.core.recorder.scripts.xcr_player that renames or removes this attribute will silently break completion detection: the while loop will exit immediately (if the attribute disappears, an AttributeError is raised) or spin forever (if the attribute is replaced by a different sentinel). A safer approach is to rely on a public API or a documented callback rather than polling an underscore-prefixed module-level variable.
| from omni.kit.xr.core.recorder.scripts.xcr_player import start_replay_if_enabled | ||
|
|
||
| settings = carb.settings.get_settings() | ||
|
|
There was a problem hiding this comment.
XCRReplayAPI() result is not stored
The constructed object is immediately eligible for garbage collection. If XCRReplayAPI registers itself in a global/singleton registry this is fine, but if it only registers via a weak reference or if any internal subscription is tied to the object's lifetime, the replay backend may silently unregister before start_replay_if_enabled() is called. Assigning the result to a local variable (e.g. _replay_api = XCRReplayAPI()) preserves the reference for the lifetime of the coroutine at essentially zero cost.
Description
Adds a permanent, decoupled CI entry-point for replaying captured teleop
sessions against an Isaac Lab environment. Replaces the runtime patch the
teleop-cicdpipeline currently applies toscripts/environments/teleoperation/teleop_se3_agent.pyso theuser-journey script is no longer mutated at CI time.
Fixes # (issue)
Type of change
Checklist
pre-commitchecks with./isaaclab.sh --formatconfig/extension.tomlfileCONTRIBUTORS.mdor my name already exists there