Skip to content

Add teleop replay agent for non-interactive CI runs#5507

Open
rwiltz wants to merge 1 commit intoisaac-sim:developfrom
rwiltz:rwiltz/integrate-teleop-cicd
Open

Add teleop replay agent for non-interactive CI runs#5507
rwiltz wants to merge 1 commit intoisaac-sim:developfrom
rwiltz:rwiltz/integrate-teleop-cicd

Conversation

@rwiltz
Copy link
Copy Markdown
Contributor

@rwiltz rwiltz commented May 5, 2026

Description

Adds a permanent, decoupled CI entry-point for replaying captured teleop
sessions against an Isaac Lab environment. Replaces the runtime patch the
teleop-cicd pipeline currently applies to
scripts/environments/teleoperation/teleop_se3_agent.py so the
user-journey script is no longer mutated at CI time.

Fixes # (issue)

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist

  • I have read and understood the contribution guidelines
  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the changelog and the corresponding version in the extension's config/extension.toml file
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

@rwiltz rwiltz requested a review from ooctipus as a code owner May 5, 2026 22:22
@github-actions github-actions Bot added the isaac-lab Related to Isaac Lab team label May 5, 2026
Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Isaac Lab Review Bot

Summary

This PR adds a dedicated CI script (teleop_replay_agent.py) for replaying captured teleoperation sessions, along with an internal XCR replay driver module. The goal is to replace runtime patches that CI previously applied to the interactive teleop_se3_agent.py. The implementation is structurally sound but has several issues around error handling, missing module initialization, and potential race conditions.

Architecture Impact

  • New public script: scripts/environments/teleoperation/teleop_replay_agent.py becomes a CI entry point
  • New internal module: isaaclab_teleop.automation.xcr_replay - but the automation subpackage appears to lack an __init__.py, making imports fail
  • Dependencies: Relies on isaaclab.devices.openxr.remove_camera_configs and isaaclab.devices.teleop_device_factory.create_teleop_device - need to verify these exist in the codebase
  • Cross-module: Uses isaaclab_tasks.utils.parse_env_cfg which is a stable public API

Implementation Verdict

Significant concerns — Missing __init__.py will cause import failures, and several edge cases in the simulation loop need attention.

Test Coverage

The PR description claims tests were added ("I have added tests that prove my fix is effective or that my feature works") but no test files are included in the changeset. For a CI automation script, integration tests or at minimum a smoke test would be valuable. The XCR replay driver has complex async behavior that's difficult to test in isolation.

CI Status

No CI checks available yet — cannot verify the code actually runs in the target environment.

Findings

🔴 Critical: source/isaaclab_teleop/isaaclab_teleop/automation/xcr_replay.py — Missing __init__.py for automation subpackage
The new automation subpackage has no __init__.py. The import at teleop_replay_agent.py:108:

from isaaclab_teleop.automation import XcrReplayConfig, start_xcr_replay

will raise ModuleNotFoundError: No module named 'isaaclab_teleop.automation'. You need to create source/isaaclab_teleop/isaaclab_teleop/automation/__init__.py that exports these symbols.

🔴 Critical: scripts/environments/teleoperation/teleop_replay_agent.py:141-145 — Unhandled exception silently breaks loop without cleanup

except Exception as e:
    logger.error(f"Error during simulation step: {e}")
    break

This catches ALL exceptions including KeyboardInterrupt (which inherits from BaseException, so actually won't be caught) and programming errors. The break exits the loop but:

  1. The async replay driver keeps running and will eventually call post_quit() on an already-closed env
  2. The exception traceback is lost (only message logged, not the full trace)
  3. No re-raise means CI will exit 0 on simulation errors

Consider: logger.exception("...") to preserve traceback, and potentially raise after cleanup or set a non-zero exit code.

🟡 Warning: scripts/environments/teleoperation/teleop_replay_agent.py:137-140 — action.repeat() may fail on 1D tensors

action = teleop_interface.advance()
...
actions = action.repeat(env.num_envs, 1)

If teleop_interface.advance() returns a 1D tensor of shape (action_dim,), then .repeat(env.num_envs, 1) produces shape (action_dim,) not (num_envs, action_dim). The correct pattern for replicating across envs is:

actions = action.unsqueeze(0).repeat(env.num_envs, 1)
# or
actions = action.expand(env.num_envs, -1)

Verify the expected shape from advance() — if it's already 2D with batch dim 1, the current code is fine.

🟡 Warning: source/isaaclab_teleop/isaaclab_teleop/automation/xcr_replay.py:105-107 — Polling internal module state is fragile

while xcr_player._xcr_playback_subscription is not None:
    logger.debug("XCR replay: waiting for playback subscription to clear")
    await asyncio.sleep(5)

Relying on the underscore-prefixed _xcr_playback_subscription internal variable is brittle — it could be renamed or removed in future Kit versions. The comment acknowledges this ("public-ish signal") but there's no timeout. If the subscription never clears (bug in xcr_player, crash, etc.), this loops forever. Add a timeout:

timeout = 3600  # 1 hour max
elapsed = 0
while xcr_player._xcr_playback_subscription is not None and elapsed < timeout:
    await asyncio.sleep(5)
    elapsed += 5
if elapsed >= timeout:
    logger.warning("XCR replay: timed out waiting for playback to complete")

🟡 Warning: scripts/environments/teleoperation/teleop_replay_agent.py:108-109 — asyncio.ensure_future without a running loop

asyncio.ensure_future(start_xcr_replay(XcrReplayConfig(...)))

This requires an asyncio event loop to already be running. Kit's main loop typically provides this, but if called before Kit's async infrastructure is fully initialized, this will raise RuntimeError: no running event loop. The script structure suggests this is called after AppLauncher setup, so it's likely fine, but adding defensive handling or a comment explaining the assumption would help maintainability.

🔵 Improvement: scripts/environments/teleoperation/teleop_replay_agent.py:134-135 — Missing render call when action is None may cause frame drops

if action is None:
    env.sim.render()
    continue

When action is None, you render but skip env.step(). This is correct for not advancing physics, but the simulation clock doesn't advance either. Over a 120-second warmup period, this could lead to thousands of render-only iterations. Consider adding a small sleep to avoid busy-spinning:

if action is None:
    env.sim.render()
    simulation_app.update()  # May already be implicit
    continue

🔵 Improvement: source/isaaclab_teleop/isaaclab_teleop/automation/xcr_replay.py:91 — Instantiating XCRReplayAPI() without storing reference

XCRReplayAPI()

This instantiates an object solely for its side effects (registering the replay backend). While this works, it's unclear to future readers. Add a comment or store in a variable with _ prefix to signal intentional discard:

_ = XCRReplayAPI()  # Side effect: registers replay backend

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 5, 2026

Greptile Summary

This PR adds a permanent, non-interactive CI entry-point (teleop_replay_agent.py) for replaying captured XCR teleop sessions against an Isaac Lab environment, backed by a new isaaclab_teleop.automation subpackage containing the XCR replay driver. It replaces a runtime patch that the teleop-cicd pipeline previously applied to the user-facing teleop_se3_agent.py.

  • teleop_replay_agent.py: Builds the RL env, attaches the legacy handtracking teleop device, schedules the replay driver via asyncio.ensure_future, and pumps the simulation loop until the application exits or an error occurs.
  • xcr_replay.py: Drives Kit's OpenXR XCR backend — configures Carb settings, constructs XCRReplayAPI, calls start_replay_if_enabled(), enables the XR profile, and polls xcr_player._xcr_playback_subscription to detect completion before calling post_quit().

Confidence Score: 3/5

Not safe to merge as-is — the discarded asyncio Future means any failure in the replay coroutine causes the CI job to hang indefinitely, and the missing init.py may prevent the automation subpackage from being imported at all.

Three distinct defects affect the core purpose of this change: the asyncio Future is discarded causing silent hangs on failure, the automation subpackage is missing its init.py unlike every sibling package, and the environment is leaked when device creation raises.

Both new files need attention: teleop_replay_agent.py for the Future exception handling and env cleanup, and xcr_replay.py for the missing init.py, the unretained XCRReplayAPI reference, and the private-attribute completion poll.

Important Files Changed

Filename Overview
scripts/environments/teleoperation/teleop_replay_agent.py New CI entry-point for non-interactive teleop replay; has two issues: (1) environment is leaked if device creation raises, (2) the asyncio Future returned by ensure_future is discarded, so any exception in the replay coroutine causes a silent hang rather than a clean failure.
source/isaaclab_teleop/isaaclab_teleop/automation/xcr_replay.py New XCR replay driver; missing init.py for the automation subpackage (inconsistent with every other sibling package), polls a private attribute for completion detection, and discards the XCRReplayAPI reference that may need to stay alive.
source/isaaclab_teleop/changelog.d/rwiltz-xcr-replay-agent.minor.rst Changelog fragment describing the new automation subpackage and replay entry-point; no issues.

Sequence Diagram

sequenceDiagram
    participant CI as CI Pipeline
    participant Script as teleop_replay_agent.py
    participant Env as Isaac Lab Env
    participant Device as handtracking device
    participant Driver as start_xcr_replay (async)
    participant Kit as omni.kit / XCRReplayAPI

    CI->>Script: python teleop_replay_agent.py --task X --replay_file Y
    Script->>Env: gym.make(task, cfg=env_cfg).unwrapped
    Script->>Device: create_teleop_device(handtracking, ...)
    Script->>Env: env.reset()
    Script->>Device: teleop_interface.reset()
    Script->>Driver: asyncio.ensure_future(start_xcr_replay(cfg))
    Note over Driver: Future exception silently lost if dropped

    loop simulation_app.is_running()
        Script->>Device: teleop_interface.advance()
        Device-->>Script: action or None
        alt action is not None
            Script->>Env: env.step(actions)
        else
            Script->>Env: env.sim.render()
        end
    end

    Driver->>Kit: set Carb XCR settings
    Driver->>Kit: XCRReplayAPI() result not stored
    Driver->>Kit: start_replay_if_enabled()
    Kit-->>Driver: replay running

    loop xcr_player._xcr_playback_subscription is not None
        Driver->>Driver: asyncio.sleep(5)
    end

    Driver->>Kit: post_quit()
    Kit-->>Script: simulation_app.is_running() = False
    Script->>Env: env.close()
Loading

Reviews (1): Last reviewed commit: "Initial commit of CICD integration for t..." | Re-trigger Greptile

@@ -0,0 +1,118 @@
# Copyright (c) 2022-2026, The Isaac Lab Project Developers (https://github.com/isaac-sim/IsaacLab/blob/main/CONTRIBUTORS.md).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Missing __init__.py for the automation subpackage

Every other subpackage inside isaaclab_teleop (deprecated/, deprecated/openxr/, etc.) ships an explicit __init__.py. The automation/ directory introduced here has none. While Python 3.3+ namespace packages can make the directory importable in some layouts, the parent __init__.py calls lazy_export() which may alter __path__ resolution; the import from isaaclab_teleop.automation import XcrReplayConfig, start_xcr_replay in teleop_replay_agent.py is at risk of failing at runtime. An empty __init__.py is the safe, consistent fix.

Comment on lines +126 to +127
driver as a background task, and runs the standard teleop step loop
until the application is closed (driver-issued ``post_quit``, Kit
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Unhandled future exception causes silent CI hang

asyncio.ensure_future schedules the coroutine but the returned Future is discarded. If start_xcr_replay raises before completing (e.g. FileNotFoundError, an import error, or any omni.kit failure), the exception is never retrieved — Python logs a Future exception was never retrieved warning but execution continues. Because post_quit() is never invoked, simulation_app.is_running() stays True and the CI job hangs indefinitely. The future should be stored and checked, or an exception callback should be attached.

Comment on lines +120 to +124


def main() -> None:
"""Replay a captured teleop session against an Isaac Lab environment.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Environment leaked when device creation raises

env is created on line 120 (gym.make(...).unwrapped), but _create_replay_teleop_device on the next line raises ValueError or RuntimeError when the task config is missing the expected device. Those exceptions propagate out of main() with env still open and simulation_app never closed — mirroring the defensive pattern in teleop_se3_agent.py where each early-exit path calls env.close() and simulation_app.close() would prevent the resource leak.

Comment on lines +99 to +104
# before the AR profile is enabled.
await omni.kit.app.get_app().next_update_async()
await omni.kit.app.get_app().next_update_async()

logger.info("XCR replay: enabling XR profile %s", cfg.profile_name)
async with test_utils.EnabledXRProfile(cfg.profile_name, 0):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Completion detection via private attribute is fragile

xcr_player._xcr_playback_subscription is a private, non-public attribute. Any refactor or patch of omni.kit.xr.core.recorder.scripts.xcr_player that renames or removes this attribute will silently break completion detection: the while loop will exit immediately (if the attribute disappears, an AttributeError is raised) or spin forever (if the attribute is replaced by a different sentinel). A safer approach is to rely on a public API or a documented callback rather than polling an underscore-prefixed module-level variable.

from omni.kit.xr.core.recorder.scripts.xcr_player import start_replay_if_enabled

settings = carb.settings.get_settings()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 XCRReplayAPI() result is not stored

The constructed object is immediately eligible for garbage collection. If XCRReplayAPI registers itself in a global/singleton registry this is fine, but if it only registers via a weak reference or if any internal subscription is tied to the object's lifetime, the replay backend may silently unregister before start_replay_if_enabled() is called. Assigning the result to a local variable (e.g. _replay_api = XCRReplayAPI()) preserves the reference for the lifetime of the coroutine at essentially zero cost.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

isaac-lab Related to Isaac Lab team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant