
feat(rl): add rollout collector and JSONL export CLI #335

Open

MagellaX wants to merge 7 commits into hud-evals:main from MagellaX:feature/rollout-collector-mvp

Conversation

MagellaX (Contributor) commented Feb 17, 2026

Summary

  • add hud.rl rollout schema and collector utilities to build offline rollout records from run_dataset
  • add hud rollout collect CLI command to execute tasks and export trajectories to JSONL
  • support group_size repetition per task and deterministic rollout IDs
  • add focused tests for collector grouping/coercion/export and CLI flow
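The deterministic rollout IDs and JSONL export mentioned above can be sketched as follows. This is a hypothetical illustration, not the actual hud.rl implementation: the real code may hash different fields, but the idea is a stable digest over (source, task index, attempt) so that repeating a task group_size times yields distinct yet reproducible IDs.

```python
import hashlib
import json


def make_rollout_id(source: str, task_index: int, attempt: int) -> str:
    """Derive a stable rollout ID: re-running the same (source, task, attempt)
    always produces the same ID. Sketch only; field choice is an assumption."""
    digest = hashlib.sha256(f"{source}:{task_index}:{attempt}".encode()).hexdigest()
    return f"rollout-{digest[:16]}"


def export_jsonl(records: list[dict], path: str) -> None:
    """Write one JSON object per line (the JSONL export format)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# group_size=3 repetitions of task 0 get distinct but reproducible IDs:
ids = [make_rollout_id("hud-evals/tasks", 0, attempt) for attempt in range(3)]
assert len(set(ids)) == 3
assert ids == [make_rollout_id("hud-evals/tasks", 0, a) for a in range(3)]
```

Determinism here is what makes exports joinable across runs: collecting the same dataset twice produces records that can be matched by ID rather than by fuzzy prompt comparison.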

Impact

This is a low-risk first step toward RL data workflows in the SDK.
Users can now collect structured trajectories from existing eval tasks without changing the training stack.

Continuation

This PR intentionally keeps scope tight. It sets up the next step to add richer trajectory metadata (e.g. action-level reward shaping fields) without reworking evaluation execution.


Note

Medium Risk
Introduces new execution/export pathways on top of run_dataset and tightens JSON/JSONL task parsing in strict mode; incorrect coercion or stricter validation could affect rollout collection and local task ingestion behavior.

Overview
Adds a new hud rollout collect CLI workflow to execute eval tasks and export structured rollout trajectories to JSONL, including support for per-task repetition via --group-size, concurrency/step limits, tool allow/deny lists, and deterministic rollout_ids.

Introduces a new hud.rl module with RolloutRecord schema plus collector utilities that load tasks (local JSON/JSONL or dataset source + split), run them through run_dataset, coerce results into a normalized Trace, and write records to disk. Task loading is tightened for rollout paths by adding a strict mode to _load_raw_from_file to error on non-object entries with clearer location context, and new unit tests cover collector behavior and the CLI flow.
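The strict-mode loading described above can be sketched roughly as below. The function name `load_tasks_strict` is a hypothetical stand-in for the PR's `_load_raw_from_file` strict path; the point it illustrates is rejecting non-object JSONL entries with file-and-line location context.

```python
import json
from pathlib import Path


def load_tasks_strict(path: str) -> list[dict]:
    """Load tasks from a .jsonl file, one JSON object per line.

    Sketch of strict mode: any entry that parses but is not a JSON object
    raises with the file and line number, so bad entries are easy to locate.
    """
    tasks = []
    for lineno, line in enumerate(Path(path).read_text().splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        entry = json.loads(line)
        if not isinstance(entry, dict):
            raise ValueError(
                f"{path}:{lineno}: expected a JSON object per line, "
                f"got {type(entry).__name__}"
            )
        tasks.append(entry)
    return tasks
```

Failing loudly at load time, rather than coercing silently, is what keeps rollout collection from ingesting malformed tasks and exporting corrupted trajectories.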

Written by Cursor Bugbot for commit 92658ed. This will update automatically on new commits.

chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e97fb4b914


return []

expanded_tasks = _expand_tasks(raw_tasks, group_size=group_size)
source_name = source if isinstance(source, str) else name


P2: Preserve split in rollout source IDs

When source is a HuggingFace dataset and split is non-default, tasks are loaded from source:split but source_name is still set to the unsuffixed source. Because build_rollout_records() uses source in make_rollout_id(), collecting different splits can produce identical rollout IDs for matching prompts/task indexes, and the exported source field also loses split provenance. This can cause collisions or incorrect joins when combining train/test exports.


Comment on lines 53 to 56
config = {"verbose": verbose, "validate_api_key": False}
if allowed_tools:
    config["allowed_tools"] = allowed_tools
return OperatorAgent, config


P2: Pass --model through to OpenAI rollouts

The OpenAI branch ignores the CLI --model value and always instantiates OperatorAgent with its default model, even though collect advertises --model as backend-specific. This means users cannot pin or vary OpenAI models for rollout collection, which can silently skew experiment reproducibility and comparisons.


cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.
