Conversation


@jacobthebanana jacobthebanana commented Sep 17, 2025

This pull request includes:

  • GRPO example on openai/gsm8k, using an LLM judge to compare the proposed answer against the ground truth.
  • Agent SDK integration: define the environment using the familiar OpenAI Agents SDK and run RL on the LLM powering the agent. Not yet tested on multi-agent setups (agent-as-tool or handoff).
  • Extensive typing for simplified function signatures and IDE support: static type checking, pyright lints, and proper autocompletion even within the training loop.
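As a rough illustration of the LLM-judge reward described above, here is a minimal sketch. The prompt template, function names, and one-word verdict protocol are all hypothetical, not the PR's actual interfaces; the judge model call itself is omitted, since only prompt construction and verdict parsing are shown.

```python
# Hypothetical sketch of an LLM-judge reward for GRPO on openai/gsm8k.
# The actual judge call (e.g., via the OpenAI Agents SDK) is stubbed out.

JUDGE_TEMPLATE = (
    "Question:\n{question}\n\n"
    "Ground-truth answer:\n{reference}\n\n"
    "Proposed answer:\n{proposal}\n\n"
    "Reply with exactly one word: CORRECT or INCORRECT."
)


def build_judge_prompt(question: str, reference: str, proposal: str) -> str:
    """Fill the judge template with one gsm8k example."""
    return JUDGE_TEMPLATE.format(
        question=question, reference=reference, proposal=proposal
    )


def verdict_to_reward(judge_output: str) -> float:
    """Map the judge's reply to a scalar reward for GRPO."""
    verdict = judge_output.strip().upper()
    # "INCORRECT" does not start with "CORRECT", so it maps to 0.0.
    return 1.0 if verdict.startswith("CORRECT") else 0.0
```

The binary verdict keeps the reward verifiable; a graded rubric would work the same way, with the parser mapping the judge's scale onto [0, 1].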

@jwilles jwilles requested a review from kohankhaki October 15, 2025 15:18
@jacobthebanana jacobthebanana marked this pull request as ready for review October 29, 2025 05:28
@kohankhaki
Collaborator

@jacobthebanana I’ll review this in multiple passes since it’s fairly large.
First I’ll focus on overall structure, then I’ll come back for details.

Collaborator

@kohankhaki kohankhaki left a comment


I have left a few comments.
I might be wrong about a few of them, so please let me know if you think I am.

Below are the general comments:

  • Several classes and helper functions lack docstrings. Please add them. :)

Collaborator


The starters/ directory no longer exists; everything is now under templates/.
RLVR is currently placed as a top-level directory (templates/src/rlvr/) alongside llm/, mlp/, and vlm/. Since RLVR (Reinforcement Learning with Verifiable Rewards) is a type of RL, it should be organized under the rl/ directory to match the documented structure.

Collaborator


Please remove the starters/ directory.

- dataset (the example uses `openai/gsm8k`)
- evaluation scheme and LLM judge setup

## Optional- Observability Integration
Collaborator


Add reference to LangFuse.

Collaborator Author


How about "## Optional- Observability Integration via LangFuse"

Comment on lines 23 to 54
## Setup

Basics: you will need uv and a working PyTorch installation. Running from within a container is possible, but make sure SLURM commands are available inside the container.

### Option A- running vLLM in uv venv

Make sure vLLM runs in your environment.

- Create a vLLM uv venv following the [instructions from vLLM](https://docs.vllm.ai/en/stable/getting_started/installation/gpu.html#set-up-using-python)
- Make a copy of [run_in_venv.sh](/run_in_venv.sh) and point it at your newly created vLLM venv.
- Remember to `chmod a+x <your_script.sh>`
- Verify that, in a new GPU job, `<your_script.sh> uv run vllm serve <EXAMPLE_MODEL_NAME>` launches the vLLM server.

Examples: clusters running modern Linux distributions for which pre-built vLLM wheels are available

- Vector Institute Killarney
- Mila TamIA

### Option B- running vLLM via Singularity

Installing vLLM directly can be difficult on some clusters, e.g., ones running an unusual Linux distribution. As long as vLLM runs through Singularity in these environments, these reference implementations will work there as well. Steps:

- Make sure you can manually spin up Singularity and run `vllm serve` from within the GPU container.
- Make a copy of [run_in_container.sh](/run_in_container.sh) and point it at your Singularity image. Remember that the wrapped command is passed through `$@` to Singularity.
- Remember to `chmod a+x <your_script.sh>`
- Verify that, in a new GPU job, `<your_script.sh> uv run vllm serve <EXAMPLE_MODEL_NAME>` launches the vLLM server.

Examples:

- Vector Institute Bon Echo
- Compute Canada Narval

Collaborator


The setup section includes references to clusters outside Vector Institute (Mila TamIA, Compute Canada Narval). Since this is a Vector-specific playbook, it should focus only on Vector clusters. Also, the setup section currently presents these as "Option A" and "Option B". I'd like to confirm:

Are these actual options (users can choose either method), or are they requirements determined by the cluster?

If they're requirements (not choices), then:

  • Killarney users must use the venv method (currently "Option A")
  • Bon Echo users must use the Singularity method (currently "Option B")

If this is the case, the README should be restructured to:

  1. Remove "Option A/Option B" framing - present as cluster-specific requirements instead
  2. Remove references to non-Vector clusters (Mila TamIA, Compute Canada Narval) since this is Vector-specific
  3. Clarify cluster-to-method mapping at the top of the section

Additionally, please remove the redundant PyTorch/uv mention, since it's already covered in the main templates README.

Collaborator Author


Thanks for the suggestions; working on it.

        )
        return submitit_job, watcher_task

    def stop(self):
Collaborator


Can you please add a bit more documentation here on how this works?

pyproject.toml Outdated
select = ["A","B","COM","C4","RET","SIM","ICN","Q","RSE","D","E","F","I","W","N","ERA","PL"]
fixable = ["A","B","COM","C4","RET","SIM","ICN","Q","RSE","D","E","F","I","W","N","ERA","PL"]
ignore = ["B905","E501","D203","D213","PLR2004","PLR0913","COM812"]
ignore = ["B905","E501","D203","D213","PLC0415","PLR2004","PLR0913","COM812", "ERA001"]
Collaborator


Please don't change this.

Collaborator Author


These ignores can be justified:

PLC0415 lets us import packages only when necessary.

ERA001 flags commented-out code and triggers many false positives.

Would it be better to suppress these inline instead?
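For reference, the inline suppressions being discussed might look like the sketch below. The function and its body are hypothetical; only the `# noqa: PLC0415` and `# noqa: ERA001` comment forms are the point.

```python
def load_heavy_backend() -> str:
    """Hypothetical helper that defers a heavy import until needed."""
    # A non-top-level import keeps module startup fast; ruff's PLC0415
    # would normally flag it, so it is suppressed inline.
    import json  # noqa: PLC0415

    # config = json.load(open("config.json"))  # noqa: ERA001
    return json.dumps({"backend": "vllm"})
```

Inline `noqa` keeps the global ruff config strict while documenting each exception at the exact line that needs it.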

- moved ignores inline
- misc. whitespace fixes
- cleaned up hardcoded values
Collaborator

@kohankhaki kohankhaki left a comment


Thank you for making the changes. :)
Below are some changes from the previous review that still need to be made:

  1. Directory structure: still at rlvr/, not rl/rlvr/.
  2. policy_agent and eval_agent are defined as global variables at module level (outside any class). I get your point that people who need a different agent would have to change the config, but I think that's fine. These templates are not meant to be used as-is; people can layer their modifications on top. Moving constants into the config keeps the code clean.
  3. I agree with what you said about noqa. Let's keep your changes.
  4. Also, can you please fix the README as mentioned in the previous review? It still has the Option A/B framing and references to other clusters.
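To illustrate item 2 above, a config-driven alternative to module-level agent globals might look like this sketch. All names and fields here are hypothetical; the real agent construction in the PR may differ.

```python
from dataclasses import dataclass


@dataclass
class AgentConfig:
    """Illustrative config entry; real field names may differ."""

    policy_model: str = "example-policy-model"
    eval_model: str = "example-eval-model"


def build_agents(config: AgentConfig) -> tuple[str, str]:
    """Construct the policy and eval agents from config rather than
    relying on module-level globals."""
    # Stand-ins for real Agent objects; only model names are returned here.
    return config.policy_model, config.eval_model


# Swapping agents now means editing config, not module source.
policy_agent, eval_agent = build_agents(AgentConfig(policy_model="qwen-2.5-7b"))
```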

Comment on lines 19 to 35
@hydra.main(config_path=_CONFIG_PATH, config_name="_global", version_base=None)
def main(cfg: DictConfig):
    """Run entrypoint that merges local config and runs the Trainer."""
    local_cfg = OmegaConf.load(os.path.join(os.path.dirname(__file__), "config.yaml"))
    cfg = OmegaConf.merge(local_cfg, cfg)  # type: ignore
    OmegaConf.set_struct(cfg, False)

    if "trainer" in cfg:
        trainer_cfg = cfg.trainer
        cfg = OmegaConf.merge(cfg, trainer_cfg)  # type: ignore

    grpo_config = GRPOConfig.model_validate(cfg.__dict__["_content"]["trainer"])
    trainer = GRPOTrainer(grpo_config)
    metrics = trainer(grpo_config)
    for _epoch, (_item, _reward) in metrics:
        print(f"epoch: {_epoch}, reward: {_reward:.03f}, metrics: {_item}")

Collaborator


This doesn't match the template pattern; it should follow templates/src/mlp/single/launch.py:

  • fix the config merge order (line 23: swap to `merge(cfg, local_cfg)`)
  • move `set_struct` before the merge (line 24)
  • use `cfg.trainer` instead of `cfg.__dict__["_content"]["trainer"]` (line 30)
  • make the trainer's `__init__` parameterless and pass the config in `__call__` (lines 31-32)
  • use relative imports (lines 9-10)
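A minimal sketch of the parameterless-`__init__` pattern being requested, stripped of the hydra/OmegaConf specifics. The class and field names are illustrative stand-ins, not the actual GRPOTrainer interface.

```python
from dataclasses import dataclass


@dataclass
class TrainerConfig:
    """Illustrative stand-in for the real trainer config."""

    num_epochs: int = 2


class Trainer:
    """Parameterless __init__; the config is passed to __call__ instead,
    mirroring the templates/src/mlp/single/launch.py pattern."""

    def __init__(self) -> None:
        # No config-dependent setup here: construction stays cheap.
        self.steps_run = 0

    def __call__(self, config: TrainerConfig) -> list[int]:
        # All config-dependent work happens at call time.
        epochs = list(range(config.num_epochs))
        self.steps_run = len(epochs)
        return epochs


trainer = Trainer()
epochs = trainer(TrainerConfig(num_epochs=3))
```

The trade-off raised in the reply below this comment is real: setup in `__init__` can aid readability and type-checking, while a parameterless `__init__` keeps all templates uniform.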

Collaborator Author


Is there a particular reason not to include the config during __init__? Much of the setup can be done within __init__ without affecting pickling, and doing so makes __call__ a lot more readable. Besides, declaring attributes within __init__ seems to make type-checking easier.

@jacobthebanana
Collaborator Author

@codex review
