Conversation


@jacobthebanana jacobthebanana commented Sep 17, 2025

This pull request includes:

  • GRPO example on openai/gsm8k, using an LLM judge to compare the proposed answer against the ground truth.
  • Agent SDK integration: define the environment using the familiar OpenAI Agents SDK and run RL on the LLM powering the agent. Not yet tested on multi-agent setups (agent-as-tool or handoff).
  • Extensive typing for simplified function signatures and IDE support: static type checking, pyright lints, and proper autocompletion even within the training loop.
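As a rough illustration of the LLM-judge reward described above, here is a minimal sketch. The prompt template, function names, and one-word verdict protocol are all hypothetical, not the PR's actual interfaces; the judge model call itself is omitted, since only prompt construction and verdict parsing are shown.

```python
# Hypothetical sketch of an LLM-judge reward for GRPO on openai/gsm8k.
# The actual judge call (e.g., via the OpenAI Agents SDK) is stubbed out.

JUDGE_TEMPLATE = (
    "Question:\n{question}\n\n"
    "Ground-truth answer:\n{reference}\n\n"
    "Proposed answer:\n{proposal}\n\n"
    "Reply with exactly one word: CORRECT or INCORRECT."
)


def build_judge_prompt(question: str, reference: str, proposal: str) -> str:
    """Fill the judge template with one gsm8k example."""
    return JUDGE_TEMPLATE.format(
        question=question, reference=reference, proposal=proposal
    )


def verdict_to_reward(judge_output: str) -> float:
    """Map the judge's reply to a scalar reward for GRPO."""
    verdict = judge_output.strip().upper()
    # "INCORRECT" does not start with "CORRECT", so it maps to 0.0.
    return 1.0 if verdict.startswith("CORRECT") else 0.0
```

The binary verdict keeps the reward verifiable; a graded rubric would work the same way, with the parser mapping the judge's scale onto [0, 1].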

@jwilles jwilles requested a review from kohankhaki October 15, 2025 15:18
@jacobthebanana jacobthebanana marked this pull request as ready for review October 29, 2025 05:28
@kohankhaki
Collaborator

@jacobthebanana I’ll review this in multiple passes since it’s fairly large.
First I’ll focus on overall structure, then I’ll come back for details.

Collaborator

@kohankhaki kohankhaki left a comment


I have left a few comments.
I might be wrong about a few of them, so please let me know if you think I am.

Below are the general comments:

  • Several classes and helper functions lack docstrings. Please add them. :)

Collaborator


The starters/ directory no longer exists; everything is now under templates/.
RLVR is currently placed as a top-level directory (templates/src/rlvr/) alongside llm/, mlp/, and vlm/. Since RLVR (Reinforcement Learning with Verifiable Rewards) is a type of RL, it should be organized under the rl/ directory to match the documented structure.

Collaborator


Please remove the starters/ directory.

- dataset (the example uses `openai/gsm8k`)
- evaluation scheme and LLM judge setup

## Optional- Observability Integration
Collaborator


Add reference to LangFuse.

Collaborator Author


How about "## Optional- Observability Integration via LangFuse"

Comment on lines 23 to 54
## Setup

Basics: you will need uv and a working PyTorch installation. Running from within a container is possible, but make sure SLURM commands are available inside the container.

### Option A- running vLLM in uv venv

Make sure vLLM runs in your environment.

- Create a vLLM uv venv following the [instructions from vLLM](https://docs.vllm.ai/en/stable/getting_started/installation/gpu.html#set-up-using-python)
- Make a copy of [run_in_venv.sh](/run_in_venv.sh) and point it at your newly created vLLM venv.
- Remember to `chmod a+x <your_script.sh>`
- Verify that, in a new GPU job, `<your_script.sh> uv run vllm serve <EXAMPLE_MODEL_NAME>` launches the vLLM server.

Examples: clusters running modern Linux distributions for which pre-built vLLM wheels are available

- Vector Institute Killarney
- Mila TamIA

### Option B- running vLLM via Singularity

Installing vLLM directly can be difficult on some clusters, e.g., ones running an unusual Linux distribution. As long as vLLM runs through Singularity in these environments, these reference implementations will work there as well. Steps:

- Make sure you can manually spin up Singularity and run `vllm serve` from within the GPU container.
- Make a copy of [run_in_container.sh](/run_in_container.sh) and point it at your Singularity image. Remember that the wrapped command is passed through `$@` to Singularity.
- Remember to `chmod a+x <your_script.sh>`
- Verify that, in a new GPU job, `<your_script.sh> uv run vllm serve <EXAMPLE_MODEL_NAME>` launches the vLLM server.

Examples:

- Vector Institute Bon Echo
- Compute Canada Narval

Collaborator


The setup section includes references to clusters outside Vector Institute (Mila TamIA, Compute Canada Narval). Since this is a Vector-specific playbook, it should focus only on Vector clusters. Also, the setup section currently presents these as "Option A" and "Option B". I'd like to confirm:

Are these actual options (users can choose either method), or are they requirements determined by the cluster?

If they're requirements (not choices), then:

  • Killarney users must use the venv method (currently "Option A")
  • Bon Echo users must use the Singularity method (currently "Option B")

If this is the case, the README should be restructured to:

  1. Remove "Option A/Option B" framing - present as cluster-specific requirements instead
  2. Remove references to non-Vector clusters (Mila TamIA, Compute Canada Narval) since this is Vector-specific
  3. Clarify cluster-to-method mapping at the top of the section

Additionally, please remove the redundant PyTorch/uv mention, since it's already covered in the main templates README.

Collaborator Author


Thanks for the suggestions; working on it.

        )
        return submitit_job, watcher_task

    def stop(self):
Collaborator


Can you please add a bit more documentation here on how this works?

pyproject.toml Outdated
select = ["A","B","COM","C4","RET","SIM","ICN","Q","RSE","D","E","F","I","W","N","ERA","PL"]
fixable = ["A","B","COM","C4","RET","SIM","ICN","Q","RSE","D","E","F","I","W","N","ERA","PL"]
ignore = ["B905","E501","D203","D213","PLR2004","PLR0913","COM812"]
ignore = ["B905","E501","D203","D213","PLC0415","PLR2004","PLR0913","COM812", "ERA001"]
Collaborator


Please don't change this.

Collaborator Author


These ignores can be justified:

PLC0415 lets us import packages only when necessary.

ERA001 flags commented-out code and triggers many false positives.

Would it be better to suppress these inline instead?
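For reference, the inline suppressions being discussed might look like the sketch below. The function and its body are hypothetical; only the `# noqa: PLC0415` and `# noqa: ERA001` comment forms are the point.

```python
def load_heavy_backend() -> str:
    """Hypothetical helper that defers a heavy import until needed."""
    # A non-top-level import keeps module startup fast; ruff's PLC0415
    # would normally flag it, so it is suppressed inline.
    import json  # noqa: PLC0415

    # config = json.load(open("config.json"))  # noqa: ERA001
    return json.dumps({"backend": "vllm"})
```

Inline `noqa` keeps the global ruff config strict while documenting each exception at the exact line that needs it.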

- moved ignores inline
- misc. whitespace fixes
- cleaned up hardcoded values
Collaborator

@kohankhaki kohankhaki left a comment


Thank you for making the changes. :)
Below are some changes from the previous review that still need to be made:

  1. Directory structure: still at rlvr/, not rl/rlvr/.
  2. policy_agent and eval_agent are defined as global variables at module level (outside any class). I get your point that people who need a different agent would have to change the config, but I think that's fine. These templates are not meant to be used as-is; people can layer their modifications on top. Moving constants into the config keeps the code clean.
  3. I agree with what you said about noqa. Let's keep your changes.
  4. Also, can you please fix the README as mentioned in the previous review? It still has the Option A/B framing and references to other clusters.
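To illustrate item 2 above, a config-driven alternative to module-level agent globals might look like this sketch. All names and fields here are hypothetical; the real agent construction in the PR may differ.

```python
from dataclasses import dataclass


@dataclass
class AgentConfig:
    """Illustrative config entry; real field names may differ."""

    policy_model: str = "example-policy-model"
    eval_model: str = "example-eval-model"


def build_agents(config: AgentConfig) -> tuple[str, str]:
    """Construct the policy and eval agents from config rather than
    relying on module-level globals."""
    # Stand-ins for real Agent objects; only model names are returned here.
    return config.policy_model, config.eval_model


# Swapping agents now means editing config, not module source.
policy_agent, eval_agent = build_agents(AgentConfig(policy_model="qwen-2.5-7b"))
```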

Comment on lines 19 to 35
@hydra.main(config_path=_CONFIG_PATH, config_name="_global", version_base=None)
def main(cfg: DictConfig):
    """Run entrypoint that merges local config and runs the Trainer."""
    local_cfg = OmegaConf.load(os.path.join(os.path.dirname(__file__), "config.yaml"))
    cfg = OmegaConf.merge(local_cfg, cfg)  # type: ignore
    OmegaConf.set_struct(cfg, False)

    if "trainer" in cfg:
        trainer_cfg = cfg.trainer
        cfg = OmegaConf.merge(cfg, trainer_cfg)  # type: ignore

    grpo_config = GRPOConfig.model_validate(cfg.__dict__["_content"]["trainer"])
    trainer = GRPOTrainer(grpo_config)
    metrics = trainer(grpo_config)
    for _epoch, (_item, _reward) in metrics:
        print(f"epoch: {_epoch}, reward: {_reward:.03f}, metrics: {_item}")

Collaborator


This doesn't match the template pattern; it should follow templates/src/mlp/single/launch.py:

  • fix the config merge order (line 23: swap to `merge(cfg, local_cfg)`)
  • move `set_struct` before the merge (line 24)
  • use `cfg.trainer` instead of `cfg.__dict__["_content"]["trainer"]` (line 30)
  • make the trainer's `__init__` parameterless and pass the config in `__call__` (lines 31-32)
  • use relative imports (lines 9-10)
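A minimal sketch of the parameterless-`__init__` pattern being requested, stripped of the hydra/OmegaConf specifics. The class and field names are illustrative stand-ins, not the actual GRPOTrainer interface.

```python
from dataclasses import dataclass


@dataclass
class TrainerConfig:
    """Illustrative stand-in for the real trainer config."""

    num_epochs: int = 2


class Trainer:
    """Parameterless __init__; the config is passed to __call__ instead,
    mirroring the templates/src/mlp/single/launch.py pattern."""

    def __init__(self) -> None:
        # No config-dependent setup here: construction stays cheap.
        self.steps_run = 0

    def __call__(self, config: TrainerConfig) -> list[int]:
        # All config-dependent work happens at call time.
        epochs = list(range(config.num_epochs))
        self.steps_run = len(epochs)
        return epochs


trainer = Trainer()
epochs = trainer(TrainerConfig(num_epochs=3))
```

The trade-off raised in the reply below this comment is real: setup in `__init__` can aid readability and type-checking, while a parameterless `__init__` keeps all templates uniform.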

Collaborator Author


Is there a particular reason not to include the config during __init__? Much of the setup can be done within __init__ without affecting pickling, and doing so makes __call__ a lot more readable. Besides, declaring attributes within __init__ seems to make type-checking easier.

@jacobthebanana
Collaborator Author

@codex review
