Add RL on Verifiable Reward (RLVR) reference implementations #5
base: main
Replaced base model with Qwen2.5-1.5B-Instruct in example trainer.
Enables reuse of the backprop GPU for rollout.
@jacobthebanana I’ll review this in multiple passes since it’s fairly large.
kohankhaki left a comment:
I have left a few comments.
I might be wrong about a few of them, so please let me know if you think I am.
Below are the general comments:
- Several classes and helper functions lack docstrings. Please add them. :)
starters/llm_fine_tuning/README.md
Outdated
The starters/ directory no longer exists; everything is now under templates/.
RLVR is currently a top-level directory (templates/src/rlvr/) alongside llm/, mlp/, and vlm/. Since RLVR (Reinforcement Learning with Verifiable Reward) is a type of RL, it should be moved under the rl/ directory to match the documented structure.
Please remove the starters/ directory.
templates/src/rlvr/README.md
Outdated
```markdown
- dataset (the example uses `openai/gsm8k`)
- evaluation scheme and LLM judge setup

## Optional- Observability Integration
```
Add reference to LangFuse.
How about "## Optional- Observability Integration via LangFuse"
templates/src/rlvr/README.md
Outdated
```markdown
## Setup

Basic: you will need uv and a working PyTorch installation. Running from within a container is possible, but you need to make sure SLURM commands are available from within the container.

### Option A- running vLLM in uv venv

Make sure vLLM runs on your environment.

- Create vLLM uv venv following [instructions from vllm](https://docs.vllm.ai/en/stable/getting_started/installation/gpu.html#set-up-using-python)
- Make a copy of [run_in_venv.sh](/run_in_venv.sh) and point the uv venv to your newly-created vllm venv.
- Remember to `chmod a+x <your_script.sh>`
- Make sure that in a new GPU job, `<your_script.sh> uv run vllm serve <EXAMPLE_MODEL_NAME>` launches the vLLM server.

Example: clusters running modern Linux distros for which pre-built vLLM wheels are available

- Vector Institute Killarney
- Mila TamIA

### Option B- running vLLM via Singularity

It might be difficult to install vLLM on some clusters- e.g., unusual Linux distribution. As long as vLLM runs through Singularity on these environments, these reference implementations would work there as well. Steps:

- Make sure you can manually spin up singularity and run `vllm serve` from within the GPU container.
- Make a copy of [run_in_container.sh](/run_in_container.sh) and point to your singularity image. Remember, all scripts will be added to `$@` and sent to singularity.
- Remember to `chmod a+x <your_script.sh>`
- Make sure that in a new GPU job, `<your_script.sh> uv run vllm serve <EXAMPLE_MODEL_NAME>` launches the vLLM server.

Examples:

- Vector Institute Bon Echo
- Compute Canada Narval
```
The setup section includes references to clusters outside the Vector Institute (Mila TamIA, Compute Canada Narval). Since this is a Vector-specific playbook, it should focus only on Vector clusters. Also, the setup section currently presents these as "Option A" and "Option B". I'd like to confirm: are these actual options (users can choose either method), or are they requirements determined by the cluster?
If they're requirements (not choices), then:
- Killarney users must use the venv method (currently "Option A")
- Bon Echo users must use the Singularity method (currently "Option B")
If this is the case, the README should be restructured to:
- Remove the "Option A/Option B" framing and present these as cluster-specific requirements instead
- Remove references to non-Vector clusters (Mila TamIA, Compute Canada Narval), since this is Vector-specific
- Clarify the cluster-to-method mapping at the top of the section
Additionally, please remove the redundant PyTorch/uv mention, since it's already in the main templates README.
Thanks for the suggestions; working on it.
```python
        )
        return submitit_job, watcher_task

    def stop(self):
```
Can you please add a bit more documentation on how this works?
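A hedged sketch of what such documentation could look like. The class name `VLLMServerLauncher`, the `_job`/`_watcher_task` attributes, and the cancellation order are assumptions for illustration, not code from this PR:

```python
class VLLMServerLauncher:
    """Hypothetical wrapper around the submitit job + watcher pair above."""

    def __init__(self):
        self._job = None           # e.g., the submitit job returned by the launcher
        self._watcher_task = None  # e.g., the watcher task returned by the launcher

    def stop(self):
        """Stop the managed vLLM server.

        Cancels the watcher first so it does not observe the job
        disappearing and report a spurious failure, then cancels the
        SLURM job itself. Safe to call more than once (idempotent).
        """
        if self._watcher_task is not None:
            self._watcher_task.cancel()
            self._watcher_task = None
        if self._job is not None:
            self._job.cancel()  # submitit Job objects expose .cancel()
            self._job = None
```

Spelling out the shutdown order and the idempotency guarantee in the docstring is usually the documentation reviewers are asking for here.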
pyproject.toml
Outdated
```diff
 select = ["A","B","COM","C4","RET","SIM","ICN","Q","RSE","D","E","F","I","W","N","ERA","PL"]
 fixable = ["A","B","COM","C4","RET","SIM","ICN","Q","RSE","D","E","F","I","W","N","ERA","PL"]
-ignore = ["B905","E501","D203","D213","PLR2004","PLR0913","COM812"]
+ignore = ["B905","E501","D203","D213","PLC0415","PLR2004","PLR0913","COM812", "ERA001"]
```
Please don't change this.
- moved ignores inline
- misc. whitespace fixes
- cleaned up hardcoded values
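For readers unfamiliar with "moved ignores inline": this means suppressing a Ruff rule on the specific line that triggers it instead of ignoring it project-wide in pyproject.toml. A minimal illustration (the function and the rule pairing are hypothetical, not lines from the PR):

```python
def lazy_load() -> str:
    # Deliberate function-level import (defers a heavy dependency), so we
    # silence Ruff's PLC0415 on this line only, rather than project-wide.
    import json  # noqa: PLC0415

    return json.dumps({"ok": True})

print(lazy_load())  # {"ok": true}
```

The inline form keeps the project-wide `ignore` list short and makes each suppression visible at the site that needs it.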
kohankhaki left a comment:
Thank you for making the changes. :)
Below are some changes from the previous review that still need to be made:
- Directory structure: still at `rlvr/`, not `rl/rlvr/`.
- `policy_agent` and `eval_agent` are defined as global variables at module level (outside any class). I get your point that if people need a different agent, they need to change the config, but I think that should be fine. These templates are not supposed to be used as-is; people can build their modifications on top. Moving constants to the config makes the code clean.
- I agree with what you said about `noqa`. Let's keep your changes.
- Also, can you please fix the README as I mentioned in the previous review? It still has Option A/B and references to other clusters.
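The point about module-level agent globals can be sketched as follows; `AGENTS`, `make_agent`, and the config keys are hypothetical stand-ins, not names from this PR:

```python
# Instead of module-level globals like `policy_agent = SomeAgent(...)`,
# resolve the agent from the config at construction time.
AGENTS = {
    "policy": lambda: "policy-agent-instance",  # stand-in for a real agent factory
    "eval": lambda: "eval-agent-instance",
}

def make_agent(cfg: dict, role: str):
    """Build the agent named in the config rather than reading a global."""
    return AGENTS[cfg[f"{role}_agent"]]()

cfg = {"policy_agent": "policy", "eval_agent": "eval"}
print(make_agent(cfg, "policy"))  # policy-agent-instance
```

With this shape, swapping agents is a config edit rather than a source edit, which is the cleanliness argument made above.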
templates/src/rlvr/grpo/launch.py
Outdated
```python
@hydra.main(config_path=_CONFIG_PATH, config_name="_global", version_base=None)
def main(cfg: DictConfig):
    """Run entrypoint that merges local config and runs the Trainer."""
    local_cfg = OmegaConf.load(os.path.join(os.path.dirname(__file__), "config.yaml"))
    cfg = OmegaConf.merge(local_cfg, cfg)  # type: ignore
    OmegaConf.set_struct(cfg, False)

    if "trainer" in cfg:
        trainer_cfg = cfg.trainer
        cfg = OmegaConf.merge(cfg, trainer_cfg)  # type: ignore

    grpo_config = GRPOConfig.model_validate(cfg.__dict__["_content"]["trainer"])
    trainer = GRPOTrainer(grpo_config)
    metrics = trainer(grpo_config)
    for _epoch, (_item, _reward) in metrics:
        print(f"epoch: {_epoch}, reward: {_reward:.03f}, metrics: {_item}")
```
This doesn't match the template pattern; it should follow `templates/src/mlp/single/launch.py`:
- fix the config merge order (line 23: swap to `merge(cfg, local_cfg)`)
- move `set_struct` before the merge (line 24)
- use `cfg.trainer` instead of `cfg.__dict__["_content"]["trainer"]` (line 30)
- make the trainer's `__init__` parameterless and pass the config in `__call__` (lines 31-32)
- use relative imports (lines 9-10)
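The merge-order fix matters because OmegaConf-style merges are later-wins: whichever config is passed last overrides the earlier ones at each key. A stdlib sketch of that precedence rule, with plain flat dicts standing in for the configs (all names illustrative):

```python
def merge(*configs: dict) -> dict:
    """Later-wins merge over flat dicts, mimicking OmegaConf.merge precedence."""
    out: dict = {}
    for c in configs:
        out.update(c)  # each later config overwrites earlier keys
    return out

global_cfg = {"lr": 1e-3, "epochs": 10}  # hypothetical _global defaults
local_cfg = {"lr": 5e-4}                 # hypothetical per-template config.yaml

# merge(cfg, local_cfg): the local config overrides the global defaults
assert merge(global_cfg, local_cfg)["lr"] == 5e-4
# merge(local_cfg, cfg): the global defaults clobber the local values
assert merge(local_cfg, global_cfg)["lr"] == 1e-3
```

(OmegaConf merges nested configs recursively, but at any individual leaf key the same later-wins rule applies, so argument order determines which config "wins".)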
Is there a particular reason not to include the config during `__init__`? Much of the setup can be done within `__init__` without affecting pickling, which can make `__call__` a lot more readable. Besides, declaring attributes within `__init__` seems to make type-checking easier.
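The trade-off being discussed can be sketched with two toy trainers; the class names and the `lr` key are illustrative, not the PR's code:

```python
class TrainerA:
    """Config in __init__: attributes are declared up front, easing type-checking."""

    def __init__(self, cfg: dict):
        self.lr = cfg["lr"]

    def __call__(self) -> float:
        return self.lr  # __call__ stays short; setup already happened


class TrainerB:
    """Template pattern: parameterless __init__, config passed to __call__."""

    def __init__(self):
        pass  # nothing to pickle except an empty object

    def __call__(self, cfg: dict) -> float:
        return cfg["lr"]  # all setup happens per-call


# Both yield the same result; the difference is where setup and typing live.
assert TrainerA({"lr": 0.1})() == TrainerB()({"lr": 0.1})
```

Which shape is preferable is a repo convention question, which is exactly what the review thread is negotiating here.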
@codex review
This pull request includes:
- an `openai/gsm8k` example using an LLM judge to compare the proposed answer against the ground truth.