feature: Add RTPO Trainer #4652
base: main
Conversation
```
    schedule_type (`str`,defaults to linear):
        Choose a schedule type for AnnealingScheduler to control thinking guidance length.
        Supports: linear, cosine, exponential, piecewise, constant
```
```diff
-    schedule_type (`str`,defaults to linear):
-        Choose a schedule type for AnnealingScheduler to control thinking guidance length.
-        Supports: linear, cosine, exponential, piecewise, constant
+    schedule_type (`str`, *optional*, defaults to `"linear"`):
+        Choose a schedule type for AnnealingScheduler to control thinking guidance length. Supports: `"linear"`,
+        `"cosine"`, `"exponential"`, `"piecewise"`, `"constant"`.
```
Can you try to align the docstring with the rest of the codebase? Above is an example.
```
        Constant value for constant schedule.
    """

    schedule_type: str = "linear"
```
Same here, we usually use `dataclasses.field`. You can take the GRPOConfig as an example.
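For context, a minimal sketch of the pattern the reviewer is pointing at, assuming an `RTPOConfig` dataclass like the one quoted above; the `GRPOConfig` base class and the help-text wording are assumptions, not the PR's actual code:

```python
from dataclasses import dataclass, field

from trl import GRPOConfig


@dataclass
class RTPOConfig(GRPOConfig):
    # Declared with dataclasses.field so the help text is picked up by the argument parser,
    # mirroring how GRPOConfig declares its own options.
    schedule_type: str = field(
        default="linear",
        metadata={
            "help": "Schedule type for the AnnealingScheduler controlling thinking guidance length. "
            "Supported: 'linear', 'cosine', 'exponential', 'piecewise', 'constant'."
        },
    )
```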
Thanks for the PR! Can you:
You can take your other PR #4334 as an example.
Reverse Thinking Policy Optimization (RTPO)
This PR introduces Reverse Thinking Policy Optimization (RTPO), a new RL training method for LLMs built on top of GRPOTrainer.
Motivation
Current GRPO-based RL methods require the model to autonomously generate a full chain-of-thought before producing the final answer.
However, many training datasets already contain complete, high-quality reasoning traces that the model could benefit from.
RTPO is designed to:
Method Overview
RTPO modifies the standard GRPO rollout process:
Full Auxiliary CoT Injection
At rollout step 0, the full reasoning chain from the dataset is concatenated into the input prompt (a rough sketch of this step follows below).
Model behavior:
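As a rough sketch of the injection step described above (not the PR's actual code; the helper name, the `<think>` delimiter, and whitespace-level truncation are assumptions for illustration):

```python
def inject_auxiliary_cot(prompt: str, reference_cot: str, keep_ratio: float = 1.0) -> str:
    """Append a (possibly truncated) reference reasoning trace to the rollout prompt.

    At rollout step 0 the full trace is injected (keep_ratio=1.0); later steps pass a
    smaller keep_ratio produced by the annealing schedule described below.
    """
    cot_tokens = reference_cot.split()
    n_keep = int(len(cot_tokens) * keep_ratio)
    truncated_cot = " ".join(cot_tokens[:n_keep])
    # The model continues reasoning from the injected prefix and then produces the answer.
    return f"{prompt}\n<think>\n{truncated_cot}"
```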
Reverse Annealing of Auxiliary CoT
As training steps increase, RTPO gradually removes tokens from the end of the auxiliary CoT, following a configurable schedule (linear, cosine, exponential, piecewise, or constant); a rough scheduler sketch is given after this subsection.
Expected Model behavior:
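A rough sketch of such a scheduler, reusing the `AnnealingScheduler` name and the constant-schedule value mentioned in the config quoted above; the method name and the exact decay formulas are guesses, not the PR's implementation, and the piecewise variant is omitted because its parameters are not visible here:

```python
import math


class AnnealingScheduler:
    """Maps the current training step to the fraction of auxiliary CoT tokens to keep."""

    def __init__(self, schedule_type: str = "linear", total_steps: int = 1000, constant_value: float = 1.0):
        self.schedule_type = schedule_type
        self.total_steps = total_steps
        self.constant_value = constant_value

    def keep_ratio(self, step: int) -> float:
        progress = min(step / self.total_steps, 1.0)
        if self.schedule_type == "linear":
            return 1.0 - progress  # full CoT at step 0, none by the end of training
        if self.schedule_type == "cosine":
            return 0.5 * (1.0 + math.cos(math.pi * progress))
        if self.schedule_type == "exponential":
            return math.exp(-5.0 * progress)
        if self.schedule_type == "constant":
            return self.constant_value
        raise ValueError(f"Unknown schedule_type: {self.schedule_type!r}")
```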
Interesting Finding: Emergent Shorter Reasoning
Unexpectedly, RTPO also teaches the model to shorten its reasoning:
More experiments are ongoing and will be included later.
Files Added / Modified
- `trl/experimental/rtpo/__init__.py`
- `trl/experimental/rtpo/rtpo_config.py`
- `trl/experimental/rtpo/rtpo_trainer.py`
Example Usage
Grabbed from my repo named AVR.
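The original example block did not survive here; below is a minimal usage sketch under the assumption that the new module exports `RTPOConfig` and `RTPOTrainer` (names inferred from the added files) and follows the GRPOTrainer calling convention. The dataset name, reward function, and model id are placeholders:

```python
from datasets import load_dataset

from trl.experimental.rtpo import RTPOConfig, RTPOTrainer  # assumed exports of the new module

# Hypothetical dataset that carries a reference reasoning trace alongside each prompt.
dataset = load_dataset("my-org/math-with-reasoning-traces", split="train")


def short_answer_reward(completions, **kwargs):
    # Toy reward: prefer shorter completions.
    return [float(len(completion) < 512) for completion in completions]


training_args = RTPOConfig(
    output_dir="Qwen2.5-0.5B-RTPO",
    schedule_type="linear",  # annealing schedule for the auxiliary CoT, as quoted in the review above
)

trainer = RTPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=short_answer_reward,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```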
Status
Request for Review