[Cookbook Recipe of Using Rubric] #166
Conversation
Claude finished @ruiqi-zhong's task in 1m 9s. Reviewing PR #166: Very simple example of using rubric
Summary
This PR adds a well-structured example demonstrating rubric-based LLM grading for RL training on an addition task. The code is pedagogically clear and follows the cookbook's established patterns. Below are my observations:
✅ Strengths
🔧 Suggestions for Improvement
1. Duplicate import in
Claude finished @ruiqi-zhong's task in 1m 21s. Reviewing PR #166: Very simple example of using rubric
Summary
This PR adds a well-structured pedagogical example demonstrating rubric-based LLM grading for RL training. The code follows the cookbook's established patterns (builder pattern, ...).
🐛 Bug Found
In `__call__`, `json.loads(line)` produces a dict, but `RubricBasedDatapoint.from_json` expects a JSON string:
```python
def __call__(self) -> Sequence[RubricBasedDatapoint]:
    datapoints = []
    with open(self.jsonl_path, "r") as f:
        for line in f:
            data = json.loads(line)  # data is a dict
            datapoints.append(RubricBasedDatapoint.from_json(data))  # but from_json expects str
    return datapoints
```
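One possible fix, sketched under the assumption that `from_json` takes the raw JSON string (the actual signature is defined in `data.py`); alternatively, a dict-accepting constructor could be used:

```python
def __call__(self) -> Sequence[RubricBasedDatapoint]:
    datapoints = []
    with open(self.jsonl_path, "r") as f:
        for line in f:
            # Pass the raw JSON string, which is what from_json expects.
            datapoints.append(RubricBasedDatapoint.from_json(line))
    return datapoints
```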
✅ Strengths
🔧 Minor Suggestions
📝 Notes
Verdict
The PR is a solid pedagogical addition. The type mismatch bug in `__call__` should be addressed.
Rubric-based Grading for LLMs
- `data.py` contains the definition of the datapoint class. Each datapoint consists of a conversation prefix and a list of rubric items.
- `generate_data.py` generates some example datapoints if you want to run our demo on addition.
- `env.py` determines what each rollout will do. It lets the policy read the prefix, generate a response, asks a grader LLM to grade the response against a list of rubric items, and finally provides a reward by summing the graders' responses.
- `train.py` allows you to train LLMs on any dataset saved in our format (specified in `data.py`). The default script trains on the addition task, whose data is generated by `generate_data.py`.
- `prometheus_experimental.py` contains a script to train LLMs based on the rubrics from the `prometheus-eval/Feedback-Collection` dataset. It is experimental, though -- even though the reward goes up, there is no guarantee that the model is actually better. We hope our script serves as a starting point, and more research is needed.

A simple example of using a grader LLM with rubrics
We show how to use a rubric-based grader LLM to provide rewards on an addition task, e.g., a question whose ground-truth answer is 333.
Usually, this could be graded by matching the number to the ground truth 333 without needing an LLM. However, for pedagogical purposes we will grade the response using a language model with a rubric, i.e., we will ask a language model "Does the assistant answer 333?"
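As a rough sketch of this grading scheme (not the actual `env.py` implementation; `grade_with_llm`, `rubric_reward`, and the example numbers are made up for illustration):

```python
def grade_with_llm(rubric_item: str, response: str) -> float:
    """Hypothetical stand-in for the grader LLM: return 1.0 for a "yes" verdict.

    For the addition demo, "Does the assistant answer 333?" reduces to checking
    whether "333" appears in the response.
    """
    return 1.0 if "333" in response else 0.0


def rubric_reward(response: str, rubric_items: list[str]) -> float:
    """The reward is the sum of the grader's verdicts over all rubric items."""
    return sum(grade_with_llm(item, response) for item in rubric_items)


print(rubric_reward("123 + 210 = 333", ["Does the assistant answer 333?"]))  # 1.0
```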
Generate an example dataset
To run this, first generate a dataset:
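The exact command depends on where this recipe lives in your checkout; a hypothetical invocation:

```bash
# Hypothetical invocation -- adjust the path to wherever generate_data.py lives.
python generate_data.py
```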
Then you will see two `jsonl` files generated, one for training and one for testing. For example, if you look into `tinker_cookbook/example_data/example_rubric_train.jsonl`, each datapoint consists of a conversation prefix and a list of rubric items.
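A rough illustration of what one line of the `jsonl` file might look like (the field names and numbers are illustrative; the actual schema is defined in `data.py`):

```json
{"conversation_prefix": [{"role": "user", "content": "What is 123 + 210?"}], "rubric_items": ["Does the assistant answer 333?"]}
```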
Debugging and Printing What Happens During Rollouts
Run the rollout debugging script, and you can see the message that the policy sees, its response, the grader input, and the grader output.
An example training run
To train the LLM to add with a rubric-based grader LLM, run:
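A hypothetical invocation (the actual entry point or flags may differ):

```bash
# Hypothetical invocation -- train.py defaults to the addition dataset generated above.
python train.py
```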
You can see that the reward quickly goes up.
A more realistic dataset
We take the `prometheus-eval/Feedback-Collection` dataset from Hugging Face, which contains rubrics to grade general chat responses. Run the following to kick off training:
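A hypothetical invocation (the actual entry point or flags may differ):

```bash
# Hypothetical invocation -- prometheus_experimental.py trains with Feedback-Collection rubrics.
python prometheus_experimental.py
```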
We can see that the reward climbs up steadily. Note that this training recipe is experimental -- to make the performance better, we may need to fine-tune the grader LLM as well. We hope our code serves as a starting point for you to improve rubric-based grading for training LLMs!