
Conversation

@ruiqi-zhong
Contributor

@ruiqi-zhong ruiqi-zhong commented Dec 13, 2025

Rubric-based Grading for LLMs

  • data.py contains the definition of the datapoint class. Each datapoint consists of a conversation prefix and a list of rubric items (see the sketch after this list).
  • generate_data.py generates example datapoints if you want to run our demo on addition.
  • env.py determines what each rollout does. It lets the policy read the prefix and generate a response, asks a grader LLM to grade the response against the list of rubric items, and finally provides a reward by summing the graders' scores.
  • train.py allows you to train LLMs on any dataset saved in our format (specified in data.py). The default script trains on the addition task, whose data is generated by generate_data.py.
  • prometheus_experimental.py contains a script to train LLMs on the rubrics from the prometheus-eval/Feedback-Collection dataset. It is experimental, though -- even though the reward goes up, there is no guarantee that the model is actually better. We hope our script serves as a starting point; more research is needed.
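
For orientation, here is a minimal sketch of what these structures amount to, with field names taken from the JSON example below (an assumption-laden sketch; the actual classes in data.py carry extra methods, e.g. for serialization and score extraction):

from dataclasses import dataclass

# Sketch only: field names follow the JSON example; the real classes may differ.

@dataclass
class Rubric:
    rubric_str: str                        # the question the grader answers
    extraction_regex: str                  # how to pull the score out of the grader's reply
    grader_output_format_instruction: str  # tells the grader how to format its score

@dataclass
class RubricBasedDatapoint:
    convo: list[dict[str, str]]            # conversation prefix the policy sees
    rubric_items: list[Rubric]             # rubrics used to grade the policy's response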

A simple example of using a grader LLM with rubrics

We show how to use a rubric-based grader LLM to provide rewards on an addition task. For example:

**User**: What's 233 + 100?
**Assistant**: 333

Usually, this could be graded by matching the number against the ground truth 333 without needing an LLM. However, for pedagogical purposes we will grade the response using a language model with a rubric, i.e., we will ask a language model: "Does the assistant answer 333?"
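
Under the hood, the grader receives the rubric question together with a formatting instruction, and replies with a wrapped score. An illustrative exchange (the exact prompt template is constructed in env.py):

**Grader input**: Does the assistant answer 333? Please output your score between 0 and 1 wrapped in <score> ... </score>
**Grader output**: <score>1</score>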

Generate an example dataset

To run this, first generate a dataset:

python -m tinker_cookbook.recipes.rubric.generate_data

Then you will see two jsonl files generated, one for training and one for testing. For example, if you look into tinker_cookbook/example_data/example_rubric_train.jsonl, each datapoint consists of

  • a convo (the conversation prefix that the policy sees)
  • rubric_items: a list of rubric items that specify what counts as a good response, how the grader should format its output, and how the grading result should be extracted (see the extraction sketch after the example).
{
  "convo": [
    {
      "role": "user",
      "content": "What is 4 + 5?"
    },
    {
      "role": "assistant",
      "content": "9"
    },
    {
      "role": "user",
      "content": "What is 122 + 12?"
    }
  ],
  "rubric_items": [
    {
      "rubric_str": "Does the chatbot correctly gets the answer 134?",
      "extraction_regex": "<score>(.*)</score>",
      "grader_output_format_instruction": "Please output your score between 0 and 1 wrapped in <score> ... </score>"
    }
  ]
}
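
At reward time, the grader's free-form reply is reduced to a number via the extraction_regex. A minimal sketch of what that extraction amounts to (the real logic lives on the Rubric class and may handle unparseable replies differently):

import re

def extract_score(grader_output: str, extraction_regex: str) -> float:
    # Find the wrapped score in the grader's reply, e.g. "<score>1</score>" -> 1.0
    match = re.search(extraction_regex, grader_output)
    if match is None:
        return 0.0  # assumption: treat an unparseable reply as zero reward
    return float(match.group(1))

print(extract_score("<score>1</score>", r"<score>(.*)</score>"))  # 1.0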

Debugging and Printing What Happens During Rollouts

Run

python -m tinker_cookbook.recipes.rubric.debug_env

You can see the messages that the policy sees, its response, the grader input, and the grader output.

[image: rollout debug output]

An example training run

To train the LLM to add with a rubric-based LLM, run

python -m tinker_cookbook.recipes.rubric.train

You can see that the reward goes up quickly.

[image: reward curve on the addition task]

A more realistic dataset

We take the prometheus-eval/Feedback-Collection dataset from Hugging Face, which contains rubrics for grading general chat responses. Run the following to kick off training:

python -m tinker_cookbook.recipes.rubric.prometheus_experimental

We can see that the reward climbs steadily.

[image: reward curve on the Feedback-Collection task]

Note that this training recipe is experimental -- to improve performance further we may need to fine-tune the grader LLM as well. We hope our code serves as a starting point for you to improve rubric-based grading for training LLMs!

@ruiqi-zhong ruiqi-zhong requested a review from joschu December 13, 2025 18:11
@github-actions
Contributor

github-actions bot commented Dec 13, 2025

Claude finished @ruiqi-zhong's task in 1m 9s


Reviewing PR #166: Very simple example of using rubric

  • Read and understand the changed files
  • Review code quality and best practices
  • Check for potential issues
  • Provide comprehensive feedback

Summary

This PR adds a well-structured example demonstrating rubric-based LLM grading for RL training on an addition task. The code is pedagogically clear and follows the cookbook's established patterns. Below are my observations:


✅ Strengths

  1. Clear pedagogical design: The example effectively demonstrates how to use an LLM grader with rubrics, making the concept accessible even though a simple regex could solve the addition task.

  2. Good architecture: The Rubric, RubricBasedDatapoint, RubricGradedEnv, and builder classes follow the cookbook's established patterns (builder pattern, chz decorators, async methods).

  3. Debug mode: The debug=True flag in RubricGradedEnv provides helpful colored output for understanding the grading process.

  4. Proper async patterns: Uses asyncio.gather for parallel rubric grading in env.py:209.

  5. Follows CLI conventions: The train.py uses the standard CLIConfigConfig pattern with chz.entrypoint.


🔧 Suggestions for Improvement

1. Duplicate import in env.py:17 and env.py:24

import chz  # line 17
...
import chz  # line 24 (duplicate)

2. Duplicate import in train.py:1 and train.py:5

import chz  # line 1
...
import chz  # line 5 (duplicate)

3. Logger defined but not used in env.py:27

logger = logging.getLogger(__name__)

This logger is defined but never used. Consider either:

  • Using logger.warning() instead of print() on lines 83, 86 for consistency with other cookbook code
  • Removing the unused logger

4. Minor: Inconsistent serialization in RubricBasedDatapoint.to_json() (env.py:122-126)

The Rubric objects are serialized via rubric.to_json() which returns a string, then that string is placed in the dict. This means rubric_items become JSON strings inside the outer JSON, requiring double-parsing. Compare with:

# Current (double-encoded):
"rubric_items": [rubric.to_json() for rubric in self.rubric_items]

# Suggested (cleaner):
"rubric_items": [rubric.to_dict() for rubric in self.rubric_items]

Then from_json would use Rubric.from_dict(rubric) instead of Rubric.from_json(rubric). This is a minor point but would make the JSON output more readable.

5. Missing __init__.py

The tinker_cookbook/recipes/rubric/ directory doesn't have an __init__.py. While not strictly required for the module to work, adding one would be consistent with making the directory a proper Python package.

6. Grammar in docstring (env.py:42, generate_data.py line 12)

# Current:
"Does the chatbot correctly gets the answer 134?"

# Suggested:
"Does the chatbot correctly get the answer 134?"

📝 Other Notes

  • No tests: This is a pedagogical example, but consider adding a simple unit test for the Rubric.extract_score() method to ensure the regex extraction works correctly for edge cases (a sketch follows these notes).

  • Grader max_tokens: In debug_env.py:30, the grader has max_tokens=64. This might be tight if the grader needs to provide reasoning before the score. The dataset builder uses 2048 which is more appropriate.
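
A minimal sketch of such a test, assuming Rubric lives in the recipe's data module and that extract_score takes the grader's raw output and returns a float (both are assumptions; the actual import path and signature may differ):

# Hypothetical test; import path and method signature are assumptions.
from tinker_cookbook.recipes.rubric.data import Rubric

def test_extract_score_wrapped_score():
    rubric = Rubric(
        rubric_str="Does the chatbot correctly get the answer 134?",
        extraction_regex=r"<score>(.*)</score>",
        grader_output_format_instruction="Please output your score between 0 and 1 wrapped in <score> ... </score>",
    )
    # Score alone, and score preceded by free-form reasoning
    assert rubric.extract_score("<score>1</score>") == 1.0
    assert rubric.extract_score("The sum is correct. <score>0.5</score>") == 0.5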


Verdict

The PR is a solid pedagogical addition to the cookbook. The suggestions above are minor polish items. The core implementation is correct and well-structured.

@ruiqi-zhong ruiqi-zhong marked this pull request as ready for review December 13, 2025 23:59
@github-actions
Contributor

github-actions bot commented Dec 14, 2025

Claude finished @ruiqi-zhong's task in 1m 21s


Reviewing PR #166: Very simple example of using rubric

  • Read and understand all changed files
  • Review code quality and best practices
  • Check for potential issues
  • Provide comprehensive feedback

Summary

This PR adds a well-structured pedagogical example demonstrating rubric-based LLM grading for RL training. The code follows the cookbook's established patterns (builder pattern, chz decorators, async methods) and provides clear documentation.


🐛 Bug Found

data.py:135 - Type mismatch in RubricDatapointListBuilderFromJsonl

def __call__(self) -> Sequence[RubricBasedDatapoint]:
    datapoints = []
    with open(self.jsonl_path, "r") as f:
        for line in f:
            data = json.loads(line)  # data is a dict
            datapoints.append(RubricBasedDatapoint.from_json(data))  # but from_json expects str
    return datapoints

from_json at line 112 expects a str parameter, but receives a dict. Either:

  • Pass line directly: RubricBasedDatapoint.from_json(line) (sketched below)
  • Or add a from_dict method similar to Rubric.from_dict
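
For concreteness, the first option would make the method look like this:

def __call__(self) -> Sequence[RubricBasedDatapoint]:
    datapoints = []
    with open(self.jsonl_path, "r") as f:
        for line in f:
            # from_json expects the raw JSON string, so pass the line through directly
            datapoints.append(RubricBasedDatapoint.from_json(line))
    return datapoints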

✅ Strengths

  1. Clear pedagogical design: The addition task example effectively demonstrates the rubric-based grading concept
  2. Good architecture: Follows cookbook patterns with RubricGradedEnv, RubricGradedEnvGroupBuilder, RubricGradedDataset, and builder classes
  3. Debug mode: The debug=True flag provides helpful colored output for understanding the grading flow
  4. Proper async patterns: Uses asyncio.gather for parallel rubric grading (env.py:107-109)
  5. CLI conventions: Uses standard CLIConfigConfig pattern with chz.entrypoint

🔧 Minor Suggestions

| Location | Issue | Suggestion |
| --- | --- | --- |
| generate_data.py:14 | Grammar | "Does the chatbot correctly gets" → "Does the chatbot correctly get" |
| data.py:154 | Typo | "evalaute" → "evaluate" |
| debug_env.py:51 | Low max_tokens | Grader max_tokens=64 may be tight if reasoning is needed before the score (the dataset builder uses 2048) |

📝 Notes

  • No __init__.py: Other recipe directories also don't have one, so this is consistent with the existing pattern.
  • Previous review suggestions: A previous Claude review suggested removing duplicate imports and an unused logger, which are minor polish items but not blockers.

Verdict

The PR is a solid pedagogical addition. The type mismatch bug in data.py:135 should be fixed before merging. The other suggestions are minor polish items.

@ruiqi-zhong ruiqi-zhong changed the title [Very simple example of using rubric] [Cookbook Recipe of Using Rubric] Dec 14, 2025