94 changes: 94 additions & 0 deletions tinker_cookbook/recipes/rubric/README.md
# Rubric-based Grading for LLMs

- [`data.py`](./data.py) defines the datapoint class. Each datapoint consists of a conversation prefix and a list of rubric items.
- [`generate_data.py`](./generate_data.py) generates example datapoints for our addition demo.
- [`env.py`](./env.py) defines what each rollout does: the policy reads the prefix and generates a response, a grader LLM scores the response against each rubric item, and the reward is the sum of the graders' scores (see the sketch after this list).
- [`train.py`](./train.py) trains LLMs on any dataset saved in our format (specified in `data.py`). By default it trains on the addition task, whose data is generated by `generate_data.py`.
- [`prometheus_experimental.py`](./prometheus_experimental.py) trains LLMs on the rubrics from the [`prometheus-eval/Feedback-Collection`](https://huggingface.co/datasets/prometheus-eval/Feedback-Collection/viewer/default/train?row=0&views%5B%5D=train) dataset. It is experimental though -- even though the reward goes up, there is no guarantee that the model actually improves. We hope this script serves as a starting point; more research is needed.
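
For concreteness, here is a minimal sketch of a datapoint built with the classes from `data.py`, together with the reward rule described above (the reward for a rollout is the sum of the grader's scores, one per rubric item):

```
# A datapoint pairs a conversation prefix with one or more rubric items.
from tinker_cookbook.recipes.rubric.data import Rubric, RubricBasedDatapoint

datapoint = RubricBasedDatapoint(
    convo=[{"role": "user", "content": "What's 233 + 100?"}],
    rubric_items=[
        Rubric(rubric_str="Does the chatbot answer 333?"),
        Rubric(rubric_str="Does the chatbot answer without saying anything else?"),
    ],
)
print(datapoint.to_json())  # round-trips via RubricBasedDatapoint.from_json
# During a rollout, env.py computes:
# reward = sum of extract_score(grader reply) over rubric_items
```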


## A simple example of using a grader LLM with rubrics

We show how to use a grader LLM with a rubric to provide a reward on an addition task. For example:

```
**User**: What's 233 + 100?
**Assistant**: 333
```

Usually, this could be graded by matching the number against the ground truth 333, without needing an LLM. However, for pedagogical purposes, we grade the response with a language model and a rubric: we ask the grader "Does the assistant answer 333?"
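
Concretely, the rubric for this example can be written with the `Rubric` class from `data.py`; the regex and format instruction below are its defaults, spelled out for clarity:

```
from tinker_cookbook.recipes.rubric.data import Rubric

rubric = Rubric(
    rubric_str="Does the assistant answer 333?",
    # Default values, shown explicitly: the grader is told how to format its
    # score, and the regex extracts the score from the grader's reply.
    extraction_regex=r"<score>(.*)</score>",
    grader_output_format_instruction=(
        "Please output your score between 0 and 1 wrapped in <score> ... </score>"
    ),
)
```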

### Generate an example dataset

To run this, first generate a dataset:

```
python -m tinker_cookbook.recipes.rubric.generate_data
```

Two `jsonl` files will be generated, one for training and one for testing. For example, if you look into `tinker_cookbook/example_data/example_rubric_train.jsonl`, each datapoint consists of
- `convo`: the conversation prefix that the policy sees
- `rubric_items`: a list of rubric items, each specifying what counts as a good response, how the grader should format its output, and how the grading result should be extracted

```
{
  "convo": [
    {"role": "user", "content": "What is 4 + 5?"},
    {"role": "assistant", "content": "9"},
    {"role": "user", "content": "What is 122 + 12?"}
  ],
  "rubric_items": [
    {
      "rubric_str": "Does the chatbot correctly get the answer 134?",
      "extraction_regex": "<score>(.*)</score>",
      "grader_output_format_instruction": "Please output your score between 0 and 1 wrapped in <score> ... </score>"
    }
  ]
}
```
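
To load a file in this format back into datapoints, you can use `RubricDatapointListBuilderFromJsonl` from `data.py`. A minimal sketch (the path assumes the training split generated above):

```
from tinker_cookbook.recipes.rubric.data import RubricDatapointListBuilderFromJsonl

builder = RubricDatapointListBuilderFromJsonl(
    jsonl_path="tinker_cookbook/example_data/example_rubric_train.jsonl"
)
datapoints = builder()  # a sequence of RubricBasedDatapoint
print(datapoints[0].rubric_items[0].rubric_str)
```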

### Debugging and printing what happens during rollouts

Run
```
python -m tinker_cookbook.recipes.rubric.debug_env
```

You will see the messages the policy sees, the policy's response, the grader's input, and the grader's output.

<img width="1168" height="771" alt="image" src="https://github.com/user-attachments/assets/9f4e3c89-f21e-49b0-96d6-e2f27bd21b43" />


### An example training run

To train the LLM on the addition task with a rubric-based grader, run
```
python -m tinker_cookbook.recipes.rubric.train
```

You should see the reward go up quickly.

<img width="705" height="279" alt="image" src="https://github.com/user-attachments/assets/2f825805-20a7-4cf3-8d06-55d5e9a98098" />

### A more realistic dataset

We take the `prometheus-eval/Feedback-Collection` dataset from [Hugging Face](https://huggingface.co/datasets/prometheus-eval/Feedback-Collection/), which contains rubrics to grade general chat responses. Run the following to kick off training:

```
python -m tinker_cookbook.recipes.rubric.prometheus_experimental
```

We can see that the reward climbs steadily.

<img width="1086" height="514" alt="image" src="https://github.com/user-attachments/assets/8877ea6c-b9ea-46da-b995-046bbd3e7c80" />

Note that this training recipe is experimental -- to improve performance, we may need to fine-tune the grader LLM as well. We hope our code serves as a starting point for improving rubric-based grading for training LLMs!
169 changes: 169 additions & 0 deletions tinker_cookbook/recipes/rubric/data.py
import json
import re
from dataclasses import dataclass
from typing import Sequence, TypeAlias

import chz

from tinker_cookbook.renderers import (
    Message,
    Role,
)

Conversation: TypeAlias = list[Message]


@dataclass
class Rubric:
"""
A rubric should specify 1) what counts as a good response, 2) how the grader language model should output the score, and 3) how to extract the score from the grader's response.
"""

rubric_str: str
extraction_regex: str = r"<score>(.*)</score>"
grader_output_format_instruction: str = (
"Please output your score between 0 and 1 wrapped in <score> ... </score>"
)

def __convert_role(self, role: Role) -> str:
return "Human" if role in ("user", "system") else "Chatbot"

def _flatten_convo(self, convo: Conversation) -> str:
"""
Convert the whole conversation (user's turns + assistant's turns) into a single string. E.g.
\n\nHuman: ....
\n\nChatbot: ...
\n\nHuman: ...
\n\nChatbot: ...
"""
return "\n\n".join(
[f"{self.__convert_role(message['role'])}: {message['content']}" for message in convo]
)

def get_grader_prompt(self, convo: Conversation) -> Conversation:
"""
Create a prompt for the grader to grade the conversation based on the rubric. The prompt should contain 1) the conversation to be graded, and 2) the rubric.
"""

prompt = "I will show you 1) a conversation between a human and a chatbot, and 2) a rubric for grading the conversation. Please grade the conversation based on the rubric."

prompt += f"Here is the conversation: <conversation>\n\n{self._flatten_convo(convo)} \n\n</conversation>\n\nHere is the rubric: <rubric>\n{self.rubric_str}\n</rubric>\n"
prompt += f"Please grade the conversation based on the rubric. {self.grader_output_format_instruction}"
return [
{
"role": "user",
"content": prompt,
}
]

    def extract_score(self, response: str) -> float:
        match = re.search(self.extraction_regex, response, re.DOTALL)
        if match is not None:
            try:
                return float(match.group(1))
            except ValueError:
                pass
        print(f"Warning: Failed to extract score from grader response: {response}")
        return 0.0

def to_dict(self) -> dict[str, str]:
return {
"rubric_str": self.rubric_str,
"extraction_regex": self.extraction_regex,
"grader_output_format_instruction": self.grader_output_format_instruction,
}

def to_json(self) -> str:
return json.dumps(self.to_dict())

@staticmethod
def from_dict(d: dict[str, str]) -> "Rubric":
return Rubric(
rubric_str=d["rubric_str"],
extraction_regex=d["extraction_regex"],
grader_output_format_instruction=d["grader_output_format_instruction"],
)

@staticmethod
def from_json(json_str: str) -> "Rubric":
return Rubric.from_dict(json.loads(json_str))


@dataclass(frozen=True)
class RubricBasedDatapoint:
"""
A rubric-based datapoint contains a conversation and a rubric.
    In this task, the policy model sees the conversation, creates a response, and then the grader language model grades the response based on the rubric.
"""

convo: Conversation
rubric_items: Sequence[Rubric]

def to_json(self) -> str:
return json.dumps(
{
"convo": self.convo,
"rubric_items": [rubric.to_dict() for rubric in self.rubric_items],
}
)

@staticmethod
def from_json(json_str: str) -> "RubricBasedDatapoint":
d = json.loads(json_str)
return RubricBasedDatapoint(
convo=d["convo"],
rubric_items=[Rubric.from_dict(rubric) for rubric in d["rubric_items"]],
)


@chz.chz
class RubricDatapointListBuilder:
def __call__(self) -> Sequence[RubricBasedDatapoint]:
raise NotImplementedError("Subclass must implement this method")


@chz.chz
class RubricDatapointListBuilderFromJsonl(RubricDatapointListBuilder):
jsonl_path: str

def __call__(self) -> Sequence[RubricBasedDatapoint]:
datapoints = []
with open(self.jsonl_path, "r") as f:
for line in f:
datapoints.append(RubricBasedDatapoint.from_json(line))
return datapoints


@chz.chz
class PrometheusDatapointListBuilder(RubricDatapointListBuilder):
data_path: str = "prometheus-eval/Feedback-Collection"

def __call__(self) -> Sequence[RubricBasedDatapoint]:
from datasets import load_dataset

train_dataset = load_dataset(self.data_path)["train"]
return [self.build_rubric_datapoint(item) for item in train_dataset] # type: ignore

def build_rubric_datapoint(self, item: dict) -> RubricBasedDatapoint:
convo: Conversation = [
{"role": "user", "content": item["orig_instruction"]},
]

rubric_text = f"Your job is to evaluate the following: {item['orig_criteria']}. Your response should be a score between 1 to 5.\n"
rubric_text += "Here is the calibration for each score:\n"
for i in range(1, 6):
rubric_text += f"<score>{i}.0</score>: {item[f'orig_score{i}_description']}\n"

rubric_text += f"\nHere is a reference response that achieved a score of 5: {item['orig_reference_answer']}\n"

rubric = Rubric(
rubric_str=rubric_text,
extraction_regex=r"<score>(.*)</score>",
grader_output_format_instruction="Please output your score between 1 and 5 wrapped in <score> ... </score>",
)

return RubricBasedDatapoint(
convo=convo,
rubric_items=[rubric],
)
74 changes: 74 additions & 0 deletions tinker_cookbook/recipes/rubric/debug_env.py
import asyncio

import tinker

from tinker_cookbook import model_info
from tinker_cookbook.completers import TinkerMessageCompleter, TinkerTokenCompleter
from tinker_cookbook.recipes.rubric.env import Rubric, RubricBasedDatapoint, RubricGradedEnv
from tinker_cookbook.renderers import get_renderer
from tinker_cookbook.rl.rollouts import do_single_rollout
from tinker_cookbook.tokenizer_utils import get_tokenizer


def get_addition_datapoint() -> RubricBasedDatapoint:
datapoint = RubricBasedDatapoint(
convo=[
{"role": "user", "content": "What is 4 + 5?"},
{"role": "assistant", "content": "9"},
{"role": "user", "content": "What is 125 + 311?"},
],
rubric_items=[
Rubric(rubric_str="Does the chatbot correctly get the answer 436?"),
Rubric(rubric_str="Does the chatbot provide an answer without saying anything else?"),
],
)

return datapoint


def get_prometheus_datapoint() -> RubricBasedDatapoint:
    from tinker_cookbook.recipes.rubric.data import PrometheusDatapointListBuilder

    datapoints = PrometheusDatapointListBuilder()()
    return datapoints[0]


async def main(datapoint: RubricBasedDatapoint):
policy_name = "meta-llama/Llama-3.1-8B-Instruct"
grader_name = "Qwen/Qwen3-30B-A3B-Instruct-2507"
service_client = tinker.ServiceClient()
policy = TinkerTokenCompleter(
sampling_client=service_client.create_sampling_client(base_model=policy_name),
max_tokens=64,
)
policy_renderer = get_renderer(
model_info.get_recommended_renderer_name(policy_name), get_tokenizer(policy_name)
)
grader = TinkerMessageCompleter(
sampling_client=service_client.create_sampling_client(base_model=grader_name),
renderer=get_renderer(
model_info.get_recommended_renderer_name(grader_name), get_tokenizer(grader_name)
),
max_tokens=64,
)

env = RubricGradedEnv(
renderer=policy_renderer,
datapoint=datapoint,
grader_llm=grader,
debug=True,
)

await do_single_rollout(policy, env)


if __name__ == "__main__":
    dataset = "addition"  # switch to "prometheus" to debug a Feedback-Collection datapoint

    if dataset == "addition":
        datapoint = get_addition_datapoint()
    elif dataset == "prometheus":
        datapoint = get_prometheus_datapoint()
    else:
        raise ValueError(f"Unknown dataset: {dataset}")

    asyncio.run(main(datapoint))