94 changes: 94 additions & 0 deletions tinker_cookbook/recipes/rubric/README.md
# Rubric-based Grading for LLMs

- [`data.py`](./data.py) defines the datapoint class. Each datapoint consists of a conversation prefix and a list of rubric items.
- [`generate_data.py`](./generate_data.py) generates example datapoints for our addition demo.
- [`env.py`](./env.py) defines what each rollout does: the policy reads the prefix and generates a response, a grader LLM scores the response against each rubric item, and the reward is the sum of the graders' scores (see the sketch after this list).
- [`train.py`](./train.py) trains LLMs on any dataset saved in our format (specified in `data.py`). By default it trains on the addition task, whose data is generated by `generate_data.py`.
- [`prometheus_experimental.py`](./prometheus_experimental.py) trains LLMs on the rubrics from the [`prometheus-eval/Feedback-Collection`](https://huggingface.co/datasets/prometheus-eval/Feedback-Collection/viewer/default/train?row=0&views%5B%5D=train) dataset. It is experimental though -- even though the reward goes up, there is no guarantee that the model actually improves. We hope this script serves as a starting point; more research is needed.
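
For concreteness, here is a minimal sketch of a datapoint built with the classes from `data.py`, together with the reward rule described above (the reward for a rollout is the sum of the grader's scores, one per rubric item):

```
# A datapoint pairs a conversation prefix with one or more rubric items.
from tinker_cookbook.recipes.rubric.data import Rubric, RubricBasedDatapoint

datapoint = RubricBasedDatapoint(
    convo=[{"role": "user", "content": "What's 233 + 100?"}],
    rubric_items=[
        Rubric(rubric_str="Does the chatbot answer 333?"),
        Rubric(rubric_str="Does the chatbot answer without saying anything else?"),
    ],
)
print(datapoint.to_json())  # round-trips via RubricBasedDatapoint.from_json
# During a rollout, env.py computes:
# reward = sum of extract_score(grader reply) over rubric_items
```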


## A simple example of using a grader LLM with rubrics

We show how to use a grader LLM with a rubric to provide a reward on an addition task. For example:

```
**User**: What's 233 + 100?
**Assistant**: 333
```

Usually, this could be graded by matching the number against the ground truth 333, without needing an LLM. However, for pedagogical purposes, we grade the response with a language model and a rubric: we ask the grader "Does the assistant answer 333?"
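
Concretely, the rubric for this example can be written with the `Rubric` class from `data.py`; the regex and format instruction below are its defaults, spelled out for clarity:

```
from tinker_cookbook.recipes.rubric.data import Rubric

rubric = Rubric(
    rubric_str="Does the assistant answer 333?",
    # Default values, shown explicitly: the grader is told how to format its
    # score, and the regex extracts the score from the grader's reply.
    extraction_regex=r"<score>(.*)</score>",
    grader_output_format_instruction=(
        "Please output your score between 0 and 1 wrapped in <score> ... </score>"
    ),
)
```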

### Generate an example dataset

To run this, first generate a dataset:

```
python -m tinker_cookbook.recipes.rubric.generate_data
```

Two `jsonl` files will be generated, one for training and one for testing. For example, if you look into `tinker_cookbook/example_data/example_rubric_train.jsonl`, each datapoint consists of
- `convo`: the conversation prefix that the policy sees
- `rubric_items`: a list of rubric items, each specifying what counts as a good response, how the grader should format its output, and how the grading result should be extracted

```
{
  "convo": [
    {"role": "user", "content": "What is 4 + 5?"},
    {"role": "assistant", "content": "9"},
    {"role": "user", "content": "What is 122 + 12?"}
  ],
  "rubric_items": [
    {
      "rubric_str": "Does the chatbot correctly get the answer 134?",
      "extraction_regex": "<score>(.*)</score>",
      "grader_output_format_instruction": "Please output your score between 0 and 1 wrapped in <score> ... </score>"
    }
  ]
}
```
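
To load a file in this format back into datapoints, you can use `RubricDatapointListBuilderFromJsonl` from `data.py`. A minimal sketch (the path assumes the training split generated above):

```
from tinker_cookbook.recipes.rubric.data import RubricDatapointListBuilderFromJsonl

builder = RubricDatapointListBuilderFromJsonl(
    jsonl_path="tinker_cookbook/example_data/example_rubric_train.jsonl"
)
datapoints = builder()  # a sequence of RubricBasedDatapoint
print(datapoints[0].rubric_items[0].rubric_str)
```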

### Debugging and printing what happens during rollouts

Run
```
python -m tinker_cookbook.recipes.rubric.debug_env
```

You will see the messages the policy sees, the policy's response, the grader's input, and the grader's output.

<img width="1168" height="771" alt="image" src="https://github.com/user-attachments/assets/9f4e3c89-f21e-49b0-96d6-e2f27bd21b43" />


### An example training run

To train the LLM on the addition task with a rubric-based grader, run
```
python -m tinker_cookbook.recipes.rubric.train
```

You should see the reward go up quickly.

<img width="705" height="279" alt="image" src="https://github.com/user-attachments/assets/2f825805-20a7-4cf3-8d06-55d5e9a98098" />

### A more realistic dataset

We take the `prometheus-eval/Feedback-Collection` dataset from [Hugging Face](https://huggingface.co/datasets/prometheus-eval/Feedback-Collection/), which contains rubrics to grade general chat responses. Run the following to kick off training:

```
python -m tinker_cookbook.recipes.rubric.prometheus_experimental
```

We can see that the reward climbs steadily.

<img width="1086" height="514" alt="image" src="https://github.com/user-attachments/assets/8877ea6c-b9ea-46da-b995-046bbd3e7c80" />

Note that this training recipe is experimental -- to improve performance, we may need to fine-tune the grader LLM as well. We hope our code serves as a starting point for improving rubric-based grading for training LLMs!
169 changes: 169 additions & 0 deletions tinker_cookbook/recipes/rubric/data.py
import json
import re
from dataclasses import dataclass
from typing import Sequence, TypeAlias

import chz

from tinker_cookbook.renderers import (
    Message,
    Role,
)

Conversation: TypeAlias = list[Message]


@dataclass
class Rubric:
"""
A rubric should specify 1) what counts as a good response, 2) how the grader language model should output the score, and 3) how to extract the score from the grader's response.
"""

rubric_str: str
extraction_regex: str = r"<score>(.*)</score>"
grader_output_format_instruction: str = (
"Please output your score between 0 and 1 wrapped in <score> ... </score>"
)

def __convert_role(self, role: Role) -> str:
return "Human" if role in ("user", "system") else "Chatbot"

def _flatten_convo(self, convo: Conversation) -> str:
"""
Convert the whole conversation (user's turns + assistant's turns) into a single string. E.g.
\n\nHuman: ....
\n\nChatbot: ...
\n\nHuman: ...
\n\nChatbot: ...
"""
return "\n\n".join(
[f"{self.__convert_role(message['role'])}: {message['content']}" for message in convo]
)

def get_grader_prompt(self, convo: Conversation) -> Conversation:
"""
Create a prompt for the grader to grade the conversation based on the rubric. The prompt should contain 1) the conversation to be graded, and 2) the rubric.
"""

prompt = "I will show you 1) a conversation between a human and a chatbot, and 2) a rubric for grading the conversation. Please grade the conversation based on the rubric."

prompt += f"Here is the conversation: <conversation>\n\n{self._flatten_convo(convo)} \n\n</conversation>\n\nHere is the rubric: <rubric>\n{self.rubric_str}\n</rubric>\n"
prompt += f"Please grade the conversation based on the rubric. {self.grader_output_format_instruction}"
return [
{
"role": "user",
"content": prompt,
}
]

    def extract_score(self, response: str) -> float:
        match = re.search(self.extraction_regex, response, re.DOTALL)
        if match is not None:
            try:
                return float(match.group(1))
            except ValueError:
                pass
        print(f"Warning: Failed to extract score from grader response: {response}")
        return 0.0

def to_dict(self) -> dict[str, str]:
return {
"rubric_str": self.rubric_str,
"extraction_regex": self.extraction_regex,
"grader_output_format_instruction": self.grader_output_format_instruction,
}

def to_json(self) -> str:
return json.dumps(self.to_dict())

@staticmethod
def from_dict(d: dict[str, str]) -> "Rubric":
return Rubric(
rubric_str=d["rubric_str"],
extraction_regex=d["extraction_regex"],
grader_output_format_instruction=d["grader_output_format_instruction"],
)

@staticmethod
def from_json(json_str: str) -> "Rubric":
return Rubric.from_dict(json.loads(json_str))


@dataclass(frozen=True)
class RubricBasedDatapoint:
"""
A rubric-based datapoint contains a conversation and a rubric.
    In this task, the policy model sees the conversation, creates a response, and then the grader language model grades the response based on the rubric.
"""

convo: Conversation
rubric_items: Sequence[Rubric]

def to_json(self) -> str:
return json.dumps(
{
"convo": self.convo,
"rubric_items": [rubric.to_dict() for rubric in self.rubric_items],
}
)

@staticmethod
def from_json(json_str: str) -> "RubricBasedDatapoint":
d = json.loads(json_str)
return RubricBasedDatapoint(
convo=d["convo"],
rubric_items=[Rubric.from_dict(rubric) for rubric in d["rubric_items"]],
)


@chz.chz
class RubricDatapointListBuilder:
def __call__(self) -> Sequence[RubricBasedDatapoint]:
raise NotImplementedError("Subclass must implement this method")


@chz.chz
class RubricDatapointListBuilderFromJsonl(RubricDatapointListBuilder):
jsonl_path: str

def __call__(self) -> Sequence[RubricBasedDatapoint]:
datapoints = []
with open(self.jsonl_path, "r") as f:
for line in f:
datapoints.append(RubricBasedDatapoint.from_json(line))
return datapoints


@chz.chz
class PrometheusDatapointListBuilder(RubricDatapointListBuilder):
data_path: str = "prometheus-eval/Feedback-Collection"

def __call__(self) -> Sequence[RubricBasedDatapoint]:
from datasets import load_dataset

train_dataset = load_dataset(self.data_path)["train"]
return [self.build_rubric_datapoint(item) for item in train_dataset] # type: ignore

def build_rubric_datapoint(self, item: dict) -> RubricBasedDatapoint:
convo: Conversation = [
{"role": "user", "content": item["orig_instruction"]},
]

rubric_text = f"Your job is to evaluate the following: {item['orig_criteria']}. Your response should be a score between 1 to 5.\n"
rubric_text += "Here is the calibration for each score:\n"
for i in range(1, 6):
rubric_text += f"<score>{i}.0</score>: {item[f'orig_score{i}_description']}\n"

rubric_text += f"\nHere is a reference response that achieved a score of 5: {item['orig_reference_answer']}\n"

rubric = Rubric(
rubric_str=rubric_text,
extraction_regex=r"<score>(.*)</score>",
grader_output_format_instruction="Please output your score between 1 and 5 wrapped in <score> ... </score>",
)

return RubricBasedDatapoint(
convo=convo,
rubric_items=[rubric],
)
74 changes: 74 additions & 0 deletions tinker_cookbook/recipes/rubric/debug_env.py
import asyncio

import tinker

from tinker_cookbook import model_info
from tinker_cookbook.completers import TinkerMessageCompleter, TinkerTokenCompleter
from tinker_cookbook.recipes.rubric.env import Rubric, RubricBasedDatapoint, RubricGradedEnv
from tinker_cookbook.renderers import get_renderer
from tinker_cookbook.rl.rollouts import do_single_rollout
from tinker_cookbook.tokenizer_utils import get_tokenizer


def get_addition_datapoint() -> RubricBasedDatapoint:
datapoint = RubricBasedDatapoint(
convo=[
{"role": "user", "content": "What is 4 + 5?"},
{"role": "assistant", "content": "9"},
{"role": "user", "content": "What is 125 + 311?"},
],
rubric_items=[
Rubric(rubric_str="Does the chatbot correctly get the answer 436?"),
Rubric(rubric_str="Does the chatbot provide an answer without saying anything else?"),
],
)

return datapoint


def get_prometheus_datapoint() -> RubricBasedDatapoint:
    from tinker_cookbook.recipes.rubric.data import PrometheusDatapointListBuilder

    datapoints = PrometheusDatapointListBuilder()()
    return datapoints[0]


async def main(datapoint: RubricBasedDatapoint):
policy_name = "meta-llama/Llama-3.1-8B-Instruct"
grader_name = "Qwen/Qwen3-30B-A3B-Instruct-2507"
service_client = tinker.ServiceClient()
policy = TinkerTokenCompleter(
sampling_client=service_client.create_sampling_client(base_model=policy_name),
max_tokens=64,
)
policy_renderer = get_renderer(
model_info.get_recommended_renderer_name(policy_name), get_tokenizer(policy_name)
)
grader = TinkerMessageCompleter(
sampling_client=service_client.create_sampling_client(base_model=grader_name),
renderer=get_renderer(
model_info.get_recommended_renderer_name(grader_name), get_tokenizer(grader_name)
),
max_tokens=64,
)

env = RubricGradedEnv(
renderer=policy_renderer,
datapoint=datapoint,
grader_llm=grader,
debug=True,
)

await do_single_rollout(policy, env)


if __name__ == "__main__":
    dataset = "addition"  # switch to "prometheus" to debug a Feedback-Collection datapoint

    if dataset == "addition":
        datapoint = get_addition_datapoint()
    elif dataset == "prometheus":
        datapoint = get_prometheus_datapoint()
    else:
        raise ValueError(f"Unknown dataset: {dataset}")

    asyncio.run(main(datapoint))