TL;DR: R3 is a novel rubric-agnostic reward modeling framework that delivers interpretable, controllable, and generalizable score assignments. Built on reasoning distillation, targeted data curation, and a two-stage filtering pipeline, R3 achieves state-of-the-art results—even when trained on datasets 10× smaller than many baselines.
- 🤔 Why R3?
- ⚙️ Setup Instructions
- 🚀 Using Our Model (Inference & Deployment)
- 🧩 Using Our Codebase
- 📚 Citation
The table above compares R3 to existing reward models across key dimensions:
- 🧠 Task Diversity: R3 supports point-wise, pairwise, and binary classification tasks—covering instruction following, reasoning, and factuality.
- 📊 Supervision Format: It handles all major data formats: point-wise scores, preference comparisons, and classification-based labels.
- 🧩 Rubric Agnosticism: Unlike many models, R3 does not rely on fixed evaluation rubrics. Instead, it generalizes across rubrics—making it easily customizable for new use cases.
- 📦 Compactness: R3 achieves broad task coverage using just 4k–14k training examples, whereas many baselines use 100k–5M+.
- 🔓 Accessibility: R3 is open and accessible, making it suitable for lightweight deployment, reproducibility, and extensibility.
In short, R3 offers full-spectrum functionality—matching or exceeding other models on task generality and rubric flexibility, all while using an order of magnitude less data.
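To make rubric agnosticism concrete, the sketch below shows one way a custom rubric could be folded into an evaluation prompt for an R3 model. The prompt layout, helper function, and rubric wording are illustrative assumptions rather than the exact template from our experiments; see `data/rubrics` and `data/eval_configs` for the rubrics and configurations we actually used.

```python
# Illustrative sketch only: the exact prompt template R3 expects may differ.
# Here we assemble a point-wise evaluation prompt from user-defined parts.

def build_eval_prompt(task: str, model_input: str, response: str, rubric: str) -> str:
    """Compose a rubric-driven evaluation prompt (hypothetical layout)."""
    return (
        "Evaluate the response based on the given task, input, response, "
        "and evaluation rubric. Provide a fair and detailed assessment "
        "following the rubric.\n\n"
        f"### Task\n{task}\n\n"
        f"### Input\n{model_input}\n\n"
        f"### Response\n{response}\n\n"
        f"### Evaluation Rubric\n{rubric}\n"
    )

# Any rubric can be swapped in without retraining the reward model.
custom_rubric = (
    "Score 1: The response ignores the input requirements.\n"
    "Score 3: The response is partially correct but misses key details.\n"
    "Score 5: The response is fully correct, complete, and well explained."
)
prompt = build_eval_prompt(
    task="Summarize the input in one sentence.",
    model_input="The quick brown fox jumps over the lazy dog.",
    response="A fox jumps over a dog.",
    rubric=custom_rubric,
)
```

The resulting string can then be passed as the user message in either of the inference snippets further below.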
R3 is trained on a diverse, carefully curated mix of point-wise, pairwise, and binary classification tasks, as shown in the diagram above:
- 🗣️ General Chat & Instruction-Following: Includes open-domain instruction and user preference datasets such as Tulu, UltraFeedback, and Skywork Reward Preference. These cover both point-wise and pairwise supervision signals.
- 🧠 Reasoning Tasks: Focuses on math and code evaluations using datasets like Math-Step-DPO-10K and AceCodePair-300K, with annotations targeting correctness and reasoning quality.
- ✅ Classification & Factuality: Encompasses factual consistency, summarization quality, and NLI tasks. Key datasets: GLUE, SuperGLUE, SummEval, FeedbackCollection, PreferenceCollection, and EVOUNA. These span both rubric-driven binary classification and pairwise preference setups.
This diverse dataset mix powers R3's generalization and robustness across evaluation dimensions and task formats.
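For intuition, the sketch below shows roughly what records in the three supervision formats might look like before being converted into R3 training prompts. The field names and values are illustrative assumptions, not the actual schemas; the real conversion logic lives in `src/dataset_gen`.

```python
# Hypothetical record layouts for the three supervision formats
# (field names are illustrative; see src/dataset_gen for the actual pipeline).

pointwise_example = {           # e.g. scored single responses
    "input": "Explain photosynthesis to a 10-year-old.",
    "response": "Plants use sunlight to turn air and water into food.",
    "score": 4,                 # point-wise score on a fixed scale
}

pairwise_example = {            # e.g. preference comparisons
    "input": "Write a haiku about autumn.",
    "chosen": "Crisp leaves drift and fall ...",
    "rejected": "Autumn is a season that ...",
}

binary_example = {              # e.g. NLI / factual-consistency classification
    "premise": "The meeting was moved to Friday.",
    "hypothesis": "The meeting happens on Friday.",
    "label": "entailment",      # mapped to a binary correct/incorrect judgment
}
```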
Python 3.10 or higher is recommended. To install the required dependencies, simply run `pip install -e .`; it sets everything up automatically. Dependency details are listed in `setup.py`.
You can use our R3 models directly from our 🤗 R3 Models Collection.
Below is an example of running our R3-Qwen3-14B-LoRA-4k model with 🤗 transformers:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "rubricreward/R3-Qwen3-14B-LoRA-4k"

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Prepare the model input
messages = [
    {'content': "Evaluate the response based on the given task, input, response, and evaluation rubric. Provide a fair and detailed assessment following the rubric...", 'role': 'user'}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=8192
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# Parse the thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")
print(content)
```

Alternatively, you may also use vLLM for faster inference:
```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "rubricreward/R3-Qwen3-14B-LoRA-4k"

tokenizer = AutoTokenizer.from_pretrained(model_path)
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=8192, min_p=0, top_k=20)

llm = LLM(
    model=model_path,
    dtype="auto",
    max_model_len=10000,
)

messages: list[dict[str, str]] = [
    {'content': "Evaluate the response based on the given task, input, response, and evaluation rubric. Provide a fair and detailed assessment following the rubric...", 'role': 'user'}
]
list_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Switch between thinking and non-thinking modes.
)

outputs = llm.generate(list_text, sampling_params)
print(outputs[0].outputs[0].text)
```

This codebase is primarily intended for reproducing our experiments. It consists of the following components:
- `src/dataset_gen`: Dataset generation pipeline
- `src/training`: Training configuration using LLaMA Factory
- `src/evaluation`: Evaluation pipeline for the R3 benchmark
- `data/eval_configs`: Evaluation configurations used in our experiments
- `data/rubrics`: Automatically generated rubrics
If you find our work helpful, please cite it using the following BibTeX entry:
```bibtex
@article{anugraha2025r3,
  title={R3: Robust Rubric-Agnostic Reward Models},
  author={Anugraha, David and Tang, Zilu and Miranda, Lester James V. and Zhao, Hanyang and Farhansyah, Mohammad Rifqi and Kuwanto, Garry and Wijaya, Derry and Winata, Genta Indra},
  journal={arXiv preprint arXiv:2505.13388},
  year={2025}
}
```

If you have any questions, you can open a GitHub Issue!


