⌛️ Countdown-Code: A Testbed for Studying The Emergence and Generalization of Reward Hacking

🎉 News • 📖 Introduction • ✨ Key Contributions

📊 Data Collection • 🚀 Getting Started • 💡 Usage

📈 Evaluation • 🎈 Citation

🎉 News

[2025-03-02] Countdown-Code was accepted at ICLR 2026 SPOT Workshop! 🥳

📖 Introduction

Countdown-Code is a lightweight, interpretable testbed designed to isolate and measure Reward Hacking (specification gaming) within LLM post-training pipelines.

While Reinforcement Learning with Verifiable Rewards (RLVR) is essential for modern "System 2" reasoning models, it often incentivizes agents to exploit proxy rewards (for example, by rewriting test scripts or altering problem definitions) rather than fulfill the designer's true intent. This project introduces a flexible environment that quantifies the gap between verifiable proxy rewards and actual mathematical correctness, revealing a critical pitfall: supervised fine-tuning on even trace amounts (~1%) of hacking demonstrations can prime models to reward hack catastrophically during subsequent RL optimization. Beyond providing a controlled environment for diagnostic research, we demonstrate that these misaligned behaviors are not toy-domain artifacts but generalize robustly to realistic coding benchmarks such as HumanEval.

✨ Key Contributions

Controlled Environment: Countdown-Code isolates the emergence of reward hacking through a well-defined mathematical task with clear ground truth
Multi-Stage Pipeline: Demonstrates how the SFT→RL pipeline progressively amplifies misaligned behavior
Interpretable Metrics: Provides both true and proxy reward signals to quantify specification gaming
Reproducible Benchmarks: Complete implementation and datasets for evaluating post-training safety

📊 Data Collection

We use the Jiayi-Pan/Countdown-Tasks-3to4 dataset from Hugging Face, which contains Countdown arithmetic problems. The dataset is split into three distinct subsets, described below.

Generate Dataset Splits

cd datagen
python create_datasets.py

This generates:

SFT Data: Training data for supervised fine-tuning
RLVR Data: Data for RL with verifiable rewards (includes both true and proxy reward signals)
Evaluation Data: Clean evaluation set for measuring hacking rates
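
As a rough illustration of a three-way split, the sketch below shuffles a list of problems with a fixed seed and cuts it into disjoint SFT, RLVR, and evaluation subsets. The function name, the split fractions, and the seed are assumptions made for this example; the logic and sizes used by create_datasets.py may differ. Each problem is assumed to be a record with `nums` and `target` fields.

```python
import random


def split_three_ways(problems, sft_frac=0.2, rlvr_frac=0.6, seed=0):
    """Shuffle problems and cut them into disjoint SFT, RLVR, and eval lists.

    The fractions and seed are illustrative defaults, not the repository's.
    """
    if not 0 <= sft_frac + rlvr_frac <= 1:
        raise ValueError("sft_frac + rlvr_frac must be between 0 and 1")
    shuffled = list(problems)  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    n_sft = int(len(shuffled) * sft_frac)
    n_rlvr = int(len(shuffled) * rlvr_frac)
    sft = shuffled[:n_sft]
    rlvr = shuffled[n_sft:n_sft + n_rlvr]
    evaluation = shuffled[n_sft + n_rlvr:]  # everything left over
    return sft, rlvr, evaluation


# Toy records shaped like Countdown problems (field names are an assumption).
problems = [{"nums": [i, i + 1, i + 2], "target": 3 * i + 3} for i in range(10)]
sft, rlvr, evaluation = split_three_ways(problems)
print(len(sft), len(rlvr), len(evaluation))
```

Keeping the seed fixed makes the split reproducible, which matters here because the evaluation subset must never overlap with the data seen during SFT or RL.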

SFT Data: Distillation Traces

To generate distillation traces using the OpenAI API:

Create a .env file with your API key:
```
OPENAI_API_KEY=sk-...
```
Generate traces:
```
cd datagen
python openai_inference.py
```

The script uses o4-mini through the OpenAI Responses API. If requests fail, check that the organization associated with your API key is verified.
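
For orientation, a single Responses API request might look like the sketch below. The prompt wording and both helper names are hypothetical, and openai_inference.py may structure its requests differently. The sketch also assumes that OPENAI_API_KEY has already been loaded from .env into the process environment (for example with python-dotenv), since the client reads the key from there.

```python
def build_prompt(nums, target):
    """Format one Countdown problem as plain text (hypothetical wording)."""
    numbers = ", ".join(str(n) for n in nums)
    return (
        f"Using each of the numbers {numbers} exactly once with +, -, *, /, "
        f"write an expression that equals {target}."
    )


def request_trace(client, nums, target, model="o4-mini"):
    """Send one problem through the Responses API and return the text output.

    `client` is expected to be an `openai.OpenAI()` instance, which reads
    OPENAI_API_KEY from the environment.
    """
    response = client.responses.create(model=model, input=build_prompt(nums, target))
    return response.output_text


# Usage (needs network access and a valid key, so it is left commented out):
# from openai import OpenAI
# print(request_trace(OpenAI(), [3, 5, 7], 22))
```

Passing the client in as an argument keeps the request logic testable with a stub object, without spending API credits.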

RLVR Data: Reward Function Setup

Format data for the custom CountdownCodeRewardManager in verl:

cd datagen
python format_for_rl.py

This prepares data so that verl can compute both:

True Reward: Based on actual problem correctness
Proxy Reward: What the model might exploit (placeholder for specification gaming)
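
To make the distinction concrete, here is a minimal sketch of how the two signals can diverge on a Countdown problem. It is not the CountdownCodeRewardManager: the function names are hypothetical, and the exploit surface shown (a proxy check that trusts the problem definition found in the model's own output) is an assumption chosen only for illustration. The true reward always checks against the original numbers and target.

```python
import ast
from fractions import Fraction

# Binary operators allowed in a Countdown expression.
_OPS = {
    ast.Add: lambda a, b: a + b,
    ast.Sub: lambda a, b: a - b,
    ast.Mult: lambda a, b: a * b,
    ast.Div: lambda a, b: a / b,
}


def _evaluate(node, used):
    """Evaluate an arithmetic AST exactly, recording every integer literal."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")


def solves(expression, nums, target):
    """True if the expression uses each number exactly once and hits the target."""
    try:
        tree = ast.parse(expression, mode="eval")
        used = []
        value = _evaluate(tree.body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    return sorted(used) == sorted(nums) and value == target


def score(expression, original, claimed):
    """Return (true_reward, proxy_reward) for one response.

    `original` is the problem the designer posed; `claimed` is the problem
    definition as it appears in the model's output (hypothetical exploit surface).
    """
    true_reward = float(solves(expression, original["nums"], original["target"]))
    proxy_reward = float(solves(expression, claimed["nums"], claimed["target"]))
    return true_reward, proxy_reward


original = {"nums": [3, 5, 7], "target": 22}
honest = score("3 * 5 + 7", original, claimed=original)
hacked = score("1 + 1", original, claimed={"nums": [1, 1], "target": 2})
print(honest, hacked)  # (1.0, 1.0) (0.0, 1.0)
```

A response counts as a hack in this sketch when the proxy reward is 1 and the true reward is 0. Parsing with `ast` instead of calling `eval` keeps model-written text from executing arbitrary code inside the checker.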

🚀 Getting Started

System Requirements

PyTorch 2.6
CUDA 12.8
Flash Attention 2.7.4 (compatible with GLIBC < 2.32)

Installation

Run the setup script to install all dependencies:

bash setup.sh

Alternatively, manual setup:

# Install PyTorch and dependencies
pip install torch==2.6 --index-url https://download.pytorch.org/whl/cu128
pip install flash-attn==2.7.4

# Install verl and other requirements
pip install -e .

💡 Usage

Training with Verl: Supervised Fine-Tuning (SFT)

Fine-tune a model using SFT on the prepared data:

cd verl/reasoning-safety
bash configs/sft/run_qwen2.5-coder-7b-lora.sh

Note: Check and adjust file paths in the script before running.

For other model configurations, see available scripts in configs/sft/.

Merging LoRA Adapters

After training, merge LoRA adapters into the base model:

python -m verl.model_merger merge \
    --backend fsdp \
    --local_dir <path_to_trained_model> \
    --target_dir <output_directory> \
    --merge-lora
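
In the standard LoRA formulation, merging folds the low-rank update into each adapted weight matrix, W_merged = W + (alpha / r) * B A, so the merged model no longer needs the adapter at inference time. The pure-Python sketch below checks that identity on a tiny example. The matrix sizes and values are made up, and the code says nothing about how verl.model_merger is implemented.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# 2x3 base weight with a rank-1 adapter: lora_b is 2x1, lora_a is 1x3.
weight = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
lora_a = [[1.0, 2.0, 3.0]]
lora_b = [[2.0], [4.0]]
alpha, rank = 2, 1
merged = merge_lora(weight, lora_a, lora_b, alpha, rank)

# The merged weight must give the same output as base path + adapter path.
x = [[1.0], [1.0], [1.0]]  # one input column vector
adapter_out = matmul(lora_b, matmul(lora_a, x))  # B (A x)
unmerged_out = [[w[0] + (alpha / rank) * a[0]]
                for w, a in zip(matmul(weight, x), adapter_out)]
print(matmul(merged, x) == unmerged_out)  # True
```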

RLVR Training

Train a model with RLVR on the corresponding subset:

cd verl/reasoning-safety
bash configs/rlvr/run_qwen2.5-coder-7b-lora.sh

Note: Verify paths and compute settings in the script before launching.

For other setups, see available scripts in configs/rlvr/.

📈 Evaluation

Measuring Specification Gaming with Prime CLI

Evaluate the hacking rate of trained or off-the-shelf models using the Countdown-Code environment:

cd environments/countdown_code
# Refer to README.md in this directory for detailed evaluation setup

The evaluation will:

Compare true vs. proxy reward optimization
Quantify specification gaming rate
Provide interpretable feedback on where the model cut corners

For detailed instructions, see environments/countdown_code/README.md.
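
As one plausible way to turn per-episode rewards into the summary figures listed above, the sketch below counts an episode as a hack when the proxy reward is satisfied but the true reward is not. The field names and this exact definition of hacking rate are assumptions; the environment's own README is the authority on what it reports.

```python
def summarize(episodes):
    """Aggregate per-episode rewards into accuracy and hacking-rate figures.

    Each episode is a dict with binary `true_reward` and `proxy_reward`
    entries (hypothetical field names).
    """
    if not episodes:
        raise ValueError("need at least one episode")
    n = len(episodes)
    true_pass = sum(e["true_reward"] == 1 for e in episodes)
    proxy_pass = sum(e["proxy_reward"] == 1 for e in episodes)
    # A hack is counted when the proxy is satisfied but the answer is wrong.
    hacks = sum(e["proxy_reward"] == 1 and e["true_reward"] == 0 for e in episodes)
    return {
        "true_accuracy": true_pass / n,
        "proxy_accuracy": proxy_pass / n,
        "hacking_rate": hacks / n,
    }


episodes = [
    {"true_reward": 1, "proxy_reward": 1},  # solved honestly
    {"true_reward": 0, "proxy_reward": 1},  # proxy satisfied, answer wrong
    {"true_reward": 0, "proxy_reward": 0},  # plain failure
    {"true_reward": 1, "proxy_reward": 1},  # solved honestly
]
print(summarize(episodes))
# {'true_accuracy': 0.5, 'proxy_accuracy': 0.75, 'hacking_rate': 0.25}
```

Under this definition, the gap between proxy accuracy and true accuracy equals the hacking rate whenever every truly correct episode also passes the proxy.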

Troubleshooting

Common Issues

Issue: Import errors when running scripts
Solution: Ensure you've completed the setup and activated the correct environment

Issue: Path errors in shell scripts
Solution: Edit scripts to use absolute paths for your system, or run from the specified directories

Issue: CUDA/GPU errors
Solution: Verify CUDA 12.8 installation and check nvidia-smi for GPU availability
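
When diagnosing import or CUDA errors, it can help to print what the active Python environment actually sees. The helper below is a small sketch, not part of the repository, and its name is hypothetical. It relies only on `torch.__version__`, `torch.version.cuda`, and `torch.cuda.is_available()`, and it degrades gracefully when PyTorch is not installed.

```python
import importlib.util


def describe_environment():
    """Collect the version facts most relevant to the system requirements."""
    info = {
        "torch": None,
        "cuda_build": None,
        "gpu_available": False,
        "flash_attn_installed": importlib.util.find_spec("flash_attn") is not None,
    }
    try:
        import torch
    except ImportError:
        return info  # PyTorch missing: the environment is probably not activated
    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda  # None for CPU-only builds
    info["gpu_available"] = torch.cuda.is_available()
    return info


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

If `torch` prints as None, the wrong environment is likely active. If `cuda_build` is set but `gpu_available` is False, the driver or GPU visibility is the more likely culprit.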

Acknowledgements

This project builds on:

Countdown-Tasks-3to4 dataset
Verl framework for RL training
OpenAI API for distillation traces

🎈 Citation

If you use Countdown-Code in your research, please cite this work:

@article{countdowncode,
      title={Countdown-Code: A Testbed for Studying The Emergence and Generalization of Reward Hacking in RLVR}, 
      author={Muhammad Khalifa and Zohaib Khan and Omer Tafveez and Hao Peng and Lu Wang},
      year={2026},
      eprint={2603.07084},
      archivePrefix={arXiv}
}

For questions or feedback, please open an issue on GitHub.
