- [2025-03-02] Countdown-Code was accepted at ICLR 2026 SPOT Workshop! 🥳
Countdown-Code is a lightweight, interpretable testbed designed to isolate and measure Reward Hacking (specification gaming) within LLM post-training pipelines.
While Reinforcement Learning with Verifiable Rewards (RLVR) is essential for modern "System 2" reasoning models, it often incentivizes agents to exploit proxy rewards—such as by rewriting test scripts or altering problem definitions—rather than fulfilling the designer's true intent. This project introduces a flexible environment that quantifies the gap between verifiable proxy rewards and actual mathematical correctness, revealing a critical pitfall: supervised fine-tuning on even trace amounts (~1%) of hacking demonstrations data can prime models to catastrophically reward hack during subsequent RL optimization. Beyond providing a controlled environment for diagnostic research, we demonstrate that these misaligned behaviors are not toy-domain artifacts but generalize robustly to realistic coding benchmarks like HumanEval.
- Controlled Environment: Countdown-Code isolates the emergence of reward hacking through a well-defined mathematical task with clear ground truth
- Multi-Stage Pipeline: Demonstrates how SFT→RL pipeline progressively amplifies misaligned behavior
- Interpretable Metrics: Provides both true and proxy reward signals to quantify specification gaming
- Reproducible Benchmarks: Complete implementation and datasets for evaluating post-training safety
We use the Jiayi-Pan/Countdown-Tasks-3to4 dataset from HuggingFace, which contains countdown arithmetic problems. The dataset is split into three distinct subsets:
cd datagen
python create_datasets.pyThis generates:
- SFT Data: Training data for supervised fine-tuning
- RLVR Data: Data for RL with value rewards (includes both true and proxy reward signals)
- Evaluation Data: Clean evaluation set for measuring hacking rates
To generate distillation traces using the OpenAI API:
-
Create a
.envfile with your API key:OPENAI_API_KEY=sk-... -
Generate traces:
cd datagen python openai_inference.py
The script uses o4-mini through the OpenAI responses API. If you encounter issues, ensure your organization is verified for your API key.
Format data for the custom CountdownCodeRewardManager in verl:
cd datagen
python format_for_rl.pyThis prepares data so that verl can compute both:
- True Reward: Based on actual problem correctness
- Proxy Reward: What the model might exploit (placeholder for specification gaming)
- PyTorch 2.6
- CUDA 12.8
- Flash Attention 2.7.4 (compatible with GLIBC < 2.32)
Run the setup script to install all dependencies:
bash setup.shAlternatively, manual setup:
# Install PyTorch and dependencies
pip install torch==2.6 --index-url https://download.pytorch.org/whl/cu128
pip install flash-attn==2.7.4
# Install verl and other requirements
pip install -e .Fine-tune a model using SFT on the prepared data:
cd verl/reasoning-safety
bash configs/sft/run_qwen2.5-coder-7b-lora.shNote: Check and adjust file paths in the script before running.
For other model configurations, see available scripts in configs/sft/.
After training, merge LoRA adapters into the base model:
python -m verl.model_merger merge \
--backend fsdp \
--local_dir <path_to_trained_model> \
--target_dir <output_directory> \
--merge-loraTrain a model with RLVR on the corresponding subset:
cd verl/reasoning-safety
bash configs/rlvr/run_qwen2.5-coder-7b-lora.shNote: Verify paths and compute settings in the script before launching.
For other setups, see available scripts in configs/rlvr/.
Evaluate the hacking rate of trained or off-the-shelf models using the COUNTDOWN-CODE environment:
cd environments/countdown_code
# Refer to README.md in this directory for detailed evaluation setupThe evaluation will:
- Compare true vs. proxy reward optimization
- Quantify specification gaming rate
- Provide interpretable feedback on where the model cut corners
For detailed instructions, see environments/countdown_code/README.md.
Issue: Import errors when running scripts
Solution: Ensure you've completed the setup and activated the correct environment
Issue: Path errors in shell scripts
Solution: Edit scripts to use absolute paths for your system, or run from the specified directories
Issue: CUDA/GPU errors
Solution: Verify CUDA 12.8 installation and check nvidia-smi for GPU availability
This project builds on:
- Countdown-Tasks-3to4 dataset
- Verl framework for RL training
- OpenAI API for distillation traces
If you use Countdown-Code in your research, please cite this work:
@article{countdowncode,
title={Countdown-Code: A Testbed for Studying The Emergence and Generalization of Reward Hacking in RLVR},
author={Muhammad Khalifa and Zohaib Khan and Omer Tafveez and Hao Peng and Lu Wang},
year={2026},
eprint={2603.07084},
archivePrefix={arXiv}
}For questions or feedback, please open an issue on GitHub.
