The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning

News

[2025/10/27] An updated version is out! It includes additional experiments and results (Llama model, λ ablation, and more). Check it out here! ✨
[2025/09/18] Our paper has been accepted to NeurIPS 2025! 🎉
[2025/06/01] We released our paper and code. 🚀

Quick Start

Installation

Our code is built on verl. We recommend using the Docker image provided by verl; see the verl documentation for details.

To set up a custom environment instead:

conda create -y -n verl python=3.10.14 && conda activate verl
pip install -e .
pip install vllm==0.8.2
pip install latex2sympy2
pip install fire
pip install tensordict==0.7.2
python -m pip install flash-attn --no-build-isolation

For training Qwen3 models, you will need to upgrade to vllm==0.8.5 and transformers==4.52.2.
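Version mismatches between these pinned packages are a common source of setup problems, so it can help to confirm what is installed before launching a run. The following is a minimal sketch, not part of the repository: the helper name check_versions and the EXPECTED table are hypothetical, and the pins are copied from the steps above. For Qwen3, swap in vllm 0.8.5 and add transformers 4.52.2. Builds that carry a local version suffix would be reported as mismatches by this exact-match comparison.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the installation steps above (non-Qwen3 setup).
# This table is illustrative; adjust it for the Qwen3 setup.
EXPECTED = {"vllm": "0.8.2", "tensordict": "0.7.2"}


def check_versions(expected):
    """Return {package: (installed_or_None, wanted)} for every mismatch."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            # Package is not installed in the active environment.
            installed = None
        if installed != wanted:
            mismatches[name] = (installed, wanted)
    return mismatches


if __name__ == "__main__":
    for name, (installed, wanted) in check_versions(EXPECTED).items():
        print(f"{name}: installed {installed}, expected {wanted}")
```

If the script prints nothing, every package in the table matches its pin.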

Training

PSR, NSR, W-REINFORCE: Set advantage in run_qwen2.5-math-7b_psr_nsr.sh to train the model with PSR, NSR, or W-REINFORCE. For W-REINFORCE, also set positive_advantage_weight, which corresponds to λ in the paper; the recommended value is 0.1.

bash run_qwen2.5-math-7b_psr_nsr.sh
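To make the role of positive_advantage_weight concrete, here is a rough sketch of how the three objectives differ in the per-response advantage, assuming a binary correctness reward as in the paper: PSR updates only on correct responses, NSR only on incorrect ones, and W-REINFORCE keeps the full negative signal while scaling the positive one by λ. The function name and the mode strings are hypothetical and are not taken from the repository; the actual computation lives in the verl-based trainer and may differ in detail.

```python
def sequence_advantage(is_correct, mode, positive_advantage_weight=0.1):
    """Illustrative per-response advantage for PSR, NSR, and W-REINFORCE.

    `mode` values are hypothetical labels, not the strings used by the
    training script.
    """
    if mode == "psr":
        # Positive sample reinforcement: only correct responses contribute.
        return 1.0 if is_correct else 0.0
    if mode == "nsr":
        # Negative sample reinforcement: only incorrect responses contribute.
        return 0.0 if is_correct else -1.0
    if mode == "w_reinforce":
        # Weighted REINFORCE: down-weight the positive term by lambda.
        return positive_advantage_weight if is_correct else -1.0
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    for mode in ("psr", "nsr", "w_reinforce"):
        print(mode, sequence_advantage(True, mode), sequence_advantage(False, mode))
```

Under this sketch, setting λ to 1 gives correct and incorrect responses equal magnitude, and setting it to 0 reduces W-REINFORCE to NSR.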

PPO

bash run_qwen2.5-math-7b_ppo.sh

GRPO

bash run_qwen2.5-math-7b_grpo.sh

Evaluation

Set MODEL_PATH and OUTPUT_DIR in eval.sh, then run bash eval.sh.

Calculate Pass@k: python calculate_metrics.py --file_path <file_to_evaluate>
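For readers unfamiliar with the metric, Pass@k is commonly computed per problem with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of sampled responses and c the number of correct ones. The sketch below shows that estimator only; it is an assumption that calculate_metrics.py follows the same formula, and the function name pass_at_k is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    # Probability that a random k-subset contains no correct sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # One correct response out of four samples.
    print(pass_at_k(n=4, c=1, k=1))
```

The benchmark-level score is then the mean of this value over all problems.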

Troubleshoot

Out-of-memory error: Decrease actor_rollout_ref.actor.ppo_max_token_len_per_gpu, actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu, and actor_rollout_ref.ref.log_prob_max_token_len_per_gpu.
Training hangs after "Started a local Ray instance.": Add num_cpus=N to ray.init() in verl/trainer/main_ppo.py, for example, ray.init(num_cpus=4, runtime_env={'env_vars': {'TOKENIZERS_PARALLELISM': 'true', 'NCCL_DEBUG': 'WARN'}})

Citation

If you find our paper or code useful, please consider citing our work:

@article{zhu2025rlvr-decomposed,
  title={The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning},
  author={Zhu, Xinyu and Xia, Mengzhou and Wei, Zhepei and Chen, Wei-Lin and Chen, Danqi and Meng, Yu},
  journal={arXiv preprint arXiv:2506.01347},
  year={2025}
}

