Our paper shows how reasoning models can use additional test-time compute to improve their confidence allocation and deliver stronger performance in selective question answering.
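As a rough illustration of the selective setting (this is not the evaluation code from this repo), the sketch below assumes each prediction carries a confidence score in [0, 1] and that the model abstains whenever that score falls below a threshold. The names `Prediction` and `selective_metrics` are hypothetical, and the example data is made up purely to show the mechanics.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    answer: str        # model's final answer
    confidence: float  # assumed confidence score in [0, 1]
    gold: str          # reference answer


def selective_metrics(predictions, threshold):
    """Answer only when confidence >= threshold; abstain otherwise.

    Returns (coverage, accuracy), where coverage is the fraction of
    questions answered and accuracy is measured on answered questions only.
    """
    answered = [p for p in predictions if p.confidence >= threshold]
    coverage = len(answered) / len(predictions) if predictions else 0.0
    correct = sum(p.answer == p.gold for p in answered)
    accuracy = correct / len(answered) if answered else 0.0
    return coverage, accuracy


# Toy data, invented for illustration only.
preds = [
    Prediction("42", 0.95, "42"),
    Prediction("7", 0.60, "8"),
    Prediction("13", 0.90, "13"),
    Prediction("5", 0.30, "5"),
]

for t in (0.0, 0.8):
    cov, acc = selective_metrics(preds, t)
    print(f"threshold={t:.1f} coverage={cov:.2f} accuracy={acc:.2f}")
```

Raising the threshold trades coverage for accuracy on the questions that are still answered, which is the trade-off selective question answering measures.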
Be sure to install our version of vllm, which is altered from the one in the original s1 repo:
pip install -r requirements.txt
cd eval/lm-evaluation-harness
pip install -e .[vllm]
First, run:
scripts/generate_chains.sh
Then run:
scripts/incremental_answers.sh
Then use notebooks/figures_aime.ipynb to recreate the plots from the paper.
If you find our paper or code useful, consider citing us:
@misc{jurayj2025finalanswertesttimescaling,
title={Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering},
author={William Jurayj and Jeffrey Cheng and Benjamin Van Durme},
year={2025},
eprint={2502.13962},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.13962},
}