Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering

Our paper shows how reasoning models can use additional test-time compute to improve their confidence allocation and deliver stronger performance in selective question answering.
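As a rough illustration of the selective setting, the sketch below scores a model that answers only when its confidence clears a threshold and abstains otherwise. It is not taken from this repository: the function name `selective_score`, the `(is_correct, confidence)` input format, and the scoring scheme (+1 for a correct answer, 0 for an abstention, a configurable penalty for a wrong answer) are all assumptions made for the example.

```python
from typing import List, Tuple


def selective_score(
    predictions: List[Tuple[bool, float]],
    threshold: float,
    wrong_penalty: float = 1.0,
) -> float:
    """Average utility when answering only if confidence >= threshold.

    Hypothetical helper, not part of this repo. Each prediction is an
    (is_correct, confidence) pair. Abstentions score 0, correct answers
    score +1, and wrong answers score -wrong_penalty (an assumed scheme).
    """
    if not predictions:
        raise ValueError("predictions must be non-empty")
    total = 0.0
    for is_correct, confidence in predictions:
        if confidence < threshold:
            continue  # abstain: contributes nothing to the total
        total += 1.0 if is_correct else -wrong_penalty
    return total / len(predictions)


# Toy data, invented for illustration only.
preds = [(True, 0.9), (False, 0.4), (True, 0.7), (False, 0.8)]
always_answer = selective_score(preds, threshold=0.0)
answer_if_confident = selective_score(preds, threshold=0.5)
```

On this toy data, raising the threshold from 0.0 to 0.5 drops one low-confidence wrong answer, so the average score rises. How much abstaining helps depends on how well confidence tracks correctness.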

Installation

Be sure to use our version of vllm, which is altered from the one in the original s1 repo:

pip install -r requirements.txt
cd eval/lm-evaluation-harness
pip install -e .[vllm]

Usage

First, run:

scripts/generate_chains.sh

Then, run:

scripts/incremental_answers.sh

Then use notebooks/figures_aime.ipynb to recreate the plots from the paper.

Citing

If you find our paper or code useful, consider citing us:

@misc{jurayj2025finalanswertesttimescaling,
      title={Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering}, 
      author={William Jurayj and Jeffrey Cheng and Benjamin Van Durme},
      year={2025},
      eprint={2502.13962},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.13962}, 
}
