GameplayQA is a benchmark for decision-dense, POV-synced video question answering in 3D gameplay environments. It provides multiple-choice questions grounded in short gameplay segments, including single-view temporal reasoning and synchronized multi-view reasoning. The benchmark is described in the paper GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents.
The released dataset is on Hugging Face (wangyz1999/GameplayQA). This repository contains:
- data/qa.csv: the main GameplayQA benchmark.
- data/qa_generalization.csv: the generalization benchmark.
- data/annotation/: source videos and annotation assets referenced by the CSV files.
The commands below run the standard benchmark workflow. Full command-line options are documented in args.md.
We use uv for package management.
This repository requires FFmpeg to be installed and available on your PATH for video preprocessing.
uv sync # install dependencies
uv run download_data.py # download data from Hugging Face
Rename .env.example to .env, then edit .env and set the API keys for the model providers you plan to use. Only the keys for providers you actually call are needed.
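Before running the later steps, it can help to confirm the prerequisites above are in place. The following is a minimal sketch, not part of the repository: the function names `missing_tools` and `env_file_ready` are hypothetical, and it only checks that the executables are on PATH and that a .env file exists, not that the keys inside it are valid.

```python
import shutil
from pathlib import Path


def missing_tools(tools=("ffmpeg", "uv")):
    """Return the names of required executables that are not on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def env_file_ready(repo_root="."):
    """Return True if a .env file exists in repo_root."""
    return (Path(repo_root) / ".env").is_file()


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    if not env_file_ready():
        print("No .env found; rename .env.example to .env and add your API keys.")
```

Running the script from the repository root prints nothing when FFmpeg, uv, and .env are all present.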
- qa: main split of gameplay videos, questions in data/qa.csv (default if you omit --split).
- generalization: generalization split, questions in data/qa_generalization.csv.
- run_benchmark.py: -w/--workers sets how many questions are benchmarked in parallel (default 5). Increase it if your API rate limits allow; decrease it if you see throttling or want a lighter load.
  - With --judge: --judge-workers controls parallel judging workers (default 5).
- process_data.py: --workers controls parallel frame extraction (default 4); see args.md.
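To illustrate what a worker count means in practice, here is a minimal sketch of bounded parallelism using a thread pool. It is not the repository's implementation: `answer_question` is a hypothetical stand-in for one model API call, and the use of threads is an assumption made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def answer_question(question):
    """Hypothetical stand-in for a single model API call."""
    return f"answer to {question}"


def run_in_parallel(questions, workers=5):
    """Process questions with at most `workers` calls in flight at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order even though calls overlap in time.
        return list(pool.map(answer_question, questions))
```

With this pattern, raising `workers` raises the number of simultaneous requests, which is why a higher value can trigger provider rate limits.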
Prepare the benchmark split. This crops each question's video segment and builds the frame cache used by frame-based vision models.
uv run process_data.py --split qa
Run a model on the prepared split.
uv run run_benchmark.py --split qa --model gpt-5-mini --judge
Example with higher parallelism (tune to your API limits):
uv run run_benchmark.py --split qa --model gpt-5-mini --judge --workers 8 --judge-workers 8
If you run the benchmark without --judge, judge the raw model responses separately.
uv run llm_as_a_judge.py --split qa --model gpt-5-mini
Evaluate and plot the judged results.
uv run evaluate_results.py --split qa
uv run plot_results.py --split qa
The same workflow can be run on the generalization split by replacing qa with generalization.
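The workflow above can also be driven from one script. The sketch below is a hedged convenience wrapper, not something shipped with the repository: `build_pipeline` and `run_pipeline` are hypothetical names, and the commands use only the scripts and flags shown in this README.

```python
import subprocess


def build_pipeline(split="qa", model="gpt-5-mini", workers=5, judge_workers=5):
    """Return the workflow steps as argv lists, in execution order."""
    return [
        ["uv", "run", "process_data.py", "--split", split],
        ["uv", "run", "run_benchmark.py", "--split", split, "--model", model,
         "--judge", "--workers", str(workers),
         "--judge-workers", str(judge_workers)],
        ["uv", "run", "evaluate_results.py", "--split", split],
        ["uv", "run", "plot_results.py", "--split", split],
    ]


def run_pipeline(**kwargs):
    """Run each step in order; check=True stops at the first failing step."""
    for argv in build_pipeline(**kwargs):
        subprocess.run(argv, check=True)
```

Calling `run_pipeline(split="generalization")` would run the same four steps on the generalization split. Building the commands separately from running them makes it easy to print and inspect them first.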
The default output paths are stable, so each pipeline step can automatically find the previous step's result:
- Raw benchmark results: results/raw/<split>/<model>.jsonl
- Judged results: results/judged/<split>/<model>_judged.jsonl
- Evaluation tables: results/evaluation/<split>/
- Plots: results/plots/<split>/
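If you want to locate these outputs from your own scripts, the path layout listed above can be reproduced as follows. This is a small sketch under the assumption that the defaults have not been overridden on the command line; `default_paths` is a hypothetical helper, not a function from the repository.

```python
from pathlib import Path


def default_paths(split, model, root="results"):
    """Build the default output locations for a given split and model."""
    base = Path(root)
    return {
        "raw": base / "raw" / split / f"{model}.jsonl",
        "judged": base / "judged" / split / f"{model}_judged.jsonl",
        "evaluation": base / "evaluation" / split,
        "plots": base / "plots" / split,
    }
```

For example, `default_paths("qa", "gpt-5-mini")["judged"]` points at results/judged/qa/gpt-5-mini_judged.jsonl.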
Run an ablation benchmark, judge it, and include it in evaluation:
uv run ablation_benchmark.py --split qa --model gpt-5-mini --ablation no_video
uv run llm_as_a_judge.py --split qa --model gpt-5-mini --ablation no_video
uv run evaluate_ablation.py --split qa
Available ablations are no_video, random_single_frame, and shuffled_frames.
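To cover all three ablations in one pass, the command sequence can be generated rather than typed by hand. The following is a sketch only: `ablation_commands` is a hypothetical helper, and it assumes a single evaluate_ablation.py call after all ablations have been run and judged is sufficient.

```python
ABLATIONS = ("no_video", "random_single_frame", "shuffled_frames")


def ablation_commands(split="qa", model="gpt-5-mini", ablations=ABLATIONS):
    """Return shell commands that run and judge each ablation, then evaluate."""
    commands = []
    for ablation in ablations:
        if ablation not in ABLATIONS:
            raise ValueError(f"unknown ablation: {ablation}")
        common = f"--split {split} --model {model} --ablation {ablation}"
        commands.append(f"uv run ablation_benchmark.py {common}")
        commands.append(f"uv run llm_as_a_judge.py {common}")
    # Evaluation runs once, after every ablation has been benchmarked and judged.
    commands.append(f"uv run evaluate_ablation.py --split {split}")
    return commands
```

Printing the returned list, one command per line, gives a script you can review before running it.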
If you use GameplayQA, please cite the paper:
@article{wang2026gameplayqa,
title = {GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents},
author = {Wang, Yunzhe and Xu, Runhui and Zheng, Kexin and Zhang, Tianyi and Kogundi, Jayavibhav Niranjan and Hans, Soham and Ustun, Volkan},
year = {2026},
eprint = {2603.24329},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
url = {https://arxiv.org/abs/2603.24329}
}