GameplayQA

GameplayQA is a benchmark for decision-dense, POV-synced video question answering in 3D gameplay environments. It provides multiple-choice questions grounded in short gameplay segments, including single-view temporal reasoning and synchronized multi-view reasoning. The benchmark is described in the paper GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents.

The released dataset is on Hugging Face (wangyz1999/GameplayQA). This repository contains:

data/qa.csv: the main GameplayQA benchmark.
data/qa_generalization.csv: the generalization benchmark.
data/annotation/: source videos and annotation assets referenced by the CSV files.

The commands below run the standard benchmark workflow. Full command-line options are documented in args.md.

Quick Start

We use uv for package management.

This repository requires FFmpeg to be installed and available on your PATH for video preprocessing.
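To confirm the FFmpeg requirement before preprocessing, a small check along these lines may help. It is a sketch, not part of the repository: `find_ffmpeg` is a hypothetical helper name, and it only uses the standard-library `shutil.which` lookup.

```python
import shutil
from typing import Optional


def find_ffmpeg() -> Optional[str]:
    """Return the full path to the ffmpeg executable, or None if it is not on PATH."""
    # shutil.which searches the directories listed in PATH, like a shell would.
    return shutil.which("ffmpeg")


if __name__ == "__main__":
    path = find_ffmpeg()
    if path is None:
        print("ffmpeg was not found on PATH; install it before running process_data.py")
    else:
        print(f"ffmpeg found at {path}")
```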

uv sync                           # install dependencies
uv run download_data.py           # download data from Hugging Face

Rename .env.example to .env, then edit .env and set the API keys for the model providers you plan to use (you only need the keys you will actually call).

Standard Workflow

Dataset split (`--split`)

qa: main split of gameplay videos, questions in data/qa.csv (default if you omit --split).
generalization: generalization split, questions in data/qa_generalization.csv.

Parallel workers (`--workers`)

run_benchmark.py: -w / --workers sets how many questions are benchmarked in parallel (default 5). Increase it if your API rate limits allow; decrease it if you see throttling or want a lighter load.
With --judge: --judge-workers controls parallel judging workers (default 5).
process_data.py: --workers controls parallel frame extraction (default 4); see args.md.
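For intuition about what a worker count controls, here is a generic sketch of bounded parallelism with a thread pool. It is not the repository's implementation, which may differ; `answer_question` and `run_parallel` are hypothetical names, and the function body stands in for one model API call.

```python
from concurrent.futures import ThreadPoolExecutor


def answer_question(question: str) -> str:
    # Hypothetical stand-in for a single model API call.
    return f"answer to {question}"


def run_parallel(questions, workers=5):
    """Process questions with at most `workers` of them in flight at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(answer_question, questions))
```

With a pool like this, a larger worker count issues more concurrent requests, which is why the value should track your API rate limits.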

Prepare the benchmark split. This crops each question's video segment and builds the frame cache used by frame-based vision models.

uv run process_data.py --split qa

Run a model on the prepared split.

uv run run_benchmark.py --split qa --model gpt-5-mini --judge

Example with higher parallelism (tune to your API limits):

uv run run_benchmark.py --split qa --model gpt-5-mini --judge --workers 8 --judge-workers 8

If you run the benchmark without --judge, judge the raw model responses separately.

uv run llm_as_a_judge.py --split qa --model gpt-5-mini

Evaluate and plot the judged results.

uv run evaluate_results.py --split qa
uv run plot_results.py --split qa

The same workflow can be run on the generalization split by replacing qa with generalization.
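The steps above can be chained from Python, for example to run both splits in one go. This is a sketch under the assumption that the scripts are invoked exactly as shown above; `workflow_commands` and `run_workflow` are hypothetical helpers, not part of the repository.

```python
import subprocess


def workflow_commands(split="qa", model="gpt-5-mini"):
    """Build the standard workflow as a list of argv lists, in execution order."""
    return [
        ["uv", "run", "process_data.py", "--split", split],
        ["uv", "run", "run_benchmark.py", "--split", split, "--model", model, "--judge"],
        ["uv", "run", "evaluate_results.py", "--split", split],
        ["uv", "run", "plot_results.py", "--split", split],
    ]


def run_workflow(split="qa", model="gpt-5-mini"):
    """Run each step in order, stopping at the first step that fails."""
    for cmd in workflow_commands(split, model):
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(cmd, check=True)
```

Calling `run_workflow("generalization")` would then run the same four steps on the generalization split.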

Outputs

The default output paths are stable, so each pipeline step can automatically find the previous step's result:

Raw benchmark results: results/raw/<split>/<model>.jsonl
Judged results: results/judged/<split>/<model>_judged.jsonl
Evaluation tables: results/evaluation/<split>/
Plots: results/plots/<split>/
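When scripting around the pipeline, the default locations listed above can be computed rather than typed by hand. The helper below is a sketch: `default_output_paths` is a hypothetical name, and it simply mirrors the path patterns in the list.

```python
from pathlib import Path


def default_output_paths(split, model, root="results"):
    """Return the default output locations for one split and model."""
    base = Path(root)
    return {
        "raw": base / "raw" / split / f"{model}.jsonl",
        "judged": base / "judged" / split / f"{model}_judged.jsonl",
        "evaluation": base / "evaluation" / split,
        "plots": base / "plots" / split,
    }
```

For example, `default_output_paths("qa", "gpt-5-mini")["judged"]` points at the judged results file for gpt-5-mini on the main split.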

Ablation Studies

Run an ablation benchmark, judge it, and include it in evaluation:

uv run ablation_benchmark.py --split qa --model gpt-5-mini --ablation no_video
uv run llm_as_a_judge.py --split qa --model gpt-5-mini --ablation no_video
uv run evaluate_ablation.py --split qa

Available ablations are no_video, random_single_frame, and shuffled_frames.
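To cover all three ablations for one model, the commands can be generated in a loop. This is a sketch that reuses only the flags shown above; `ABLATIONS` and `ablation_commands` are hypothetical names, not part of the repository.

```python
# The three ablation names listed in this README.
ABLATIONS = ("no_video", "random_single_frame", "shuffled_frames")


def ablation_commands(split="qa", model="gpt-5-mini"):
    """Build benchmark and judge commands per ablation, then one evaluation command."""
    commands = []
    for ablation in ABLATIONS:
        shared = ["--split", split, "--model", model, "--ablation", ablation]
        commands.append(["uv", "run", "ablation_benchmark.py", *shared])
        commands.append(["uv", "run", "llm_as_a_judge.py", *shared])
    # Evaluation runs once, after every ablation has been benchmarked and judged.
    commands.append(["uv", "run", "evaluate_ablation.py", "--split", split])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` to execute it.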

Citation

If you use GameplayQA, please cite the paper:

@article{wang2026gameplayqa,
  title   = {GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents},
  author  = {Wang, Yunzhe and Xu, Runhui and Zheng, Kexin and Zhang, Tianyi and Kogundi, Jayavibhav Niranjan and Hans, Soham and Ustun, Volkan},
  year    = {2026},
  eprint  = {2603.24329},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url     = {https://arxiv.org/abs/2603.24329}
}
