MiroEval is a comprehensive evaluation framework for Deep Research systems, providing automated task generation and assessment across three complementary dimensions: Factual correctness, Point-wise quality, and Process quality.
All three evaluation modules share a unified data/ directory as their input data source. Each sub-project manages its own .env file for API keys (see .env.template in each sub-project).
MiroEval/
├── task_generation/ # Evaluation task generation pipeline
├── data/ # Shared data directory
│ ├── input_queries/ # Evaluation query sets + multimodal attachments
│ └── detail_results/ # Per-task per-model intermediate scores
├── factual_eval/ # Factual evaluation (MiroFlow-based fact-checking agent)
├── point_quality/ # Quality evaluation (adaptive point-wise scoring)
└── process_eval/ # Process evaluation (intrinsic process quality + report alignment)
Note: Model result files (one JSON array per model) should be placed in user-created directories such as `data/method_results/` (text-only) and `data/method_multimodal_results/` (multimodal). These directories are not included in the repository and must be created before running evaluations.
Automated pipeline for generating high-quality deep-research evaluation queries. Combines anonymized seed patterns from real user queries, real-time web trends, LLM generation, and multi-stage filtering (search validation, deep-research necessity, quality gating) to produce challenging evaluation tasks.
See task_generation/README.md for full details.
| File / Directory | Description | Count |
|---|---|---|
| `mirobench_text.json` | Text-only query set | 70 |
| `mirobench_multimodal.json` | Multimodal query set (with image/document attachments) | 30 |
| `multimodal-attachments/` | Attachment files referenced by multimodal queries, organized by query ID (e.g., `72/`, `93/`). Contains images, PDFs, and other documents. | — |
Query Schema (text-only):
{
"id": 1,
"chat_id": "uuid",
"rewritten_query": "Expanded/rewritten query",
"annotation": {
"category": "text",
"language": "zh | en",
"origin_id": 2,
"pattern": "T1 | T2 | T5 | T6",
"domain": "tech | finance | medical | ...",
"topic": "...",
"persona": "...",
"...": "additional fields omitted"
}
}

Query Schema (multimodal):
{
"id": 71,
"chat_id": "uuid",
"rewritten_query": "Expanded/rewritten query",
"files": [
{ "filename": "attachment_71_01.jpg", "type": "image", "dir": "multimodal-attachments/71/attachment_71_01.jpg", "size": "1.5 MB" }
],
"annotation": {
"category": "image | doc | multi_doc",
"language": "zh | en",
"origin_id": 102
}
}

Note: Text queries do not have a `files` field. Multimodal queries do not have `pattern` or `domain` fields.
Pattern Taxonomy (applies to text queries; roughly half of the text queries carry a pattern label):
- T1: Landscape Survey
- T2: Comparative Evaluation
- T5: Decision Analysis
- T6: Scheme Design
Domain Distribution: tech, finance, medical, engineering, business, humanities, science, lifestyle, cybersecurity, education, energy, geopolitics, health, legal, policy, trade, other
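For a quick orientation over the query set, here is a minimal sketch (assuming it is run from the repository root; the file name comes from the table above) that prints the pattern and domain distributions:

```python
import json
from collections import Counter

# Load the text query set and tally annotation labels.
with open("data/input_queries/mirobench_text.json", encoding="utf-8") as f:
    queries = json.load(f)

print(Counter(q["annotation"].get("pattern", "unlabeled") for q in queries))
print(Counter(q["annotation"].get("domain", "unspecified") for q in queries))
```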
One JSON file per model, containing a JSON array of complete query-response pairs. Place your model's output file (e.g., <model_name>_text.json) in the appropriate directory (these directories must be created by the user).
Result Schema:
{
"id": 1,
"chat_id": "uuid",
"rewritten_query": "Rewritten query",
"annotation": { "..." },
"response": "Model-generated research report",
"process": "String of research process"
}

The `response` field contains the model's final report output. The `process` field contains the model's intermediate research process trace (format varies by model). Multimodal entries additionally contain a `files` field (see Query Schema above).
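Before running any of the three evaluations, it may help to sanity-check a result file against this schema. A minimal sketch (the required-field list follows the schema above; multimodal entries would additionally carry `files`):

```python
import json

REQUIRED_FIELDS = ("id", "chat_id", "rewritten_query", "response", "process")

def validate_results(path: str) -> None:
    """Check that a method_results file is a JSON array with the expected fields."""
    with open(path, encoding="utf-8") as f:
        entries = json.load(f)
    assert isinstance(entries, list), "file must contain a JSON array"
    for entry in entries:
        missing = [k for k in REQUIRED_FIELDS if k not in entry]
        assert not missing, f"entry {entry.get('id')}: missing {missing}"

validate_results("data/method_results/mirothinker_v17_text_demo.json")
```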
Active fact-checking powered by the MiroFlow agent framework. Automatically extracts and verifies key factual statements in reports via search engines.
- Report Segmentation: Splits the model-generated report into logical segments
- Per-segment Fact-checking: Deploys an agent for each segment to gather evidence via web search
- Verdict: Labels each factual statement as `Right` (correct) / `Wrong` (incorrect) / `Unknown` (unverifiable). For multimodal evaluation, an additional `Conflict` label is used when web sources and attachment content provide contradictory evidence.
factual_eval/
├── config/ # Hydra configuration files
│ ├── benchmark/ # Base benchmark config (factual-eval.yaml)
│ ├── llm/ # LLM model configs
│ ├── tool/ # Tool configs (search, browsing, etc.)
│ ├── prompts/ # Prompt templates
│ ├── benchmark_factual-eval_text.yaml # Canonical config for text-only models
│ └── benchmark_factual-eval_multimodal.yaml # Canonical config for multimodal models
├── miroflow/ # MiroFlow core framework
│ ├── agents/ # Agent implementations (iterative + rollback)
│ ├── benchmark/ # Evaluation runners and verifiers
│ ├── llm/ # Multi-provider LLM support
│ ├── tool/ # MCP server tool integration
│ ├── io_processor/ # I/O processors (segmentation, summarization, etc.)
│ ├── logging/ # Task tracing and logging decorators
│ ├── skill/ # Skill manager and definitions
│ └── utils/ # Utility functions
├── utils/
│ ├── convert_to_factual_eval.py # Convert method_results JSON array → per-item files
│ └── check_progress_factual_eval.py
├── scripts/
│ ├── run_factual_eval.sh # Main run script
│ └── run_single_factual_eval_task.py # Single-task runner
├── .env.template # Environment variables template
└── pyproject.toml # Dependencies (Python >= 3.11)
Factual eval reads per-item JSON files from `data/factual-eval/<model-dir>/` inside `factual_eval/`.
The base data directory is configured via the `DATA_DIR` environment variable. The shell script defaults to `./data` (relative to `factual_eval/`), while the Hydra config falls back to `../../miroflow/data`. It is recommended to set `DATA_DIR` explicitly in your `.env` file (e.g., `DATA_DIR=/abs/path/to/factual_eval/data`).
Step 1 — Convert raw results to per-item files (one-time, skip if already done):
cd factual_eval
# Convert a method_results JSON array → individual files in factual_eval/data/factual-eval/
python utils/convert_to_factual_eval.py \
--input ../data/method_results/mirothinker_v17_text_demo.json \
--output-dir data/factual-eval/mirothinker-v17-text-demo

The output format is one JSON file per item (same schema as the source), named `<model-name>_<id>.json`.
The loader also supports reading directly from a JSON array file via --source-file (see Usage below).
For multimodal queries, attachment files are stored in data/input_queries/multimodal-attachments/<query_id>/ and referenced via the files field.
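Conceptually, the conversion step just splits the array into per-item files. A rough sketch of what `utils/convert_to_factual_eval.py` does (a hypothetical reimplementation; the actual script may differ in details):

```python
import json
from pathlib import Path

def convert_to_per_item_files(input_path: str, output_dir: str) -> None:
    """Split a method_results JSON array into one file per item."""
    items = json.loads(Path(input_path).read_text(encoding="utf-8"))
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    model_name = out.name  # e.g. "mirothinker-v17-text-demo"
    for item in items:
        target = out / f"{model_name}_{item['id']}.json"
        target.write_text(json.dumps(item, ensure_ascii=False, indent=2),
                          encoding="utf-8")
```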
cd factual_eval
# Install dependencies
uv sync
# Configure API keys (copy template and fill in values)
cp .env.template .env
# Edit .env with your API keys (OPENAI_API_KEY, SERPER_API_KEY, etc.)
# Optionally set DATA_DIR to the absolute path of miroflow/data/

cd factual_eval
# Evaluate a specific model (from pre-converted per-item files)
bash scripts/run_factual_eval.sh --model-dir mirothinker-v17-text
# Evaluate directly from a JSON array file (no pre-conversion needed)
bash scripts/run_factual_eval.sh --source-file mirothinker_v17_text_100.json
# Multimodal evaluation
bash scripts/run_factual_eval.sh \
--config config/benchmark_factual-eval_multimodal.yaml \
--model-dir mirothinker-v17-multimodal
# Limit number of tasks (for testing)
bash scripts/run_factual_eval.sh --model-dir chatgpt-text-only --max-tasks 5
# Control concurrency
bash scripts/run_factual_eval.sh --model-dir mirothinker-v17-text \
--max-concurrent 5 --max-concurrent-chunks 5
# Resume a previous run
bash scripts/run_factual_eval.sh --result-dir logs/factual-eval/prev_run

Each query produces a JSON result containing a `core_state` list:
{
"core_state": [
{
"statement": "The statement being verified",
"verification": "Right | Wrong | Unknown | Conflict",
"evidence": [
{ "source": "Evidence source URL", "excerpt": "Quoted key text from source" }
],
"reasoning": "Explanation of the verification reasoning and process"
}
]
}

Key Metric: Correct Statement Ratio = Right / (Right + Wrong + Unknown + Conflict)
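The metric can be computed from one such result file in a few lines; a minimal sketch (field names follow the `core_state` schema above):

```python
import json
from collections import Counter

def correct_statement_ratio(result_path: str) -> float:
    """Correct Statement Ratio = Right / (Right + Wrong + Unknown + Conflict)."""
    with open(result_path, encoding="utf-8") as f:
        result = json.load(f)
    counts = Counter(s["verification"] for s in result["core_state"])
    total = sum(counts.values())
    return counts["Right"] / total if total else 0.0
```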
Comprehensive Adaptive Point-wise Quality Evaluation that dynamically generates evaluation dimensions, criteria, and weights for each query task, enabling fine-grained quality assessment.
The evaluation pipeline consists of 5 stages:
- Dimension Generation: LLM generates 1-3 task-specific additional dimensions (supplementing 4 fixed dimensions)
- Weight Assignment: Assigns normalized weights to all dimensions (summing to 1.0)
- Criteria Generation: Generates 1-10 specific evaluation criteria per dimension
- Per-criteria Scoring: Scores the report against each criterion (0-10)
- Hierarchical Aggregation: Criteria scores -> dimension scores -> total weighted score (sketched in code after the table below)
4 Fixed Dimensions:
| Dimension | Description |
|---|---|
| Coverage | Breadth, depth, and relevance of coverage |
| Insight | Depth, originality, logic, and analytical value |
| Instruction Following | Accuracy in meeting all query requirements |
| Clarity | Readability, fluency, structure, and ease of understanding |
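A minimal sketch of the stage-5 aggregation, assuming each dimension carries a normalized weight and a list of 0-10 criterion scores (field names here are illustrative; the actual logic lives in `pointwise_core.py`):

```python
def aggregate(dimensions: list[dict]) -> dict:
    """Criteria scores -> dimension scores (mean) -> weighted total score."""
    dim_scores = {
        d["name"]: sum(d["criteria_scores"]) / len(d["criteria_scores"])
        for d in dimensions
    }
    total = sum(d["weight"] * dim_scores[d["name"]] for d in dimensions)
    return {"dimension_scores": dim_scores, "total_score": round(total, 2)}

# Example with two dimensions (weights must sum to 1.0 across all dimensions):
print(aggregate([
    {"name": "coverage", "weight": 0.6, "criteria_scores": [8, 9, 7]},
    {"name": "clarity", "weight": 0.4, "criteria_scores": [9, 10]},
]))  # -> total_score 8.6
```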
point_quality/
├── deepresearcharena/ # Core evaluation framework
│ ├── evaluator/ # Evaluator implementations
│ │ ├── base_evaluator.py
│ │ ├── pointwise_core.py
│ │ └── pointwise_evaluator.py
│ ├── prompts/ # Prompt templates
│ ├── cache/ # Caching system
│ ├── config/ # YAML configuration files
│ └── utils/ # LLM calls, config loading
├── outputs/ # Auto-created on first run; stores results and logs
└── run_batch_eval.py # Entry point script
Takes a model result JSON file from the shared data/ directory as input via the --input flag:
cd point_quality
python run_batch_eval.py --input ../data/method_results/mirothinker_v17_text_demo.json --model_name mirothinker_v17

The input file is a JSON array of entries, each containing `rewritten_query`, `response`, and optional `files` for attachments. Attachment file paths in the `dir` field (e.g., `multimodal-attachments/72/attachment_72_01.jpg`) are resolved relative to `{data_dir}/input_queries/multimodal/`.
cd point_quality
pip install openai python-dotenv pyyaml
# Configure API keys (copy template and fill in values)
cp .env.template .env
# Edit .env with your OPENAI_API_KEY (or OPENROUTER_API_KEY)

# Text-only evaluation
python run_batch_eval.py --input ../data/method_results/mirothinker_v17_text_demo.json --model_name mirothinker_v17
# Multimodal evaluation (attachments resolved automatically from data/input_queries/multimodal/)
python run_batch_eval.py --input ../data/method_multimodal_results/mirothinker_v17_multimodal_demo.json --model_name mirothinker_v17
# Specify evaluator model and query count
python run_batch_eval.py --input ../data/method_results/claude_text.json --model_name claude \
--evaluator_model gpt-5.1 --max_queries 50
# Reuse criteria from a previous run (only re-score)
python run_batch_eval.py --input ../data/method_results/gemini_text.json --model_name gemini \
--criteria_file outputs/mirothinker_v17_results.json

The configuration file is located at `deepresearcharena/config/pointwise.yaml`. Key fields:
evaluator_model:
name: "gpt-5.1" # Judge LLM
api_type: "auto" # auto (detect by model name), openai, or openrouter
temperature: 0.1
evaluation:
max_workers: 20 # Parallel workers
scoring:
score_range: [0, 10]
decimal_places: 2

Output format:

{
"summary": {
"models": {
"mirothinker": {
"average_total_score": 8.807,
"total_queries": 70,
"dimension_averages": {
"coverage_score": 8.5,
"insight_score": 8.6,
"instruction_following_score": 9.48,
"clarity_score": 9.36
}
}
}
},
"query_results": { ... }
}

Evaluates the quality of a model's research process (intermediate reasoning, search strategies, etc.) and the alignment between the process and the final report.
The evaluation consists of two phases:
Phase 1 - Structuring:
- Auto-detects different models' process trace formats (JSON array, block tags, step tags, plain text, etc.; a detection sketch follows the phase list below)
- Uses LLM to unify heterogeneous formats into a structured JSON schema (step list + global findings)
Phase 2 - Evaluation:
- Intrinsic Evaluation: 5 dimensions assessing the research process quality itself
- Alignment Evaluation: 3 dimensions assessing consistency between process and report
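To illustrate the Phase-1 auto-detection, here is a rough sketch; the tag patterns below are hypothetical, and the real preprocessors in `process_evaluator/preprocessors/` handle more formats and edge cases:

```python
import json
import re

def detect_trace_format(process: str) -> str:
    """Guess the format of a raw process trace (illustrative heuristics only)."""
    text = process.strip()
    if text.startswith("["):
        try:
            json.loads(text)
            return "json_array"
        except json.JSONDecodeError:
            pass
    if re.search(r"<step\b", text, re.IGNORECASE):        # hypothetical step-tag style
        return "step_tags"
    if re.search(r"<(thought|search|result)\b", text):    # hypothetical block-tag style
        return "block_tags"
    return "plain_text"
```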
8 Evaluation Dimensions:
| Type | Dimension | Description |
|---|---|---|
| Intrinsic | search_breadth | Diversity of sources and angles explored |
| Intrinsic | analytical_depth | Depth of analysis and insight |
| Intrinsic | progressive_refinement | Ability to iteratively deepen investigation |
| Intrinsic | critical_thinking | Cross-verification and critical reasoning |
| Intrinsic | efficiency | Conciseness and effectiveness of steps |
| Alignment | findings_to_report | Fraction of process findings covered in the report |
| Alignment | report_to_process | Whether report claims can be traced back to the process |
| Alignment | contradiction | Consistency between process and report (10 = fully consistent) |
process_eval/
├── run_pipeline.py # Entry point script
├── config/
│ ├── process_eval.yaml # Text-only evaluation config
│ └── process_eval_multimodal.yaml # Multimodal evaluation config
├── process_evaluator/ # Core package
│ ├── pipeline.py # Pipeline orchestrator
│ ├── data_loader.py # Data loading
│ ├── preprocessors/ # Multi-format preprocessors (auto-detection)
│ ├── structuring/ # LLM-based structuring
│ ├── evaluation/ # Intrinsic + alignment evaluators
│ ├── cache/ # Thread-safe JSON caching
│ └── utils/ # LLM client, config loading
├── .env.template # Environment variables template
└── requirements.txt
Reads model result files directly from the shared data directory. The data loader auto-discovers files by matching the `{model_name}_{data_type}*.json` pattern:
# config/process_eval.yaml
data:
data_dir: "../data/method_results" # Text-only
# config/process_eval_multimodal.yaml
data:
data_dir: "../data/method_multimodal_results" # MultimodalThe process field in each model result file contains the research process trace; the response field contains the final report.
cd process_eval
pip install -r requirements.txt # openai, pyyaml, tqdm, python-dotenv
# Configure API keys (copy template and fill in values)
cp .env.template .env
# Edit .env with your OPENAI_API_KEY (or OPENROUTER_API_KEY)

Supports both text-only and multimodal evaluation, selected via config file:
- Text-only (default): `config/process_eval.yaml` — reads from `data/method_results/`
- Multimodal: `config/process_eval_multimodal.yaml` — reads from `data/method_multimodal_results/`
# Text-only evaluation (default)
python run_pipeline.py
# Multimodal evaluation
python run_pipeline.py --config config/process_eval_multimodal.yaml
# Run structuring phase only
python run_pipeline.py --phase phase1
# Run evaluation phase only (requires phase1 to be completed first)
python run_pipeline.py --phase phase2
# Specify models and entry count
python run_pipeline.py --models claude gemini --max-entries 10
# Evaluate specific entry IDs with custom parallelism
python run_pipeline.py --entry-ids 1 2 3 --max-workers 4
# Clear cache and re-run
python run_pipeline.py --clear-cache

{
"summary": {
"mirothinker": {
"search_breadth": { "avg": 8.2, "count": 70 },
"analytical_depth": { "avg": 7.8, "count": 70 },
"progressive_refinement": { "avg": 8.1, "count": 70 },
"critical_thinking": { "avg": 7.5, "count": 70 },
"efficiency": { "avg": 7.9, "count": 70 },
"findings_to_report": { "avg": 8.3, "count": 70 },
"report_to_process": { "avg": 7.6, "count": 70 },
"contradiction": { "avg": 8.8, "count": 70 },
"intrinsic_avg": 8.1,
"alignment_avg": 8.23,
"overall_avg": 8.17
}
},
"entry_results": { ... }
}

| Aspect | Factual Eval | Point Quality | Process Eval |
|---|---|---|---|
| Goal | Report factual correctness | Report content quality | Research process quality |
| Method | Agent + web search verification | LLM multi-dimension scoring | LLM structuring + scoring |
| Data Input | response (report text) | response (report text) | process + response |
| Scoring Scale | Right / Wrong / Unknown (plus Conflict for multimodal) | 0-10 continuous | 1-10 integer |
| Judge LLM | GPT-5-mini (default) | GPT-5.1 (default) | GPT-5.2 (default) |
| Parallelism | Async + semaphore | ThreadPoolExecutor | ThreadPoolExecutor |
| Caching | None (agent state) | Multi-level JSON cache | Three-level JSON cache |
| Python | >= 3.11 (uv) | >= 3.10 (pip) | >= 3.10 (pip) |
@misc{ye2026miroevalbenchmarkingmultimodaldeep,
title={MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome},
author={Fangda Ye and Yuxin Hu and Pengxiang Zhu and Yibo Li and Ziqi Jin and Yao Xiao and Yibo Wang and Lei Wang and Zhen Zhang and Lu Wang and Yue Deng and Bin Wang and Yifan Zhang and Liangcai Su and Xinyu Wang and He Zhao and Chen Wei and Qiang Ren and Bryan Hooi and An Bo and Shuicheng Yan and Lidong Bing},
year={2026},
eprint={2603.28407},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2603.28407},
}

Apache-2.0

