From 377d2abd6ff7184af0b4a41767729f586c325303 Mon Sep 17 00:00:00 2001
From: zjwu0522
Date: Fri, 12 Jun 2026 10:46:14 +0000
Subject: [PATCH] docs: mark Verified as the default task set; deprecate earlier versions

State that the repository's standard tasks are now the MCPMark Verified
set, that results on earlier task versions are deprecated and not
directly comparable, and ask reporters to label scores as MCPMark
Verified.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index e4c36739..043ece9c 100644
--- a/README.md
+++ b/README.md
@@ -14,13 +14,13 @@ An evaluation suite for agentic models in real MCP tool environments (Notion / G

 MCPMark provides a reproducible, extensible benchmark for researchers and engineers: one-command tasks, isolated sandboxes, auto-resume for failures, unified metrics, and aggregated reports.

-> 🚀 **MCPMark Verified is out** — a version-pinned, stabilized subset of the standard tasks for reproducible evaluation. On the Verified set, `gpt-5.5` (xhigh) leads at **92.9%** and `kimi-k2.7` reaches **81.1%**. See [#264](https://github.com/eval-sys/mcpmark/pull/264).
+> 🚀 **MCPMark Verified is now the default.** The standard tasks in this repository are the Verified set — every environment version-pinned and every verifier stabilized. Results from earlier task versions are deprecated and not directly comparable, so please report new numbers as **MCPMark Verified**. On the Verified set, `gpt-5.5` (xhigh) leads at **92.9%** and `kimi-k2.7` reaches **81.1%**. See [#264](https://github.com/eval-sys/mcpmark/pull/264).

 [![MCPMark](https://github.com/user-attachments/assets/dfc06a41-e387-45e3-bc98-db7097ffa3dc)](https://mcpmark.ai)

 ## News

-- 🚀 **12 Jun** — **MCPMark Verified** is out: a version-pinned, stabilized subset of the standard tasks. On the Verified set, `gpt-5.5` (xhigh) leads at **92.9%** and `kimi-k2.7` reaches **81.1%**. See [#264](https://github.com/eval-sys/mcpmark/pull/264).
+- 🚀 **12 Jun** — **MCPMark Verified** is now the default task set: every environment version-pinned and every verifier stabilized. Earlier task versions are deprecated; please report scores as **MCPMark Verified**. On the Verified set, `gpt-5.5` (xhigh) leads at **92.9%** and `kimi-k2.7` reaches **81.1%**. See [#264](https://github.com/eval-sys/mcpmark/pull/264).
 - 📣 **27 May** — The previous Notion Source Hub page is deprecated; please use the new link: [MCPMark Source Hub](https://gossamer-sawfish-47c.notion.site/MCPMark-Source-Hub-dc32b7e8cebd82b8959b81ae322df87a).
 - 📌 **21 Jan** — Pinned MCP server versions for reproducible benchmarks: GitHub MCP Server `v0.15.0` (switched to Docker for version control), Notion MCP Server `@1.9.1` (Notion released 2.0 but it has many bugs, not recommended). See [#246](https://github.com/eval-sys/mcpmark/pull/246).
 - 🔥 **13 Dec** — Added auto-compaction support (`--compaction-token`) to summarize long conversations and avoid context overflow during evaluation ([#236](https://github.com/eval-sys/mcpmark/pull/236])).