feat(evals): add Promptfoo-based AI tool call evaluation suite#2351

Draft
tupizz wants to merge 20 commits into main from feat/ai-eval-suite

Conversation


@tupizz tupizz commented Mar 10, 2026

Summary

Adds an automated evaluation suite for validating LLM tool call quality across SuperDoc's Document Engine API. This is the foundation for measuring and improving how well models use our tools.

What it does

Given a document editing task (e.g., "Find the indemnification clause and rewrite it"), the suite checks:

  1. Did the model call the right tool? (query_match, not get_document_text)
  2. Did it pass correct arguments? (select.type: "text", not a bare string)
  3. Did it follow production rules? (no mixed rewrite + format batches, correct require values)
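As a sketch of how checks like these are enforced: Promptfoo custom assertions are plain CommonJS exports that receive the model output and a context object, and return a pass/fail grading result. The real `validOpNames` lives in `lib/assertions.cjs`; the allow-list below is a hypothetical stand-in for the policy-driven one.

```javascript
// Sketch of a custom Promptfoo assertion (lib/assertions.cjs style).
// The real validOpNames derives its allow-list from tools-policy.json;
// this hardcoded set is illustrative only.
const ESSENTIAL_TOOLS = new Set([
  'query_match',
  'get_document_text',
]);

// Promptfoo invokes a named export with (output, context) and accepts
// a boolean or a { pass, score, reason } grading result.
function validOpNames(output, context) {
  // Output is assumed to be OpenAI-style: [{ function: { name, arguments } }]
  const calls = JSON.parse(output);
  const bad = calls
    .map((c) => c.function.name)
    .filter((name) => !ESSENTIAL_TOOLS.has(name));
  return bad.length === 0
    ? { pass: true, score: 1, reason: 'all tool names valid' }
    : { pass: false, score: 0, reason: `unknown tools: ${bad.join(', ')}` };
}

module.exports = { validOpNames };
```

Returning a `reason` string is what lets failures surface as descriptive messages in the eval report rather than a bare FAIL.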

Architecture

User prompt          "Replace all mentions of Company A with Acme Corp"
      |
      v
Promptfoo sends      prompt + 6 tool definitions --> LLM (GPT-4o, GPT-5.4, etc.)
to the LLM                    |
                              v
LLM returns          [{ function: { name: "query_match", arguments: {...} } }]
tool calls                    |
                              v
Assertions check     tool-call-f1 + file://lib/assertions.cjs:validOpNames
the output                    |
                              v
Result               PASS / FAIL with reason
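Concretely, the assertion step above corresponds to entries like the following in a test file. This is a sketch: the `description`/`vars` names are illustrative, and the exact `tool-call-f1` value schema is an assumption, while the `javascript` assertion reference follows Promptfoo's `file://path:function` convention.

```yaml
# Sketch of one deterministic test case (tests/*.yaml style).
- description: find/replace routes to query_match
  vars:
    task: Replace all mentions of Company A with Acme Corp
  assert:
    - type: tool-call-f1
      value:
        - query_match
    - type: javascript
      value: file://lib/assertions.cjs:validOpNames
```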

Three eval configs for different purposes

promptfooconfig.yaml                 Main suite: 4 OpenAI models, 25 deterministic tests
                                     Cost: ~$0.30 per run. Cached re-runs: free.

promptfooconfig.cross-provider.yaml  Cross-provider: GPT-5.4 vs Claude vs Gemini
                                     Tests with BOTH full prompt and minimal prompt
                                     to measure system prompt value.

promptfooconfig.gdpval.yaml          GDPval benchmark: Model+SuperDoc vs Model-Only
                                     Uses llm-rubric (costs ~$1-2 per run).
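For orientation, the main config follows Promptfoo's standard shape (prompts, providers, tests). An abridged sketch, with the caveat that the real file lists 4 models and 25 tests and the exact provider `config` keys here are assumptions:

```yaml
# promptfooconfig.yaml -- abridged sketch, not the full suite
prompts:
  - file://prompts/agent.txt
providers:
  - id: openai:gpt-4o
    config:
      tools: file://lib/essential.json
      tool_choice: required
tests:
  - file://tests/tool-tests.yaml
  - file://tests/workflows.yaml
```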

File structure

evals/
  promptfooconfig.yaml                 Configs (root level, Promptfoo convention)
  promptfooconfig.cross-provider.yaml
  promptfooconfig.gdpval.yaml

  prompts/                             What we send to LLMs
    agent.txt                          Full system prompt (from labs agent)
    minimal.txt                        Minimal prompt (customer simulation)

  tests/                               What we check
    tool-tests.yaml                    15 tests: tool selection + args + correctness
    workflows.yaml                     11 tests: find/replace, tracked changes, lists
    cross-provider.yaml                10 tests: realistic customer prompts
    gdpval-workflows.yaml              5 tests: Model+SuperDoc vs baseline

  lib/                                 Helper code
    assertions.cjs                     Shared assertion functions (15 exports)
    normalize.cjs                      Cross-provider output normalization
    extract.mjs                        Tool extraction from SDK artifacts
    save-baseline.mjs                  Save versioned result snapshot
    compare-baselines.mjs              Compare two snapshots

How tools are loaded

Tool definitions come from the SDK-generated packages/sdk/tools/tools.openai.json. The extraction script reads tools-policy.json for the essential tool list and writes a subset to lib/essential.json (gitignored). This keeps tool definitions DRY and automatically picks up upstream changes.

packages/sdk/tools/tools-policy.json  -->  lib/extract.mjs  -->  lib/essential.json
packages/sdk/tools/tools.openai.json       (reads both)          (6 tools, gitignored)
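The filtering step can be sketched as a pure function over the two inputs. The `essential` field name on the policy object is an assumption here; the real schema is whatever tools-policy.json defines, and the actual lib/extract.mjs also handles file I/O.

```javascript
// Sketch of the lib/extract.mjs filtering step: keep only tools whose
// names appear in the policy's essential list. The "essential" field
// name is an assumed policy shape, not the confirmed schema.
function selectEssential(allTools, policy) {
  const wanted = new Set(policy.essential);
  return allTools.filter((t) => wanted.has(t.function && t.function.name));
}

// Example with OpenAI-format tool definitions:
const catalog = [
  { type: 'function', function: { name: 'query_match' } },
  { type: 'function', function: { name: 'legacy_tool' } },
];
const subset = selectEssential(catalog, { essential: ['query_match'] });
console.log(subset.map((t) => t.function.name)); // -> [ 'query_match' ]
```

Keeping the selection logic pure like this makes it easy to unit-test without touching the SDK files on disk.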

Cross-provider output normalization

LLM providers return tool calls in different formats:

  • OpenAI: [{function: {name, arguments}}]
  • Anthropic: {type: "tool_use", name, input}
  • Google: [{functionCall: {name, args}}]

lib/normalize.cjs converts all formats to OpenAI's array format so assertions work across providers.
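A minimal sketch of that conversion, covering exactly the three shapes listed above (the real normalize.cjs presumably handles more edge cases and malformed outputs):

```javascript
// Sketch of lib/normalize.cjs: coerce provider-specific tool-call shapes
// into OpenAI's [{ function: { name, arguments } }] array so one set of
// assertions works across providers.
function normalizeToolCalls(raw) {
  const items = Array.isArray(raw) ? raw : [raw];
  return items.map((item) => {
    if (item.function) {
      // OpenAI: already in the target shape
      return item;
    }
    if (item.type === 'tool_use') {
      // Anthropic: { type: "tool_use", name, input }
      return { function: { name: item.name, arguments: JSON.stringify(item.input) } };
    }
    if (item.functionCall) {
      // Google: { functionCall: { name, args } }
      return { function: { name: item.functionCall.name, arguments: JSON.stringify(item.functionCall.args) } };
    }
    throw new Error('Unrecognized tool call shape');
  });
}
```

Note that OpenAI serializes `arguments` as a JSON string while Anthropic and Google return objects, so normalization also stringifies the argument payloads.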

Test results (initial baseline)

| Model | Pass Rate | Notes |
| --- | --- | --- |
| GPT-4o | 100% (25/25) | Best tool-calling accuracy |
| GPT-4.1 | 100% (25/25) | Matches GPT-4o with `tool_choice: required` |
| GPT-5.4 | 96% (24/25) | Calls `get_document_text` for list ops |
| GPT-4.1-mini | 96% (24/25) | Uses `text.insert` for headings |

Cross-provider results (minimal vs full prompt)

| Model + Prompt | Pass Rate |
| --- | --- |
| GPT-5.4 + Full prompt | 100% |
| GPT-5.4 + Minimal prompt | 80% |
| Claude Sonnet 4.6 + Full prompt | 80% |
| Gemini 2.5 Pro + Minimal prompt | 70% |
| Claude Sonnet 4.6 + Minimal prompt | 40% |

Key finding: our system prompt doubles Claude's accuracy (40% to 80%), strong evidence that our tool documentation carries real value.

Why Promptfoo

Evaluated Promptfoo, Braintrust, DeepEval, Langfuse, and OpenAI Evals. Promptfoo won because:

  • TypeScript native (matches our stack)
  • YAML-driven (no code to run tests)
  • Built-in tool-call-f1 assertion type
  • Caching (re-runs after assertion changes are free)
  • Web UI for inspecting results
  • GitHub Actions support for CI

Commands

pnpm run extract-tools   # Extract tools from SDK (run once)
pnpm run eval            # Run main suite (~$0.30)
pnpm run eval:cross      # Cross-provider comparison
pnpm run eval:gdpval     # GDPval benchmark (~$1-2)
pnpm run eval:view       # Open web UI

Test plan

  • pnpm run extract-tools extracts 6 tools
  • pnpm run eval passes 97%+ on 4 OpenAI models
  • pnpm run eval:cross runs across 3 providers
  • Assertions produce descriptive failure reasons
  • Cached re-runs are instant and free
  • CI workflow (future PR)

Add automated evaluation infrastructure for validating LLM tool call
quality across SuperDoc's Document Engine API. Tests whether models
select the correct tools and construct valid arguments when given
document editing tasks.

The suite extracts 6 essential tool definitions from the SDK and
runs them against multiple OpenAI models and cross-provider comparisons
(Anthropic, Google). Includes deterministic assertions for tool
selection, argument accuracy, and production correctness rules learned
from the labs agent implementation.
tupizz added 19 commits March 10, 2026 10:15
Updated the GDPval benchmark configuration to include distinct prompts for SuperDoc tool-augmented and baseline models. Enhanced the test assertions in the GDPval workflows to provide clearer scoring criteria for model responses, focusing on the specificity and executable nature of the responses. Adjusted thresholds for scoring to better reflect the quality of tool calls and text descriptions in document editing tasks.
Changed the model identifier from GPT-4o to GPT-5.4 in the GDPval benchmark configuration for both SuperDoc tool-augmented and baseline prompts, ensuring alignment with the latest model updates.
…oc agent

Introduced a new execution test suite for the SuperDoc agent, validating real document editing capabilities through the CLI. Added a new script command for executing these tests and updated the GDPval configuration to reflect the latest GPT model version. Included necessary dependencies and created a new provider for the SuperDoc agent to facilitate the execution of tool calls against DOCX files.
… validation

Updated the SuperDoc agent to create temporary copies of documents for editing, ensuring original fixtures remain unaltered. Implemented round-trip validation to verify that edits persist after saving and re-opening DOCX files. Added a new memorandum fixture and expanded execution tests to cover various document editing scenarios, enhancing overall test coverage and reliability.
…rvation

Enhanced the SuperDoc agent to include a `keepFile` option, allowing users to save edited documents to a specified output directory. Updated the logic to create the output directory if it doesn't exist and modified the cleanup process to conditionally copy the edited document based on this new option. Adjusted execution tests to validate the new functionality, ensuring comprehensive coverage of document editing scenarios.
…ate execution logic

Enhanced the SuperDoc agent's execution configuration by increasing the `maxConcurrency` from 1 to 5, allowing for more efficient concurrent test execution. Updated the cleanup process to ensure isolated state directories are properly managed, improving resource handling during tests. Adjusted execution tests to reflect these changes, ensuring robust validation of document editing capabilities.
…ool configuration

Refactored the SuperDoc agent's evaluation scripts to streamline the execution process and improve clarity. Removed the deprecated cross-provider configuration and consolidated tool evaluation logic into a unified structure. Introduced new assertion checks for tool quality and argument accuracy, ensuring comprehensive validation of document editing tasks. Updated the test suite to reflect these changes, enhancing overall test coverage and reliability.
…or SuperDoc agent

Introduced the AI Gateway API key in the environment configuration to enable optional integration with Vercel AI Gateway. Added a new script command for executing evaluations through the gateway, enhancing the SuperDoc agent's capabilities. Created a new YAML configuration file for execution tests via the AI Gateway, allowing for testing across multiple models. Updated the package dependencies to include the necessary SDK for AI Gateway functionality.
…mer prompt tests

Updated the SuperDoc agent to include tracking of total usage and steps during text generation, improving performance insights. Added a series of customer prompt tests in YAML format to validate various document editing tasks, ensuring comprehensive coverage of real-world scenarios. This enhancement aims to bolster the agent's capabilities and testing framework.
…d files

Removed the JavaScript assertion file and context builder, simplifying the evaluation framework. Updated the prompt configuration to eliminate unused metrics and added new document fixtures for testing. Enhanced execution tests to validate document editing capabilities with the new fixtures, ensuring comprehensive coverage of various scenarios.
Updated the model labels in the execution gateway configuration for clarity and accuracy. Refined the execution test descriptions to better reflect the specific tasks being validated, enhancing the readability and intent of the tests. Commented out deprecated Google provider configurations to streamline the YAML files.
…files

Updated the .gitignore to exclude temporary files and removed deprecated YAML configuration files related to GDPval and execution tests. Streamlined the package.json by eliminating unused evaluation scripts, enhancing overall project organization and clarity.
…agement

Updated pnpm-lock.yaml to reflect new versions of dependencies, including @types/node and added new SDK entries for SuperDoc. Modified .gitignore to exclude additional temporary files and states, improving project cleanliness and organization.
Added a caching system to the SuperDoc agent and gateway providers to improve performance by storing and retrieving results based on a generated cache key. Updated the utility functions to handle cache operations, ensuring efficient reuse of previous evaluation results. Modified the evaluation logic to check for cached results before executing tasks, enhancing overall efficiency in the evaluation framework. Additionally, updated the package.json to reflect changes in evaluation scripts and added a new YAML configuration for end-to-end tests via the AI Gateway.
…nhanced documentation

Updated the evaluation framework to include two levels of testing: tool quality and execution. Enhanced the README to clarify testing processes, commands, and configurations. Introduced new YAML files for tool quality and execution tests, detailing the number of tests and providers involved. Improved command descriptions for better usability and added new document fixtures for comprehensive testing of document editing capabilities.
Introduced a new Vercel tools provider for the SuperDoc evaluation framework, enabling structured tool calls with the Vercel AI SDK. Updated the package.json to include a new script for evaluating tools with the Vercel configuration. Enhanced the prompt configuration by adding a new YAML file for tool evaluations and refined existing evaluation scripts to support the new provider. Additionally, made minor adjustments to the presentation HTML for improved accessibility and clarity.
…oc/common dependency

Removed outdated naive-ui entries and added @superdoc/common as a workspace dependency in pnpm-lock.yaml, ensuring the project reflects the latest dependency structure.