feat: Support cross-database evaluation with SQLite ground truth by arieljassan · Pull Request #465 · GoogleCloudPlatform/evalbench

arieljassan · 2026-06-28T17:26:22Z

Overview

This PR adds support for running the BIRD benchmark (and similar cross-database SQL evaluations) on BigQuery while using local SQLite databases for ground truth reference execution.
When AI-generated queries run on BigQuery but the reference queries are written in SQLite syntax (e.g., using STRFTIME or SQLite math functions), BigQuery cannot execute the reference queries directly. This update introduces a hybrid bridging mechanism that resolves reference answers from local SQLite files at evaluation time while the generated queries are still executed on BigQuery.
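As a rough illustration of the reference side of this bridge, the sketch below runs a SQLite-dialect query (using STRFTIME) against a local SQLite file and returns its rows. It is not the code in this PR: `fetch_reference_rows`, `build_demo_db`, and the `schools` table are hypothetical names used only for the example, and only the Python standard library `sqlite3` module is assumed.

```python
import os
import sqlite3
import tempfile


def fetch_reference_rows(sqlite_path: str, reference_sql: str) -> list[tuple]:
    """Run a SQLite-dialect reference query locally and return its rows.

    Hypothetical helper for illustration; not the PR's actual bridge code.
    """
    conn = sqlite3.connect(sqlite_path)
    try:
        return conn.execute(reference_sql).fetchall()
    finally:
        conn.close()


def build_demo_db(path: str) -> None:
    """Create a tiny stand-in database (made-up table and rows)."""
    conn = sqlite3.connect(path)
    try:
        conn.execute("CREATE TABLE schools (name TEXT, open_date TEXT)")
        conn.executemany(
            "INSERT INTO schools VALUES (?, ?)",
            [("Alpha High", "1999-08-23"), ("Beta Middle", "2004-01-15")],
        )
        conn.commit()
    finally:
        conn.close()


demo_path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")
build_demo_db(demo_path)

# STRFTIME is SQLite syntax, so this reference query is run locally
# rather than being sent to BigQuery.
reference_rows = fetch_reference_rows(
    demo_path,
    "SELECT name FROM schools WHERE STRFTIME('%Y', open_date) = '1999'",
)
print(reference_rows)
```

In a hybrid run, rows obtained this way would stand in for the reference result that BigQuery could not produce, while the generated query's rows still come from BigQuery.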

Key Changes

evalbench/databases/bigquery.py: Implements the required ensure_database_exists abstract method on BQDB to fulfill the base DB class contract.
evalbench/scorers/sqlite_bridge.py & llmrater.py: Adds conditional ground truth resolution. When golden_error occurs and a hybrid judge is configured, LLMRater automatically fetches the true reference execution rows from the local SQLite database.
hybrid_xa_judge.py: Adds a self-contained Execution Accuracy (XA) evaluator script for PythonScorer. It normalizes across engine data types (Decimal/Int64 vs float), ignores column header differences, and compares rows order-independently.
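The comparison described for hybrid_xa_judge.py can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the script in this PR: the function names are hypothetical, numeric values are coerced to float and rounded to 6 decimal places (an arbitrary tolerance chosen for the example), and rows are compared as multisets, so duplicate rows count. The score is returned as a float, in line with the FLOAT64 score column mentioned under Verification.

```python
from collections import Counter
from decimal import Decimal


def normalize_value(value):
    """Map engine-specific numeric types onto a common float representation."""
    if isinstance(value, bool):
        return value  # bool is a subclass of int; keep it as-is
    if isinstance(value, (int, float, Decimal)):
        return round(float(value), 6)  # assumed tolerance, for illustration
    return value


def normalize_rows(rows):
    """Turn rows into a multiset of value tuples, dropping column names."""
    normalized = Counter()
    for row in rows:
        # Dict rows keep only their values, so header differences are ignored.
        values = row.values() if isinstance(row, dict) else row
        normalized[tuple(normalize_value(v) for v in values)] += 1
    return normalized


def execution_match(generated_rows, reference_rows) -> float:
    """Return 1.0 when both result sets hold the same rows in any order."""
    same = normalize_rows(generated_rows) == normalize_rows(reference_rows)
    return 1.0 if same else 0.0


# Made-up rows: dict rows with Decimal values on one side, plain tuples with
# floats and ints on the other, in a different order.
generated = [{"total": Decimal("3.50"), "n": 2}, {"total": Decimal("1"), "n": 1}]
reference = [(1.0, 1), (3.5, 2)]
print(execution_match(generated, reference))
```

Whether duplicate rows should count toward a mismatch is a design choice; a set-based comparison would treat repeated rows as equivalent, which this sketch deliberately does not.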

Verification

Ran end-to-end benchmark tests on two BIRD evaluation datasets (california_schools and card_games).
Confirmed that ground truth is resolved from the local SQLite reference tables during live BigQuery evaluation runs.
Verified that telemetry data is written to BigQuery cleanly (scores are compatible with FLOAT64).

arieljassan · 2026-06-29T10:26:17Z

/gcbrun

feat(scorers): add hybrid execution accuracy judging and SQLite ground truth resolution

1c07b41

arieljassan requested a review from IsmailMehdi as a code owner June 28, 2026 17:26

Merge branch 'main' into feat/bird-xa-benchmark

bb25c95

arieljassan wants to merge 2 commits into GoogleCloudPlatform:main from arieljassan:feat/bird-xa-benchmark
