feat: Support cross-database evaluation with SQLite ground truth #465

Open
arieljassan wants to merge 2 commits into GoogleCloudPlatform:main from arieljassan:feat/bird-xa-benchmark

Conversation

@arieljassan

Member

Overview

This PR adds support for running the BIRD benchmark (and similar cross-database SQL evaluations) on BigQuery while using local SQLite databases for ground truth reference execution.
When AI-generated queries run on BigQuery but the reference queries are written in SQLite syntax (e.g., using STRFTIME or SQLite math functions), BigQuery cannot execute the reference queries directly. This update introduces a hybrid bridge that resolves reference answers from local SQLite files while the generated queries are evaluated on BigQuery.
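
As a rough illustration of the local ground truth lookup described above, the sketch below runs a SQLite-dialect reference query against a local database file using only the standard library. The helper names (`fetch_reference_rows`, `build_demo_db`), the demo table, and the file name are hypothetical and are not taken from this PR; the actual bridge in `sqlite_bridge.py` may be structured differently.

```python
import os
import sqlite3
import tempfile


def fetch_reference_rows(db_path, golden_sql):
    """Run a SQLite-dialect reference query against a local database file.

    Hypothetical helper for illustration only; not the PR's actual API.
    """
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(golden_sql).fetchall()
    finally:
        conn.close()


def build_demo_db(db_path):
    """Create a tiny stand-in database (made-up table and rows)."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("CREATE TABLE schools (name TEXT, opened TEXT)")
        conn.executemany(
            "INSERT INTO schools VALUES (?, ?)",
            [("Alpha", "1998-09-01"), ("Beta", "2004-01-15")],
        )
        conn.commit()
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.sqlite")
    build_demo_db(demo_path)
    # STRFTIME here is SQLite syntax, so this query is resolved locally
    # rather than being sent to BigQuery.
    reference_rows = fetch_reference_rows(
        demo_path,
        "SELECT name, STRFTIME('%Y', opened) FROM schools ORDER BY name",
    )

print(reference_rows)
```

The rows returned this way would then serve as the reference side of the comparison, while the generated query's rows come from BigQuery.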

Key Changes

  • evalbench/databases/bigquery.py: Implements the required ensure_database_exists abstract method on BQDB to fulfill the base DB class contract.
  • evalbench/scorers/sqlite_bridge.py & llmrater.py: Adds conditional ground truth resolution. When a golden_error occurs and a hybrid judge is configured, LLMRater automatically fetches the reference execution rows from the local SQLite database.
  • hybrid_xa_judge.py: Adds a self-contained Execution Accuracy (XA) evaluator script for PythonScorer. It normalizes across engine data types (Decimal/Int64 vs float), ignores column header differences, and compares rows order-independently.
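
The comparison rules listed for hybrid_xa_judge.py can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the script in this PR: the function names, the rounding tolerance of 6 decimal places, and the choice to treat duplicate rows as significant (multiset comparison) are all assumptions made for illustration.

```python
from collections import Counter
from decimal import Decimal


def normalize_value(value, ndigits=6):
    """Map numeric types from either engine onto a rounded float.

    Assumption: 6 decimal places is an illustrative tolerance only.
    Note that bool is a subclass of int, so True normalizes to 1.0.
    """
    if isinstance(value, (int, float, Decimal)):
        return round(float(value), ndigits)
    return value


def rows_match(generated_rows, reference_rows):
    """Compare two result sets positionally, ignoring row order.

    Rows are plain tuples, so column headers never enter the comparison.
    Counter keeps duplicate rows significant (an assumption of this sketch).
    """
    generated = Counter(
        tuple(normalize_value(v) for v in row) for row in generated_rows
    )
    reference = Counter(
        tuple(normalize_value(v) for v in row) for row in reference_rows
    )
    return generated == reference


# Decimal/int values on one side, floats on the other, in a different order.
bigquery_rows = [(Decimal("1.50"), "a"), (2, "b")]
sqlite_rows = [(2.0, "b"), (1.5, "a")]
print(rows_match(bigquery_rows, sqlite_rows))
```

Whether duplicate rows should count is a design choice; a set-based comparison would be more lenient than the multiset comparison used here.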

Verification

  • Ran end-to-end benchmark tests on two BIRD evaluation datasets (california_schools and card_games).
  • Confirmed functional ground truth resolution from local SQLite reference tables during live BigQuery evaluation runs.
  • Verified that telemetry data is written cleanly to BigQuery (scores are compatible with FLOAT64).

@arieljassan arieljassan requested a review from IsmailMehdi as a code owner June 28, 2026 17:26
@arieljassan

Member Author

/gcbrun
