DriftBench is a benchmarking toolkit that quantifies how data drift and workload drift influence database behavior. It powers the experiments in the DriftBench paper with reproducible drift generation, query synthesis, and downstream evaluation utilities.
- Unified handling of data and workload drift with shared abstractions.
- Declarative experiment definitions through DriftSpec YAML files.
- Support for CSV, Parquet, and PostgreSQL sources plus downstream workloads.
- End-to-end assets (schemas, templates, workloads, plots) for paper reproduction.
The Figma UI has been ported into a standalone web frontend under web/. It provides the DriftBench pages (Home, Drift Types, Generator, Visualization, Case Studies) and is intended for UI demos and walkthroughs.
```bash
cd web
npm install
npm run dev
```

DriftSpec is the YAML contract that tells DriftBench which drift scenario to build. A single file captures the drift family, data source, variables, and optional workload-generation hooks, making experiments portable and versionable.
- `type`: declares whether the pattern targets data or workloads (single-table or multi-table).
- `data_source`: describes how to access the base data and optionally extract a schema.
- `variables`: parameterizes the drift operators, output paths, and workload knobs.
- Optional sections cover temporal stamps, query generation, and downstream processors.
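A spec with these sections can be sanity-checked with PyYAML before running it. The snippet below is a minimal illustrative sketch over a trimmed spec, not part of the DriftBench API:

```python
import yaml  # PyYAML, listed in the repo's install requirements

SPEC = """
pattern_id: census-cardinality-demo
seed: 42
type:
  family: data
  category: drift
  subtype: single_table
data_source:
  kind: csv
  path: ./data/census_original.csv
variables:
  base_table: census_original
"""

spec = yaml.safe_load(SPEC)

# Check that the core DriftSpec sections are present before handing the
# spec to the runner.
missing = [k for k in ("type", "data_source", "variables") if k not in spec]
if missing:
    raise ValueError(f"DriftSpec is missing sections: {missing}")

print(spec["type"]["family"], spec["data_source"]["kind"])  # data csv
```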
```yaml
pattern_id: census-cardinality-demo
seed: 42
type:
  family: data
  category: drift
  subtype: single_table
data_source:
  kind: csv
  path: ./data/census_original.csv
variables:
  base_table: census_original
  drifts:
    - name: vary_cardinality_scale_1
      drift_type: vary_cardinality
      scale: 1.0
      output_path: ./output/data/cardinality/scale/census_original_cardinality_1.csv
```

Run any specification with:
```bash
python -m driftbench.cli run-yaml <path-to-yaml>
```

DriftSpec cannot serialize Python callables, so use the filter registry to reference a filter by name.
- Register a filter in code:

```python
# my_project/filters.py
from driftbench.core.data.filter_registry import register_filter

@register_filter("age_gt_60")
def age_gt_60(series, config):
    return series > 60
```

- Import the module and reference it in YAML:
```yaml
filter_registry_modules:
  - my_project.filters
variables:
  base_table: census_original
  drifts:
    - name: delete_age_gt_60
      drift_type: selective_deletion
      n: 5000
      filter:
        column: age
        func_name: age_gt_60
      output_path: ./output/data/cardinality/update/census_original_deletion_age_gt_60.csv
```

You can also use simple declarative filters without registration:
```yaml
filter:
  column: timestamp
  min: "2025-07-02T00:00:00"
  max: "2025-07-03T00:00:00"
```

If you already parsed a real database trace into a compact CSV/JSON summary, you can generate a DriftSpec YAML directly:
```bash
python -m driftbench.cli trace-to-spec driftspec/trace_inputs/trace_data_mock.csv driftspec/generated/trace_data_mock.yaml
python -m driftbench.cli trace-to-spec driftspec/trace_inputs/trace_workload_mock.json driftspec/generated/trace_workload_mock.yaml
```

The mock inputs live in `driftspec/trace_inputs/` and show the minimal fields the generator expects.
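Conceptually, the converter maps summary statistics from a trace onto DriftSpec fields. The sketch below illustrates the idea with a hypothetical trace format (the column names `table`, `rows_before`, and `rows_after` are invented for illustration and are not the actual mock schema):

```python
import csv
import io

import yaml  # PyYAML

# Hypothetical trace summary: one row per table with an observed row-count change.
TRACE = """table,rows_before,rows_after
census_original,100000,150000
"""

rows = list(csv.DictReader(io.StringIO(TRACE)))

# Turn each observed cardinality change into a vary_cardinality drift entry.
drifts = []
for r in rows:
    scale = int(r["rows_after"]) / int(r["rows_before"])
    drifts.append({
        "name": f"vary_cardinality_{r['table']}",
        "drift_type": "vary_cardinality",
        "scale": round(scale, 2),
    })

spec = {
    "pattern_id": "trace-derived-demo",
    "type": {"family": "data", "category": "drift", "subtype": "single_table"},
    "variables": {"drifts": drifts},
}
print(yaml.safe_dump(spec, sort_keys=False))
```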
Explore complete templates in `driftspec/examples/`, including:

- Data drift patterns: `demo_data_single.yaml`
- PostgreSQL single-table: `demo_postgres.yaml`
- PostgreSQL multi-table: `demo_postgres_multi.yaml`
- Workload drift: `workload_census.yaml`
These specs create timestamped census data and time growth scenarios under output/data/time_demo/.
- `demo_data_census_timestamp.yaml`: add a `timestamp` column with uniform arrivals.
- `demo_data_census_time_growth.yaml`: combine a base day with an age-skewed day to show a distribution shift.
- `demo_data_census_time_growth_3x.yaml`: 3-day time growth (uniform + periodic + bursty).
- `demo_data_census_time_growth_4x.yaml`: 4-day time growth (adds a long-tail day).
Run them with:
```bash
python -m driftbench.cli run-yaml driftspec/examples/demo_data_census_timestamp.yaml
python -m driftbench.cli run-yaml driftspec/examples/demo_data_census_time_growth.yaml
python -m driftbench.cli run-yaml driftspec/examples/demo_data_census_time_growth_3x.yaml
python -m driftbench.cli run-yaml driftspec/examples/demo_data_census_time_growth_4x.yaml
```

Behind the scenes, the runner loads type handlers registered in `driftbench/spec/types/` and executes them through `driftbench/spec/core.py`.
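The handler-registry pattern the runner uses can be sketched in a few lines. All names below are illustrative, not DriftBench's actual internals:

```python
# Minimal sketch of a type-handler registry: handlers register themselves
# for a (family, subtype) key, and the runner dispatches on the spec's
# `type` section. Names are hypothetical.
HANDLERS = {}

def register_handler(family, subtype):
    def deco(fn):
        HANDLERS[(family, subtype)] = fn
        return fn
    return deco

@register_handler("data", "single_table")
def run_single_table(spec):
    return f"running {len(spec['variables']['drifts'])} drift(s)"

def run_spec(spec):
    key = (spec["type"]["family"], spec["type"]["subtype"])
    if key not in HANDLERS:
        raise KeyError(f"no handler registered for {key}")
    return HANDLERS[key](spec)

demo = {
    "type": {"family": "data", "subtype": "single_table"},
    "variables": {"drifts": [{"name": "vary_cardinality_scale_1"}]},
}
print(run_spec(demo))  # running 1 drift(s)
```

Registering handlers by key keeps new drift families decoupled from the runner: adding a type means adding a module, not editing the dispatch logic.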
Everything required to reproduce the results in the DriftBench paper lives in this repository.
- Input datasets
  - `data/census_original.csv`: baseline census table.
  - `data/census_outliers.csv`: injected outliers for case studies.
  - `data/PG_info.json`: PostgreSQL connection metadata.
- Schemas and templates
  - `output/intermediate/census_original_schema.json`: extracted single-table schema.
  - `output/intermediate/census_original_templates.json`: workload templates for the census data.
  - `output/intermediate/tpcds_schema.json`: inferred multi-table schema from TPC-DS.
- Data drift outputs
  - `output/data/cardinality/scale/`: scaled datasets (e.g., ×0.1, ×1).
  - `output/data/cardinality/update/`: selective deletion scenarios.
  - `output/data/distributional/column/`: column distribution shifts.
  - `output/data/distributional/outlier/`: rare value injections.
- Workload drift outputs
  - `output/workload/parametric/distribution/`: predicate distribution changes.
  - `output/workload/parametric/selectivity/`: workloads with varying selectivity.
  - `output/workload/tpcds_sqls_default.csv`: multi-table workload derived from TPC-DS.
All visuals were generated from notebooks in driftbench/notebooks/ using the assets listed above.
- Varying Cardinality
- Selective Deletions
- Column Distribution Shifts
- Outlier Injection
- Workload Drift
- Q-Error Benchmarks
- Join-Aware Drift Templates
A generated multi-table template looks like this:

```json
{
  "template_id": "T000",
  "cardinality": 1441548,
  "tables": {
    "base": "public.catalog_sales",
    "joins": [
      {
        "type": "FULL JOIN",
        "table": "public.store_sales",
        "condition": "public.catalog_sales.cs_net_profit = public.store_sales.ss_net_profit"
      }
    ]
  },
  "predicate": [
    {
      "column": "public.catalog_sales.cs_warehouse_sk",
      "operator": "<=",
      "type": "numeric",
      "value": "",
      "range": { "min": 1, "max": 5 },
      "selectivity": 0.1
    }
  ],
  "payload": {
    "columns": ["public.catalog_sales.cs_order_number"],
    "aggregation": null,
    "order_by": "public.catalog_sales.cs_order_number",
    "limit": 100
  }
}
```
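A template of this shape can be mechanically rendered into SQL. The sketch below is illustrative, not DriftBench's actual generator; it fills the template's empty predicate `value` with a sample constant from the stated range:

```python
# Render a join-aware template dict (shaped like the example above) into a
# SELECT statement. Illustrative only.
def render_template(t):
    sql = f"SELECT {', '.join(t['payload']['columns'])} FROM {t['tables']['base']}"
    for j in t["tables"]["joins"]:
        sql += f" {j['type']} {j['table']} ON {j['condition']}"
    preds = [f"{p['column']} {p['operator']} {p['value']}" for p in t["predicate"]]
    if preds:
        sql += " WHERE " + " AND ".join(preds)
    if t["payload"].get("order_by"):
        sql += f" ORDER BY {t['payload']['order_by']}"
    if t["payload"].get("limit"):
        sql += f" LIMIT {t['payload']['limit']}"
    return sql

template = {
    "tables": {
        "base": "public.catalog_sales",
        "joins": [{
            "type": "FULL JOIN",
            "table": "public.store_sales",
            "condition": "public.catalog_sales.cs_net_profit = public.store_sales.ss_net_profit",
        }],
    },
    # value 3 is a sample constant drawn from the template's range {1..5}
    "predicate": [{
        "column": "public.catalog_sales.cs_warehouse_sk",
        "operator": "<=",
        "value": 3,
    }],
    "payload": {
        "columns": ["public.catalog_sales.cs_order_number"],
        "order_by": "public.catalog_sales.cs_order_number",
        "limit": 100,
    },
}
print(render_template(template))
```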
- Requires Python 3.9 or later.
- Optional: PostgreSQL instance if you plan to run the database-backed examples.
- Create a virtual environment and install the runtime libraries used across the repo:
```bash
python -m venv .venv
source .venv/bin/activate
pip install pandas numpy scipy scikit-learn PyYAML psycopg2-binary
```

Run commands from the repository root so `python -m driftbench.cli ...` resolves local modules.
1. Extract schemas:

   ```bash
   python -m test.test_csv_schema_extractor
   python -m test.test_postgresql_extractor
   ```

   Outputs appear in `output/intermediate/`.

2. Generate workload templates:

   ```bash
   python -m test.test_template_generator_single_table
   python -m test.test_template_generator_multi_table
   ```

   Templates are written to `output/intermediate/`.

3. Produce drifted data:

   ```bash
   python -m test.test_data_generator_single_table
   ```

   Generated datasets land in `output/data/…`.

4. Produce drifted workloads:

   ```bash
   python -m test.test_sql_generator_single_table
   python -m test.test_sql_generator_multi_table
   ```

   Workloads are saved under `output/workload/`.
These scripts mirror the pipeline used for the case studies and figures. Substitute any of the provided DriftSpec YAML files in driftspec/examples/ to experiment with alternative drift scenarios.