A high-performance "Systems-Level" Data Engineering project designed to bridge the gap between low-level hardware constraints (CPU Cache) and high-level Python data pipelines.
The primary objective of this project is to demonstrate and measure how Memory Hierarchy (L1/L2/L3 Cache) and Memory Layout (Row-based vs. Columnar) impact the performance of data processing pipelines.
While most data engineers focus on orchestration (Airflow, dbt), this project focuses on the Compute Layer: understanding the "physics" of data processing to build the fastest, most cost-effective systems possible.
Modern CPUs are incredibly fast, but RAM access is comparatively slow. This widening gap between compute speed and memory bandwidth is known as the "Memory Wall."
- Variant A (Baseline): Demonstrates the "Pointer Chasing" problem where Python objects are scattered in memory, causing the CPU to stall while waiting for data.
- Variant C (Columnar): Showcases Vectorization and SIMD, where packed data allows the CPU to process millions of records in a single "gulp."
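The contrast between Variants A and C can be sketched in a few lines. This is a toy micro-benchmark, not the project's actual harness; timings vary by machine, but the shape of the result is consistent:

```python
import time
import numpy as np

N = 1_000_000

# Variant A-style: a million boxed Python floats scattered across the heap.
rows = [{"price": float(i), "qty": 2.0} for i in range(N)]

t0 = time.perf_counter()
total_a = 0.0
for row in rows:                      # each access chases a pointer to a PyObject
    total_a += row["price"] * row["qty"]
t_a = time.perf_counter() - t0

# Variant C-style: the same values packed into contiguous float64 buffers.
price = np.arange(N, dtype=np.float64)
qty = np.full(N, 2.0)

t0 = time.perf_counter()
total_c = float(np.dot(price, qty))   # one vectorized pass, SIMD-friendly
t_c = time.perf_counter() - t0

assert total_a == total_c             # same answer, very different cost
print(f"loop: {t_a:.3f}s  vectorized: {t_c:.4f}s  speedup: ~{t_a / t_c:.0f}x")
```

The interpreter loop pays a cache miss and dynamic dispatch per element; the `np.dot` call streams two contiguous buffers through the CPU's prefetcher and SIMD units.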
| Stakeholder | The Problem | The Micro-ETL Solution |
|---|---|---|
| Cloud FinOps | Rising compute costs for ETL. | Reducing CPU time by 10x-100x through cache efficiency. |
| FinTech/HFT | Millisecond latencies in market data. | Using contiguous memory layouts to avoid cache misses. |
| Data Engineers | Scaling pipelines for "Big Data." | Knowing when to switch from in-memory (Pandas) to streaming (Polars). |
The system benchmarks several processing variants to quantify these performance differences:
- Variant A — Row-Based Pure Python: Baseline pointer chasing overhead.
- Variant B — NumPy Batched: Contiguous arrays to cut Python dispatch costs.
- Variant C — Pandas Batched: DataFrame vectorization over columnar-friendly Parquet.
- Variant D — Polars Columnar: Arrow-native, cache-friendly columnar execution.
- Variant E — DuckDB SQL: Vectorized SQL engine over Parquet/Arrow buffers.
- Variant F — Semi-Structured JSONL: Nested, variable-width payloads to expose cache misses.
- Variant G — Out-of-Core Streaming: Chunked I/O to handle data that exceeds RAM.
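Variant G's chunked pattern can be sketched in pure Python (hypothetical file and column names; a real pipeline would typically use the chunked readers in pandas or pyarrow instead of the `csv` module):

```python
import csv
import os
import tempfile

# Build a small CSV stand-in for a file that would exceed RAM.
path = os.path.join(tempfile.mkdtemp(), "trades.csv")
with open(path, "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["price", "qty"])
    for i in range(100_000):
        w.writerow([i * 0.5, 2])

def stream_total(path, chunk_rows=10_000):
    """Aggregate in fixed-size chunks so peak memory stays bounded."""
    total = 0.0
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) == chunk_rows:      # flush a full chunk
                total += sum(float(r["price"]) * float(r["qty"]) for r in chunk)
                chunk.clear()
        # flush the final partial chunk
        total += sum(float(r["price"]) * float(r["qty"]) for r in chunk)
    return total

print(stream_total(path))
```

Only one chunk is ever resident, so the working set is `chunk_rows` records regardless of file size; the trade-off is repeated parse overhead per chunk, which the Arrow-based variants avoid.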
This project is built upon decades of research in database internals and high-performance computing.
- S2024 #06 - Vectorized Query Execution (CMU): Why the CPU cache is the most important part of a database.
- Latency Numbers Every Programmer Should Know: Visualizing the massive speed gap between Cache and RAM.
- What is Apache Arrow?: The standard for contiguous, columnar memory.
- The "Memory Wall" Problem: A deep dive into why CPU speed outpaces memory bandwidth.
Built with love for High-Performance Data Engineering.
