
Cache-Aware Micro-ETL Benchmark

A high-performance "Systems-Level" Data Engineering project designed to bridge the gap between low-level hardware constraints (CPU Cache) and high-level Python data pipelines.


The Goal

The primary objective of this project is to demonstrate and measure how Memory Hierarchy (L1/L2/L3 Cache) and Memory Layout (Row-based vs. Columnar) impact the performance of data processing pipelines.

While most data engineers focus on orchestration (Airflow, dbt), this project focuses on the Compute Layer: understanding the "physics" of data processing to build the fastest, most cost-effective systems possible.


Technical Impact & Problem Solved

Modern CPUs are incredibly fast, but RAM is relatively slow. This creates the "Memory Wall."

  • Variant A (Baseline): Demonstrates the "Pointer Chasing" problem where Python objects are scattered in memory, causing the CPU to stall while waiting for data.
  • Variant C (Columnar): Showcases Vectorization and SIMD, where packed data allows the CPU to process millions of records in a single "gulp."
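The contrast between the two variants can be sketched with the standard library alone (the names, sizes, and schema below are illustrative, not the project's actual benchmark code):

```python
import random
import timeit
from array import array

N = 100_000
random.seed(0)

# Variant A style: an "array of structs" -- each record is a dict of
# boxed Python objects scattered across the heap (pointer chasing).
rows = [{"price": random.random(), "qty": random.random()} for _ in range(N)]

# Variant C style: a "struct of arrays" -- one packed, contiguous
# C buffer of doubles per column (cache-friendly sequential reads).
prices = array("d", (r["price"] for r in rows))
qtys = array("d", (r["qty"] for r in rows))

def row_based():
    # Every access dereferences a dict and two heap-allocated floats.
    return sum(r["price"] * r["qty"] for r in rows)

def columnar():
    # Iteration walks two flat buffers in order; no per-record lookups.
    return sum(p * q for p, q in zip(prices, qtys))

t_a = timeit.timeit(row_based, number=3)
t_c = timeit.timeit(columnar, number=3)
print(f"row-based: {t_a:.3f}s   columnar: {t_c:.3f}s")
```

Both functions compute the same aggregate; only the memory layout differs. Note that in pure Python the gap stays modest, because each element is re-boxed on access; the full benefit of the packed layout appears once NumPy, Polars, or DuckDB operates on the buffer directly with vectorized kernels.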

Real-World Business Value

Stakeholder     The Problem                            The Micro-ETL Solution
Cloud FinOps    Rising compute costs for ETL.          Reducing CPU time by 10x-100x through cache efficiency.
FinTech/HFT     Millisecond latencies in market data.  Using contiguous memory layouts to avoid cache misses.
Data Engineers  Scaling pipelines for "Big Data".      Knowing when to switch from in-memory (Pandas) to streaming (Polars).

Project Architecture

The system compares seven processing variants to show how memory layout and execution model shift performance:

  1. Variant A — Row-Based Pure Python: Baseline pointer chasing overhead.
  2. Variant B — NumPy Batched: Contiguous arrays to cut Python dispatch costs.
  3. Variant C — Pandas Batched: DataFrame vectorization on columnar-friendly Parquet.
  4. Variant D — Polars Columnar: Arrow-native, cache-friendly columnar execution.
  5. Variant E — DuckDB SQL: Vectorized SQL engine over Parquet/Arrow buffers.
  6. Variant F — Semi-Structured JSONL: Nested, variable-width payloads to expose cache misses.
  7. Variant G — Out-of-Core Streaming: Chunked I/O to handle data that exceeds RAM.

Evidence & Further Reading

This project is built upon decades of research in database internals and high-performance computing.


Built with love for High-Performance Data Engineering.
