A high-performance "Systems-Level" Data Engineering project designed to bridge the gap between low-level hardware constraints (CPU Cache) and high-level Python data pipelines.
The primary objective of this project is to demonstrate and measure how Memory Hierarchy (L1/L2/L3 Cache) and Memory Layout (Row-based vs. Columnar) impact the performance of data processing pipelines.
While most data engineers focus on orchestration (Airflow, dbt), this project focuses on the Compute Layer: understanding the "physics" of data processing to build the fastest, most cost-effective systems possible.
Modern CPUs are incredibly fast, but RAM access is comparatively slow. This widening gap between compute speed and memory bandwidth is known as the "Memory Wall."
- Variant A (Baseline): Demonstrates the "Pointer Chasing" problem where Python objects are scattered in memory, causing the CPU to stall while waiting for data.
- Variant C (Columnar): Showcases Vectorization and SIMD, where packed data allows the CPU to process millions of records in a single "gulp."
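The contrast between Variants A and C can be sketched in a few lines. This is a toy micro-benchmark, not the project's actual harness; timings vary by machine, but the shape of the result is consistent:

```python
import time
import numpy as np

N = 1_000_000

# Variant A-style: a million boxed Python floats scattered across the heap.
rows = [{"price": float(i), "qty": 2.0} for i in range(N)]

t0 = time.perf_counter()
total_a = 0.0
for row in rows:                      # each access chases a pointer to a PyObject
    total_a += row["price"] * row["qty"]
t_a = time.perf_counter() - t0

# Variant C-style: the same values packed into contiguous float64 buffers.
price = np.arange(N, dtype=np.float64)
qty = np.full(N, 2.0)

t0 = time.perf_counter()
total_c = float(np.dot(price, qty))   # one vectorized pass, SIMD-friendly
t_c = time.perf_counter() - t0

assert total_a == total_c             # same answer, very different cost
print(f"loop: {t_a:.3f}s  vectorized: {t_c:.4f}s  speedup: ~{t_a / t_c:.0f}x")
```

The interpreter loop pays a cache miss and dynamic dispatch per element; the `np.dot` call streams two contiguous buffers through the CPU's prefetcher and SIMD units.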
| Stakeholder | The Problem | The Micro-ETL Solution |
|---|---|---|
| Cloud FinOps | Rising compute costs for ETL. | Reducing CPU time by 10x-100x through cache efficiency. |
| FinTech/HFT | Millisecond latencies in market data. | Using contiguous memory layouts to avoid cache misses. |
| Data Engineers | Scaling pipelines for "Big Data." | Knowing when to switch from in-memory (Pandas) to streaming (Polars). |
The system benchmarks several processing variants to quantify these performance differences:
- Variant A — Row-Based Pure Python: Baseline pointer chasing overhead.
- Variant B — NumPy Batched: Contiguous arrays to cut Python dispatch costs.
- Variant C — Pandas Batched: DataFrame vectorization over columnar-friendly Parquet.
- Variant D — Polars Columnar: Arrow-native, cache-friendly columnar execution.
- Variant E — DuckDB SQL: Vectorized SQL engine over Parquet/Arrow buffers.
- Variant F — Semi-Structured JSONL: Nested, variable-width payloads to expose cache misses.
- Variant G — Out-of-Core Streaming: Chunked I/O to handle data that exceeds RAM.
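Variant G's chunked pattern can be sketched in pure Python (hypothetical file and column names; a real pipeline would typically use the chunked readers in pandas or pyarrow instead of the `csv` module):

```python
import csv
import os
import tempfile

# Build a small CSV stand-in for a file that would exceed RAM.
path = os.path.join(tempfile.mkdtemp(), "trades.csv")
with open(path, "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["price", "qty"])
    for i in range(100_000):
        w.writerow([i * 0.5, 2])

def stream_total(path, chunk_rows=10_000):
    """Aggregate in fixed-size chunks so peak memory stays bounded."""
    total = 0.0
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) == chunk_rows:      # flush a full chunk
                total += sum(float(r["price"]) * float(r["qty"]) for r in chunk)
                chunk.clear()
        # flush the final partial chunk
        total += sum(float(r["price"]) * float(r["qty"]) for r in chunk)
    return total

print(stream_total(path))
```

Only one chunk is ever resident, so the working set is `chunk_rows` records regardless of file size; the trade-off is repeated parse overhead per chunk, which the Arrow-based variants avoid.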
This project is built upon decades of research in database internals and high-performance computing.
- S2024 #06 - Vectorized Query Execution (CMU): Why the CPU cache is the most important part of a database.
- Latency Numbers Every Programmer Should Know: Visualizing the massive speed gap between Cache and RAM.
- What is Apache Arrow?: The standard for contiguous, columnar memory.
- The "Memory Wall" Problem: A deep dive into why CPU speed outpaces memory bandwidth.
Built with love for High-Performance Data Engineering.
