From fd35d9f1869bf858add1e4479c587f8561167cb2 Mon Sep 17 00:00:00 2001 From: Jonathan M Hsieh Date: Tue, 14 Apr 2026 11:50:49 -0700 Subject: [PATCH 1/5] Add bulk load/update columns documentation for Geneva Covers the load_columns API for joining pre-computed column data from external sources (Parquet, Lance, IPC) into LanceDB tables by primary key. Includes usage examples, missing key handling, performance tuning (memory sizing, concurrency, checkpointing, multi-pass loads, cost model), and links to SDK reference. Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/docs.json | 1 + docs/geneva/jobs/bulk-load-columns.mdx | 327 +++++++++++++++++++++++++ 2 files changed, 328 insertions(+) create mode 100644 docs/geneva/jobs/bulk-load-columns.mdx diff --git a/docs/docs.json b/docs/docs.json index 023a1a7..0b15e36 100644 --- a/docs/docs.json +++ b/docs/docs.json @@ -176,6 +176,7 @@ "pages": [ "geneva/jobs/contexts", "geneva/jobs/backfilling", + "geneva/jobs/bulk-load-columns", "geneva/jobs/materialized-views", "geneva/jobs/lifecycle", "geneva/jobs/conflicts", diff --git a/docs/geneva/jobs/bulk-load-columns.mdx b/docs/geneva/jobs/bulk-load-columns.mdx new file mode 100644 index 0000000..9b87a62 --- /dev/null +++ b/docs/geneva/jobs/bulk-load-columns.mdx @@ -0,0 +1,327 @@ +--- +title: Bulk Loading & Updating Columns +sidebarTitle: Bulk Load / Update Columns +description: Load or update column data from external sources (Parquet, Lance, IPC) into your LanceDB table using a primary-key join. +icon: columns-3 +--- + +## Overview + +A common scenario is having column data that **already exists** in an external dataset — embeddings from a vendor, features exported from a data warehouse, or columnar data in cloud storage — that you want to load into an existing LanceDB table. + +`load_columns` joins value columns from an external source into your table by primary key. 
It works for both use cases: + +- **Adding new columns:** If the specified columns don't exist in the destination table, they are created automatically. +- **Updating existing columns:** If the columns already exist, matched rows are updated with the source values. Unmatched rows are controlled by the `on_missing` parameter. + +**Destination table (before):** + +| pk | col_a | col_b | +|----|-------|-------| +| 1 | x | 10 | +| 2 | y | 20 | +| 3 | z | 30 | + +**External source (Parquet / Lance / IPC):** + +| pk | embedding | +|----|------------| +| 1 | [.1, .2] | +| 2 | [.3, .4] | +| 3 | [.5, .6] | + +**Destination table (after `load_columns` join on `pk`):** + +| pk | col_a | col_b | embedding | +|----|-------|-------|-----------| +| 1 | x | 10 | [.1, .2] | +| 2 | y | 20 | [.3, .4] | +| 3 | z | 30 | [.5, .6] | + +### When to use + +- **Loading new columns:** Attach pre-computed embeddings from a vendor, or add features exported from Spark/BigQuery as Parquet. +- **Updating existing columns:** Replace outdated embeddings with a newer model's output, or refresh feature values from an updated export. +- **Partial updates:** Update a subset of rows (e.g., only rows whose embeddings were recomputed) while preserving all other values via carry semantics. +- **Format consolidation:** Merge columnar data spread across Parquet files into an existing Lance table. + +## Basic usage + + +```python Python icon="python" +import lancedb + +db = lancedb.connect("my_db") +table = db.open_table("my_table") + +# Add a new embedding column from a Parquet source +table.load_columns( + source="s3://bucket/embeddings/", + pk="document_id", + columns=["embedding"], +) + +# Later, update the same column with refreshed embeddings +table.load_columns( + source="s3://bucket/embeddings_v2/", + pk="document_id", + columns=["embedding"], +) +``` + + +This reads the external Parquet dataset, matches rows by `document_id`, and writes the `embedding` column into the destination table. 
If the column doesn't already exist, it is created. If it does exist, matched rows are updated with the new source values. + +## Supported source formats + +`load_columns` supports three source formats: + +| Format | File extension | Notes | +|---------|---------------|-------| +| Parquet | `.parquet` | Most common; supports cloud storage URIs | +| Lance | `.lance` | Must be a single URI (not a file list) | +| IPC | `.ipc`, `.arrow`, `.feather` | Arrow IPC / Feather format | + +The format is auto-detected from the URI suffix. You can override it with `source_format`: + + +```python Python icon="python" +table.load_columns( + source="/data/embeddings/", + pk="pk", + columns=["embedding"], + source_format="lance", +) +``` + + +## Handling missing keys + +When the source doesn't cover every row in the destination, the `on_missing` parameter controls what happens to unmatched rows: + +| Mode | Behavior | +|------|----------| +| `"carry"` (default) | Keep existing value. NULL if the column is new. | +| `"null"` | Explicitly set to NULL. | +| `"error"` | Raise an error on the first unmatched row. | + + +```python Python icon="python" +# Default: unmatched rows keep their current value +table.load_columns( + source="s3://bucket/partial_embeddings/", + pk="document_id", + columns=["embedding"], + on_missing="carry", +) + +# Strict mode: fail if source doesn't cover all rows +table.load_columns( + source="s3://bucket/embeddings/", + pk="document_id", + columns=["embedding"], + on_missing="error", +) +``` + + +The `carry` mode is particularly important for partial and multi-pass loads — it ensures that previously loaded values are never overwritten. 
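To make the three modes concrete, the behavior described above can be sketched with plain Python dicts. This is purely an illustration of the documented semantics, not the Geneva implementation; the `simulate_load` helper is hypothetical:

```python Python icon="python"
# Illustration only: models the documented on_missing semantics with plain
# dicts. Not the Geneva implementation.

def simulate_load(dest_rows, source_rows, column, on_missing="carry"):
    """dest_rows: {pk: {col: value}}, source_rows: {pk: value}."""
    for pk, row in dest_rows.items():
        if pk in source_rows:
            row[column] = source_rows[pk]   # matched rows are always updated
        elif on_missing == "carry":
            row.setdefault(column, None)    # keep existing value; NULL if new
        elif on_missing == "null":
            row[column] = None              # explicitly set to NULL
        elif on_missing == "error":
            raise ValueError(f"no source row for pk={pk}")
    return dest_rows

dest = {1: {"col_a": "x"}, 2: {"col_a": "y"}, 3: {"col_a": "z"}}
partial_source = {1: [0.1, 0.2], 2: [0.3, 0.4]}  # pk 3 is not covered

simulate_load(dest, partial_source, "embedding")
print(dest[3]["embedding"])  # None: the column is new, so carry yields NULL
```

Running the same helper with `on_missing="error"` raises on pk 3, and a second pass that covers only pk 3 updates it while leaving pks 1 and 2 untouched, which is exactly the property multi-pass loads rely on.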
+ +## Loading multiple columns + +You can load multiple columns in a single call: + + +```python Python icon="python" +table.load_columns( + source="s3://bucket/features/", + pk="document_id", + columns=["embedding", "sentiment_score", "category"], +) +``` + + +## Async API + +For non-blocking execution, use `load_columns_async` which returns a `JobFuture`: + + +```python Python icon="python" +future = table.load_columns_async( + source="s3://bucket/embeddings/", + pk="document_id", + columns=["embedding"], +) + +# Do other work... + +# Block until completion +future.result() +``` + + +## Validation and error handling + +`load_columns` validates inputs before starting the distributed job: + +- The **primary key column** must exist in both source and destination with compatible types. +- The **value columns** must exist in the source dataset. +- **Duplicate primary keys** in the source raise a `ValueError`. +- **NULL primary keys** in the source are excluded with a warning. +- Type mismatches between source and destination raise a `ValueError` during planning. + +## Performance tuning + +### Memory sizing + +`load_columns` builds an in-memory primary-key index from the source dataset before distributing work. The index size depends on the primary key type and row count: + +| PK type | ~Memory per row | 100M rows | 1B rows | +|---------|----------------|-----------|---------| +| int64 | ~8 bytes | ~800 MB | ~8 GB | +| string (avg 32 bytes) | ~32 bytes | ~3.2 GB | ~32 GB | + +The index is broadcast to all worker nodes. Each node holds one zero-copy reference, so the per-node cost equals the index size regardless of how many workers run on that node. + + +If your source has **string primary keys**, expect significantly higher memory usage than integer keys. For very large string-keyed sources, consider multi-pass loads to keep per-pass index size manageable. 
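The sizing table above reduces to simple arithmetic, so a back-of-envelope helper can size the index before you launch a job. A sketch using the table's approximate per-row costs (the constants are those estimates, not measurements, and `estimate_index_bytes` is a hypothetical helper):

```python Python icon="python"
def estimate_index_bytes(num_rows, pk_type="int64", avg_str_len=32):
    # Per-row costs are the approximations from the sizing table above:
    # ~8 bytes per int64 key, ~avg_str_len bytes per string key.
    per_row = 8 if pk_type == "int64" else avg_str_len
    return num_rows * per_row

for rows in (100_000_000, 1_000_000_000):
    int_gb = estimate_index_bytes(rows) / 1e9
    str_gb = estimate_index_bytes(rows, pk_type="string") / 1e9
    print(f"{rows:>13,} rows: int64 ~{int_gb:.1f} GB, string ~{str_gb:.1f} GB")
```

Compare the estimate against per-node memory before choosing between a single-pass and a multi-pass load.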
+ + +### Index build latency + +The index is built with a single sequential scan of the source pk column before any workers start. Expect roughly 30-60 seconds for 1B int64 rows. This is upfront latency — once built, worker lookups are O(1). + +### Concurrency + +The `concurrency` parameter controls the number of worker processes processing destination fragments in parallel. The default is 8. + +- Set this to match your available cluster resources (e.g., number of CPUs or GPUs). +- Higher concurrency speeds up the destination scan phase but does not affect the index build phase. +- If set higher than available resources, Geneva will schedule as many workers as it can. + + +```python Python icon="python" +# Use 16 workers for faster processing +table.load_columns( + source="s3://bucket/embeddings/", + pk="document_id", + columns=["embedding"], + concurrency=16, +) +``` + + +### Checkpoint sizing and fault tolerance + +Bulk load jobs use the same checkpoint infrastructure as UDF backfill. Each batch is checkpointed so that partial results are not lost on job failure and resumed jobs skip already-completed work. + +Adaptive checkpoint sizing works the same as [backfill jobs](/geneva/jobs/backfilling#adaptive-checkpoint-sizing). Smaller initial checkpoints give faster proof-of-life; larger checkpoints reduce commit overhead. + +- `checkpoint_size`: Initial/fixed checkpoint size in rows. +- `min_checkpoint_size` / `max_checkpoint_size`: Bounds for adaptive sizing. When both are equal, adaptive sizing is disabled. +- `checkpoint_interval_seconds`: Target seconds per adaptive checkpoint batch. The adaptive sizer grows or shrinks batch sizes to hit this target. Defaults to 60 seconds for bulk load (longer than the 10-second UDF backfill default because bulk load is I/O-bound and benefits from larger batches that amortize write overhead). 
+ + +If your job is small enough to complete without needing fault tolerance, you can get better performance by effectively disabling checkpoints. Increase `checkpoint_interval_seconds` to a large value and set `min_checkpoint_size` high enough that each worker processes its entire workload in a single batch. This avoids checkpoint write overhead entirely. + + + +```python Python icon="python" +table.load_columns( + source="s3://bucket/embeddings/", + pk="document_id", + columns=["embedding"], + checkpoint_size=5000, + min_checkpoint_size=1000, + max_checkpoint_size=10000, +) +``` + + +### Commit visibility + +The `commit_granularity` parameter controls how many fragments must complete before an intermediate commit makes results visible to readers. Lower values give more frequent visibility but add commit overhead. This is especially useful for long-running jobs on large tables. + + +```python Python icon="python" +# Commit every 10 fragments so readers see incremental progress +table.load_columns( + source="s3://bucket/embeddings/", + pk="document_id", + columns=["embedding"], + commit_granularity=10, +) +``` + + +### Multi-pass loads for large sources + +When the source dataset is too large for a single in-memory primary-key index, split the source files into chunks and run sequential `load_columns` calls. Carry semantics guarantee correctness — each pass writes only rows matching its chunk and preserves all values from prior passes. 
+ + +```python Python icon="python" +import pyarrow.dataset as pads + +# Discover source files (metadata only, no data I/O) +source_files = pads.dataset("s3://bucket/embeddings/", format="parquet").files + +# Split into N chunks and run sequentially +N = 4 +chunk_size = len(source_files) // N +for i in range(N): + chunk = source_files[i * chunk_size : (i + 1) * chunk_size] + table.load_columns( + source=chunk, + pk="document_id", + columns=["embedding"], + ) +``` + + +Each pass reads only its assigned files, so the total source I/O across all passes stays at 1x the full dataset. Per-pass memory cost is `source_size / N`. + +If the source is already partitioned into subdirectories, you can pass each subdirectory URI directly: + + +```python Python icon="python" +for shard in range(4): + table.load_columns( + source=f"s3://bucket/embeddings/shard_{shard}/", + pk="document_id", + columns=["embedding"], + ) +``` + + + +Multi-pass loads must run **sequentially**, not concurrently. Two `load_columns` calls running at the same time against the same column produce an interleaved end state (last-writer-wins per fragment). Use a plain `for` loop, not `concurrent.futures`. + + +#### When to use multi-pass + +| Scenario | Recommendation | +|----------|---------------| +| Source pk index fits in memory | Single call — simplest and fastest | +| Source too large, carry columns are light (new or scalar columns) | Multi-pass with N = 4-8 chunks | +| Source too large, carry columns are heavy (existing embeddings) | Multi-pass with small N, or wait for future partitioned-index support | + +### Cost model + +The dominant cost factors are: + +``` +Dest I/O ≈ N_passes × num_fragments × (pk_bytes + carry_col_bytes) × rows_per_fragment +Source I/O ≈ source_rows × value_col_bytes (constant — each source row is read exactly once) +``` + +Source I/O is fixed regardless of approach. 
The variable cost is **destination I/O**, which scales with `N_passes × carry_col_bytes`: + +- **New columns** (carry volume = 0): Destination I/O is minimal — only pk + rowaddr are read. Multi-pass is cheap. +- **Updating existing wide columns** (e.g., 768-dim embeddings): Each pass must read existing values to carry them. Multiple passes multiply this cost. + +For single-pass loads, destination I/O is read once. For multi-pass with N chunks, it's read N times. + +## Reference + +* [`load_columns` API](https://lancedb.github.io/geneva/api/table/#geneva.table.Table.load_columns) +* [`load_columns_async` API](https://lancedb.github.io/geneva/api/table/#geneva.table.Table.load_columns_async) From 63c3805545413f8b0338a8e32b41e1c4a2b1ad8c Mon Sep 17 00:00:00 2001 From: Jonathan M Hsieh Date: Tue, 14 Apr 2026 12:02:30 -0700 Subject: [PATCH 2/5] Reorganize and tighten bulk load columns docs Move concurrency, commit visibility, and multi-pass loads under performance tuning. Consolidate supported formats, multiple columns, async API, and validation into briefer inline mentions. Remove references to Ray internals. Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/geneva/jobs/bulk-load-columns.mdx | 124 +++++-------------------- 1 file changed, 23 insertions(+), 101 deletions(-) diff --git a/docs/geneva/jobs/bulk-load-columns.mdx b/docs/geneva/jobs/bulk-load-columns.mdx index 9b87a62..9dbeb65 100644 --- a/docs/geneva/jobs/bulk-load-columns.mdx +++ b/docs/geneva/jobs/bulk-load-columns.mdx @@ -47,6 +47,8 @@ A common scenario is having column data that **already exists** in an external d ## Basic usage +Supports Parquet, Lance, and IPC sources. The format is auto-detected from the URI suffix, or can be overridden with `source_format`. You can load one or more columns in a single call. 
+ ```python Python icon="python" import lancedb @@ -70,30 +72,7 @@ table.load_columns( ``` -This reads the external Parquet dataset, matches rows by `document_id`, and writes the `embedding` column into the destination table. If the column doesn't already exist, it is created. If it does exist, matched rows are updated with the new source values. - -## Supported source formats - -`load_columns` supports three source formats: - -| Format | File extension | Notes | -|---------|---------------|-------| -| Parquet | `.parquet` | Most common; supports cloud storage URIs | -| Lance | `.lance` | Must be a single URI (not a file list) | -| IPC | `.ipc`, `.arrow`, `.feather` | Arrow IPC / Feather format | - -The format is auto-detected from the URI suffix. You can override it with `source_format`: - - -```python Python icon="python" -table.load_columns( - source="/data/embeddings/", - pk="pk", - columns=["embedding"], - source_format="lance", -) -``` - +For non-blocking execution, use `load_columns_async` which returns a `JobFuture` — call `.result()` to block until completion. ## Handling missing keys @@ -127,70 +106,8 @@ table.load_columns( The `carry` mode is particularly important for partial and multi-pass loads — it ensures that previously loaded values are never overwritten. -## Loading multiple columns - -You can load multiple columns in a single call: - - -```python Python icon="python" -table.load_columns( - source="s3://bucket/features/", - pk="document_id", - columns=["embedding", "sentiment_score", "category"], -) -``` - - -## Async API - -For non-blocking execution, use `load_columns_async` which returns a `JobFuture`: - - -```python Python icon="python" -future = table.load_columns_async( - source="s3://bucket/embeddings/", - pk="document_id", - columns=["embedding"], -) - -# Do other work... 
- -# Block until completion -future.result() -``` - - -## Validation and error handling - -`load_columns` validates inputs before starting the distributed job: - -- The **primary key column** must exist in both source and destination with compatible types. -- The **value columns** must exist in the source dataset. -- **Duplicate primary keys** in the source raise a `ValueError`. -- **NULL primary keys** in the source are excluded with a warning. -- Type mismatches between source and destination raise a `ValueError` during planning. - ## Performance tuning -### Memory sizing - -`load_columns` builds an in-memory primary-key index from the source dataset before distributing work. The index size depends on the primary key type and row count: - -| PK type | ~Memory per row | 100M rows | 1B rows | -|---------|----------------|-----------|---------| -| int64 | ~8 bytes | ~800 MB | ~8 GB | -| string (avg 32 bytes) | ~32 bytes | ~3.2 GB | ~32 GB | - -The index is broadcast to all worker nodes. Each node holds one zero-copy reference, so the per-node cost equals the index size regardless of how many workers run on that node. - - -If your source has **string primary keys**, expect significantly higher memory usage than integer keys. For very large string-keyed sources, consider multi-pass loads to keep per-pass index size manageable. - - -### Index build latency - -The index is built with a single sequential scan of the source pk column before any workers start. Expect roughly 30-60 seconds for 1B int64 rows. This is upfront latency — once built, worker lookups are O(1). - ### Concurrency The `concurrency` parameter controls the number of worker processes processing destination fragments in parallel. The default is 8. @@ -225,19 +142,6 @@ Adaptive checkpoint sizing works the same as [backfill jobs](/geneva/jobs/backfi If your job is small enough to complete without needing fault tolerance, you can get better performance by effectively disabling checkpoints. 
Increase `checkpoint_interval_seconds` to a large value and set `min_checkpoint_size` high enough that each worker processes its entire workload in a single batch. This avoids checkpoint write overhead entirely. - -```python Python icon="python" -table.load_columns( - source="s3://bucket/embeddings/", - pk="document_id", - columns=["embedding"], - checkpoint_size=5000, - min_checkpoint_size=1000, - max_checkpoint_size=10000, -) -``` - - ### Commit visibility The `commit_granularity` parameter controls how many fragments must complete before an intermediate commit makes results visible to readers. Lower values give more frequent visibility but add commit overhead. This is especially useful for long-running jobs on large tables. @@ -254,6 +158,25 @@ table.load_columns( ``` +### Memory sizing + +`load_columns` builds an in-memory primary-key index from the source dataset before distributing work. The index size depends on the primary key type and row count: + +| PK type | ~Memory per row | 100M rows | 1B rows | +|---------|----------------|-----------|---------| +| int64 | ~8 bytes | ~800 MB | ~8 GB | +| string (avg 32 bytes) | ~32 bytes | ~3.2 GB | ~32 GB | + +The index is broadcast to all worker nodes. Each node holds one zero-copy reference, so the per-node cost equals the index size regardless of how many workers run on that node. + + +If your source has **string primary keys**, expect significantly higher memory usage than integer keys. For very large string-keyed sources, consider multi-pass loads to keep per-pass index size manageable. + + +### Index build latency + +The index is built with a single sequential scan of the source pk column before any workers start. Expect roughly 30-60 seconds for 1B int64 rows. This is upfront latency — once built, worker lookups are O(1). 
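The memory table above also tells you whether a single pass will fit, or how many chunks the multi-pass approach described next would need. A hedged sketch: `passes_needed` and the budget value are illustrative, not a Geneva API, and the per-row cost is the table's approximation:

```python Python icon="python"
import math

def passes_needed(source_rows, per_row_bytes, memory_budget_bytes):
    # Smallest N such that each pass's pk index fits the budget,
    # using the per-row approximations from the sizing table above.
    index_bytes = source_rows * per_row_bytes
    return max(1, math.ceil(index_bytes / memory_budget_bytes))

# 1B string keys (~32 bytes/row) against a 12 GB per-node budget:
print(passes_needed(1_000_000_000, 32, 12_000_000_000))  # -> 3
```

With 3 passes, each pass builds an index of roughly 32 GB / 3 ≈ 10.7 GB, within the assumed 12 GB budget.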
+ ### Multi-pass loads for large sources When the source dataset is too large for a single in-memory primary-key index, split the source files into chunks and run sequential `load_columns` calls. Carry semantics guarantee correctness — each pass writes only rows matching its chunk and preserves all values from prior passes. @@ -302,8 +225,7 @@ Multi-pass loads must run **sequentially**, not concurrently. Two `load_columns` | Scenario | Recommendation | |----------|---------------| | Source pk index fits in memory | Single call — simplest and fastest | -| Source too large, carry columns are light (new or scalar columns) | Multi-pass with N = 4-8 chunks | -| Source too large, carry columns are heavy (existing embeddings) | Multi-pass with small N, or wait for future partitioned-index support | +| Source too large for single index | Multi-pass — choose N so that `source_size / N` fits in memory | ### Cost model From aacbfd83c14291b2a25ebca249a32835961e2904 Mon Sep 17 00:00:00 2001 From: Jonathan M Hsieh Date: Tue, 14 Apr 2026 12:08:15 -0700 Subject: [PATCH 3/5] Tighten bulk load docs for practitioner audience Drop cost model and index build latency sections. Merge memory sizing into multi-pass where it's actionable. Trim concurrency, checkpointing, and commit visibility to essentials. Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/geneva/jobs/bulk-load-columns.mdx | 72 +++++--------------------- 1 file changed, 14 insertions(+), 58 deletions(-) diff --git a/docs/geneva/jobs/bulk-load-columns.mdx b/docs/geneva/jobs/bulk-load-columns.mdx index 9dbeb65..fdb3ded 100644 --- a/docs/geneva/jobs/bulk-load-columns.mdx +++ b/docs/geneva/jobs/bulk-load-columns.mdx @@ -110,15 +110,10 @@ The `carry` mode is particularly important for partial and multi-pass loads — ### Concurrency -The `concurrency` parameter controls the number of worker processes processing destination fragments in parallel. The default is 8. 
- -- Set this to match your available cluster resources (e.g., number of CPUs or GPUs). -- Higher concurrency speeds up the destination scan phase but does not affect the index build phase. -- If set higher than available resources, Geneva will schedule as many workers as it can. +The `concurrency` parameter controls the number of worker processes. The default is 8 — set this to match your available cluster resources. ```python Python icon="python" -# Use 16 workers for faster processing table.load_columns( source="s3://bucket/embeddings/", pk="document_id", @@ -128,27 +123,23 @@ table.load_columns( ``` -### Checkpoint sizing and fault tolerance - -Bulk load jobs use the same checkpoint infrastructure as UDF backfill. Each batch is checkpointed so that partial results are not lost on job failure and resumed jobs skip already-completed work. +### Checkpointing -Adaptive checkpoint sizing works the same as [backfill jobs](/geneva/jobs/backfilling#adaptive-checkpoint-sizing). Smaller initial checkpoints give faster proof-of-life; larger checkpoints reduce commit overhead. +Bulk load jobs checkpoint each batch for fault tolerance, using the same infrastructure as [backfill jobs](/geneva/jobs/backfilling#adaptive-checkpoint-sizing). Key parameters: -- `checkpoint_size`: Initial/fixed checkpoint size in rows. -- `min_checkpoint_size` / `max_checkpoint_size`: Bounds for adaptive sizing. When both are equal, adaptive sizing is disabled. -- `checkpoint_interval_seconds`: Target seconds per adaptive checkpoint batch. The adaptive sizer grows or shrinks batch sizes to hit this target. Defaults to 60 seconds for bulk load (longer than the 10-second UDF backfill default because bulk load is I/O-bound and benefits from larger batches that amortize write overhead). +- `checkpoint_interval_seconds`: Target seconds per checkpoint batch (default 60s). The adaptive sizer grows or shrinks batch sizes to hit this target. 
+- `min_checkpoint_size` / `max_checkpoint_size`: Bounds for adaptive sizing. -If your job is small enough to complete without needing fault tolerance, you can get better performance by effectively disabling checkpoints. Increase `checkpoint_interval_seconds` to a large value and set `min_checkpoint_size` high enough that each worker processes its entire workload in a single batch. This avoids checkpoint write overhead entirely. +If your job is small enough to complete without needing fault tolerance, you can get better performance by effectively disabling checkpoints. Increase `checkpoint_interval_seconds` to a large value and set `min_checkpoint_size` high enough that each worker processes its entire workload in a single batch. ### Commit visibility -The `commit_granularity` parameter controls how many fragments must complete before an intermediate commit makes results visible to readers. Lower values give more frequent visibility but add commit overhead. This is especially useful for long-running jobs on large tables. +For long-running jobs, `commit_granularity` controls how many fragments complete before an intermediate commit makes partial results visible to readers. ```python Python icon="python" -# Commit every 10 fragments so readers see incremental progress table.load_columns( source="s3://bucket/embeddings/", pk="document_id", @@ -158,28 +149,18 @@ table.load_columns( ``` -### Memory sizing +### Multi-pass loads for large sources + +`load_columns` builds an in-memory primary-key index from the source. If the source is too large to fit in memory, split it into chunks and run sequential calls. Carry semantics guarantee correctness across passes. -`load_columns` builds an in-memory primary-key index from the source dataset before distributing work. 
The index size depends on the primary key type and row count: +Index memory depends on primary key type: | PK type | ~Memory per row | 100M rows | 1B rows | |---------|----------------|-----------|---------| | int64 | ~8 bytes | ~800 MB | ~8 GB | | string (avg 32 bytes) | ~32 bytes | ~3.2 GB | ~32 GB | -The index is broadcast to all worker nodes. Each node holds one zero-copy reference, so the per-node cost equals the index size regardless of how many workers run on that node. - - -If your source has **string primary keys**, expect significantly higher memory usage than integer keys. For very large string-keyed sources, consider multi-pass loads to keep per-pass index size manageable. - - -### Index build latency - -The index is built with a single sequential scan of the source pk column before any workers start. Expect roughly 30-60 seconds for 1B int64 rows. This is upfront latency — once built, worker lookups are O(1). - -### Multi-pass loads for large sources - -When the source dataset is too large for a single in-memory primary-key index, split the source files into chunks and run sequential `load_columns` calls. Carry semantics guarantee correctness — each pass writes only rows matching its chunk and preserves all values from prior passes. +Choose N so that `source_size / N` fits in memory: ```python Python icon="python" @@ -201,9 +182,7 @@ for i in range(N): ``` -Each pass reads only its assigned files, so the total source I/O across all passes stays at 1x the full dataset. Per-pass memory cost is `source_size / N`. - -If the source is already partitioned into subdirectories, you can pass each subdirectory URI directly: +Each pass reads only its assigned files, so total source I/O stays at 1x. If the source is already partitioned into subdirectories, pass each URI directly: ```python Python icon="python" @@ -217,32 +196,9 @@ for shard in range(4): -Multi-pass loads must run **sequentially**, not concurrently. 
Two `load_columns` calls running at the same time against the same column produce an interleaved end state (last-writer-wins per fragment). Use a plain `for` loop, not `concurrent.futures`. +Multi-pass loads must run **sequentially**, not concurrently. Two `load_columns` calls running at the same time against the same column produce an interleaved end state. Use a plain `for` loop, not `concurrent.futures`. -#### When to use multi-pass - -| Scenario | Recommendation | -|----------|---------------| -| Source pk index fits in memory | Single call — simplest and fastest | -| Source too large for single index | Multi-pass — choose N so that `source_size / N` fits in memory | - -### Cost model - -The dominant cost factors are: - -``` -Dest I/O ≈ N_passes × num_fragments × (pk_bytes + carry_col_bytes) × rows_per_fragment -Source I/O ≈ source_rows × value_col_bytes (constant — each source row is read exactly once) -``` - -Source I/O is fixed regardless of approach. The variable cost is **destination I/O**, which scales with `N_passes × carry_col_bytes`: - -- **New columns** (carry volume = 0): Destination I/O is minimal — only pk + rowaddr are read. Multi-pass is cheap. -- **Updating existing wide columns** (e.g., 768-dim embeddings): Each pass must read existing values to carry them. Multiple passes multiply this cost. - -For single-pass loads, destination I/O is read once. For multi-pass with N chunks, it's read N times. 
- ## Reference * [`load_columns` API](https://lancedb.github.io/geneva/api/table/#geneva.table.Table.load_columns) From 9a259f413810b9724794b823b5d80c2b2687ad37 Mon Sep 17 00:00:00 2001 From: Jonathan M Hsieh Date: Tue, 14 Apr 2026 12:20:20 -0700 Subject: [PATCH 4/5] Add Geneva 0.13.0 version badge to bulk load columns page Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/geneva/jobs/bulk-load-columns.mdx | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/geneva/jobs/bulk-load-columns.mdx b/docs/geneva/jobs/bulk-load-columns.mdx index fdb3ded..d195c48 100644 --- a/docs/geneva/jobs/bulk-load-columns.mdx +++ b/docs/geneva/jobs/bulk-load-columns.mdx @@ -5,6 +5,8 @@ description: Load or update column data from external sources (Parquet, Lance, I icon: columns-3 --- +Introduced in Geneva 0.13.0 + ## Overview A common scenario is having column data that **already exists** in an external dataset — embeddings from a vendor, features exported from a data warehouse, or columnar data in cloud storage — that you want to load into an existing LanceDB table. 
From 3b6df342648b1037a0ece45afef0a73c0f63d8d1 Mon Sep 17 00:00:00 2001 From: Jonathan M Hsieh Date: Wed, 15 Apr 2026 13:28:28 -0700 Subject: [PATCH 5/5] Fix chunk-split example to cover all source files Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/geneva/jobs/bulk-load-columns.mdx | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/geneva/jobs/bulk-load-columns.mdx b/docs/geneva/jobs/bulk-load-columns.mdx index d195c48..c44adb5 100644 --- a/docs/geneva/jobs/bulk-load-columns.mdx +++ b/docs/geneva/jobs/bulk-load-columns.mdx @@ -173,9 +173,9 @@ source_files = pads.dataset("s3://bucket/embeddings/", format="parquet").files # Split into N chunks and run sequentially N = 4 -chunk_size = len(source_files) // N +total = len(source_files) for i in range(N): - chunk = source_files[i * chunk_size : (i + 1) * chunk_size] + chunk = source_files[i * total // N : (i + 1) * total // N] table.load_columns( source=chunk, pk="document_id",