# GitHub Issue Report - Simplified Version
Title
pyseekdb embedded mode crashes with SIGSEGV (exit 139) when importing large collections (>1000 documents)
Environment information
- OS: macOS Darwin 25.3.0 (Linux also tested)
- Python: 3.11.8 (3.14 is not compatible)
- pyseekdb: latest version (pip install)
- Mode: Embedded mode (`pyseekdb.Client(path="./data/db")`)
Problem description
When using pyseekdb's embedded mode to import data, the Python process crashes and returns exit code 139 (SIGSEGV) when the collection data size exceeds about 1000 documents.
Key symptoms
- ❌ No Python exception is raised (the process dies directly with SIGSEGV)
- ❌ No error message is printed
- ❌ Inconsistent: the same data sometimes succeeds and sometimes fails
- ✅ Small datasets (<1000 items) import successfully
Minimal reproduction code
```python
import pyseekdb

# Create client
client = pyseekdb.Client(path="./test.db")
collection = client.get_or_create_collection(name="test_collection")

# Generate test data (2000 items)
documents = [f"Test document {i}" for i in range(2000)]
ids = [f"id_{i}" for i in range(2000)]
metadatas = [{"index": i} for i in range(2000)]

# Batch import
batch_size = 50
for i in range(0, len(documents), batch_size):
    print(f"Batch {i//batch_size + 1}...")
    collection.add(
        ids=ids[i:i+batch_size],
        documents=documents[i:i+batch_size],
        metadatas=metadatas[i:i+batch_size]
    )

print(f"Total: {collection.count()}")
```

Run: `python3 test_crash.py`
Expected: 2000 items successfully imported
Actual: Crash on batch 10-20 (exit code 139)
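For what it's worth, the slicing in the batch loop above can be checked in isolation without pyseekdb, which suggests the fault lies in the native layer rather than in the batching itself. A minimal standalone check of the same range-based slicing:

```python
def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items,
    mirroring the range-based slicing in the repro script."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

ids = [f"id_{i}" for i in range(2000)]
chunks = list(batches(ids, 50))
print(len(chunks))                  # 40 batches
print(sum(len(c) for c in chunks))  # 2000 items, none lost or duplicated
```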
Actual test data
| Collection name | Number of documents | Status | Crash point |
|---|---|---|---|
| edd_codes | 3,600 | ✅ Success | - |
| edd_data_items | 600 | ✅ Success | - |
| edd_domains | 600 | ✅ Success | - |
| physical_model_table | 868 | ✅ Success | - |
| logical_model_attribute | 24,669 | ❌ Crash | ~500-1000 |
| physical_model_field | 29,322 | ❌ Crash | ~250 |
Attempted workarounds (all failed)
- Downgrade Python 3.14 → 3.11 (fixed an initial crash, but not this one)
- Reduce batch_size: 1000 → 100 → 50 → 10
- Use a global client (as in the official example)
- Convert numpy types to native Python types
- Chunked import (a fresh process for every 1000 items)
- Checkpoint-based recovery

**All scenarios still crash when importing >1000 items.**
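As a diagnostic aid, the chunked-import attempt above can be hardened so that a SIGSEGV in one chunk does not kill the whole run: execute each chunk in a child interpreter and inspect its exit status. A minimal sketch, where the child payload is a stand-in for the actual per-chunk pyseekdb import:

```python
import subprocess
import sys

def run_chunk(payload: str) -> bool:
    """Run one chunk's import in a fresh interpreter; return True on success.
    subprocess reports death-by-signal as a negative returncode (-11 for
    SIGSEGV); 139 is the shell's encoding of the same signal (128 + 11)."""
    proc = subprocess.run([sys.executable, "-c", payload])
    if proc.returncode in (-11, 139):
        print("chunk crashed with SIGSEGV; caller can retry or skip it")
        return False
    return proc.returncode == 0

# Stand-in payloads: one healthy chunk, one that segfaults itself.
ok = run_chunk("print('chunk imported')")
crashed = run_chunk("import os, signal; os.kill(os.getpid(), signal.SIGSEGV)")
print(ok, crashed)  # True False
```

The parent process survives the child's crash, so a wrapper like this can at least record which chunk triggers the fault.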
Error log
```
$ python import.py
[seekdb] seekdb has opened
Reading Excel file...
Loaded 24669 rows
Building data...
Importing to collection 'test'...
Progress: 50/24669
Progress: 100/24669
Progress: 150/24669
Progress: 200/24669
[Process exits with code 139 - no error message]
```

Expected behavior
It should be possible to reliably import 10,000+ documents, as stated in the official documentation.
Actual behavior
Crash when exceeding ~1000 entries, making embedded mode unusable for medium to large datasets in production environments.
Temporary workaround
Using server mode may be more stable (untested):
```
seekdb server start --port 2881
export SEEKDB_HOST=localhost
export SEEKDB_PORT=2881
python import.py
```

Impact
This issue blocks any production deployments that use pyseekdb embedded mode to process real-world datasets.
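If server mode does prove stable, the import script could select its mode from the environment variables shown above and fall back to embedded mode when they are unset. A sketch of only the mode-selection logic (the `SEEKDB_PATH` variable and the default path are assumptions; the server-mode client constructor is not shown because it is not documented in this report):

```python
import os

def choose_mode():
    """Decide server vs. embedded mode from the environment.
    Returns ("server", host, port) or ("embedded", path, None).
    SEEKDB_PATH is a hypothetical override for the embedded DB path."""
    host = os.environ.get("SEEKDB_HOST")
    port = os.environ.get("SEEKDB_PORT")
    if host and port:
        return ("server", host, int(port))
    return ("embedded", os.environ.get("SEEKDB_PATH", "./data/db"), None)

mode, target, port = choose_mode()
print(mode, target, port)
```

This keeps one import script usable in both modes while the embedded-mode crash is investigated.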
Request:
- Confirm that this is a known issue of embedded mode
- Provide a stable import solution for large data sets
- Fix the SIGSEGV problem in embedded mode, or
- Clearly state in the documentation that embedded mode is not suitable for production environments/large data sets