> [!WARNING]
> Dataply is currently in Alpha version. It is experimental and not yet suitable for production use. Internal data structures and file formats are subject to change at any time.
Dataply is a lightweight, high-performance Record Store designed for Node.js. It focuses on storing arbitrary data and providing an auto-generated Primary Key (PK) for ultra-fast retrieval, while supporting core enterprise features like MVCC, WAL, and atomic transactions.
Dataply provides essential features for high-performance data management:
- Identity-Based Access: Manage records through auto-generated Primary Keys for ultra-fast retrieval.
- High-Performance B+Tree: Asynchronous B+Tree structure optimizes both lookups and insertions.
- MVCC & Isolation: Snapshot isolation via Multi-Version Concurrency Control (MVCC) enables non-blocking reads.
- Reliability (WAL): Write-Ahead Logging (WAL) ensures data integrity and automatic crash recovery.
- Atomic Transactions: Full support for ACID-compliant Commit and Rollback operations.
- Efficient Storage: Fixed-size page management with LRU-based page caching and Free List space optimization (Bitmap-based management is deprecated).
- Type Safety: Comprehensive TypeScript definitions for a seamless developer experience.
- Node.js: v18.0.0 or higher
```bash
npm install dataply
```

```ts
import { Dataply } from 'dataply'

// Open Dataply instance
const dataply = new Dataply('./data.db', {
  wal: './data.db.wal'
})

async function main() {
  // Initialization (Required)
  await dataply.init()

  // Insert data
  const pk = await dataply.insert('Hello, Dataply!')
  console.log(`Inserted row with PK: ${pk}`)

  // Update data
  await dataply.update(pk, 'Updated Data')
  console.log(`Updated row with PK: ${pk}`)

  // Select data
  const data = await dataply.select(pk)
  console.log(`Read data: ${data}`)

  // Delete data
  await dataply.delete(pk)
  console.log(`Deleted row with PK: ${pk}`)

  // Close dataply
  await dataply.close()
}

main()
```

Dataply's auto-generated Primary Key (PK) is perfect for use as a unique identifier in web applications.

```ts
import express from 'express'
import { Dataply } from 'dataply'

const app = express()
const db = new Dataply('./web.db')

app.use(express.json())

app.post('/posts', async (req, res) => {
  // Dataply returns a numeric PK immediately after insertion
  const pk = await db.insert(JSON.stringify(req.body))
  res.status(201).json({ id: pk, message: 'Post created!' })
})

app.get('/posts/:id', async (req, res) => {
  const data = await db.select(Number(req.params.id))
  if (!data) return res.status(404).send('Not Found')
  res.json(JSON.parse(data.toString()))
})

// Initialize DB before starting server
db.init().then(() => {
  app.listen(3000, () => console.log('Server running on http://localhost:3000'))
})
```

> [!TIP]
> For more advanced usage like search and optimization, check the Technical Structure Guide and Performance Tuning Guide.
You can group multiple operations into a single unit of work to ensure atomicity. The `withWriteTransaction` method handles the transaction lifecycle automatically, committing on success and rolling back on failure.

```ts
await dataply.withWriteTransaction(async (tx) => {
  const pk = await dataply.insert('Data 1', tx)
  await dataply.update(pk, 'Updated Data', tx)
}) // Persists changes automatically on success or rolls back on failure
```

You can perform atomic operations across multiple Dataply instances using the `GlobalTransaction` class. This safely acquires write locks on all instances sequentially and manages the transaction lifecycle to ensure either all instances commit successfully or all are rolled back.

```ts
import { Dataply, GlobalTransaction } from 'dataply'

const db1 = new Dataply('./db1.db', { wal: './db1.wal' })
const db2 = new Dataply('./db2.db', { wal: './db2.wal' })

await db1.init()
await db2.init()

try {
  await GlobalTransaction.Run([db1, db2], async ([tx1, tx2]) => {
    await db1.insert('Data for DB1', tx1)
    await db2.insert('Data for DB2', tx2)
  })
} catch (error) {
  console.error('Global transaction failed and rolled back.', error)
}
```

If you omit the `tx` argument, Dataply creates an internal transaction for each operation.
- Safety: Atomicity is guaranteed even for single operations.
- Optimization Tip: For bulk operations, use an explicit transaction to significantly reduce I/O overhead and improve performance, as in the sketch below.
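As a rough illustration of that tip, here is a minimal sketch that wraps a loop of inserts in one explicit transaction. It only uses the `insert` and `withWriteTransaction` calls shown earlier; the file paths and record contents are made up for the example.

```ts
import { Dataply } from 'dataply'

const db = new Dataply('./bulk.db', { wal: './bulk.db.wal' })
await db.init()

// One explicit transaction amortizes commit/I/O overhead across all inserts,
// instead of paying it once per row with implicit internal transactions.
const pks = await db.withWriteTransaction(async (tx) => {
  const keys: number[] = []
  for (let i = 0; i < 1000; i++) {
    keys.push(await db.insert(`record-${i}`, tx))
  }
  return keys
})

console.log(`Inserted ${pks.length} rows in a single transaction`)
await db.close()
```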
Opens a database file. If the file does not exist, it creates and initializes a new one.
- `options.pageSize`: Size of a page (Default: 8192, must be a power of 2)
- `options.pageCacheCapacity`: Maximum number of pages to keep in memory (Default: 10000)
- `options.wal`: Path to the WAL file. If omitted, WAL is disabled.
- `options.pagePreallocationCount`: The number of pages to preallocate when creating a new page (Default: 1000)
- `options.walCheckpointThreshold`: The total number of pages written to the WAL before automatically clearing it (Default: 1000)
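For illustration, a minimal sketch that opens a store with each of these options spelled out. The values are simply the documented defaults, and the file paths are arbitrary:

```ts
import { Dataply } from 'dataply'

const db = new Dataply('./data.db', {
  pageSize: 8192,               // must be a power of 2
  pageCacheCapacity: 10000,     // max pages held in the LRU page cache
  wal: './data.db.wal',         // omit this to disable write-ahead logging
  pagePreallocationCount: 1000, // pages preallocated on page creation
  walCheckpointThreshold: 1000  // WAL pages written before auto-clear
})
await db.init()
```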
Initializes the instance. Must be called before performing any CRUD operations.
Inserts new data. Returns the Primary Key (PK) of the created row.
Forcibly inserts data into an overflow page, even if it could fit within a standard data page. Returns the Primary Key (PK).
Inserts multiple rows at once. This is significantly faster than multiple individual inserts as it minimizes internal transaction overhead.
Retrieves data based on the PK. Returns Uint8Array if asRaw is true.
```ts
async selectMany(pks: number[] | Float64Array, asRaw?: boolean, tx?: Transaction): Promise<(string | Uint8Array | null)[]>
```
Retrieves multiple data records in batch based on the provided PKs. This is more efficient than individual select calls for multiple lookups.
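A hedged sketch of batch usage. `selectMany` follows the signature above; the batch-insert method is described in this README, but its exact name (`insertMany` here) is my assumption, so check the package's type definitions before relying on it.

```ts
import { Dataply } from 'dataply'

const db = new Dataply('./batch.db', { wal: './batch.db.wal' })
await db.init()

// Assumed name for the batch-insert method described above.
const pks = await db.insertMany(['row A', 'row B', 'row C'])

// Batch lookup using the documented selectMany signature.
const rows = await db.selectMany(pks)
rows.forEach((row, i) => {
  // string by default, Uint8Array when asRaw is true, null if not found
  console.log(pks[i], row)
})

await db.close()
```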
Updates existing data.
Marks data as deleted.
Deletes multiple rows at once. This is significantly faster than multiple individual deletions as it minimizes internal transaction overhead.
Returns the current metadata of the database, including `pageSize`, `pageCount`, and `rowCount`.
```ts
async withWriteTransaction<T>(callback: (tx: Transaction) => Promise<T>, tx?: Transaction): Promise<T>
```
Executes write operations within a serialized write-lock transaction. When it creates a new internal transaction, it commits automatically on success and rolls back on failure.
```ts
async withReadTransaction<T>(callback: (tx: Transaction) => Promise<T>, tx?: Transaction): Promise<T>
```
Executes read operations within a transaction context.
```ts
async *withReadStreamTransaction<T>(callback: (tx: Transaction) => AsyncGenerator<T>, tx?: Transaction): AsyncGenerator<T>
```
Executes streaming read operations using an async generator in a transaction.
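For instance, a minimal sketch of a streaming read built on the generator signature above. It assumes `select` accepts the transaction handle as its third argument, mirroring the `selectMany` signature; the keys are invented for the example.

```ts
import { Dataply } from 'dataply'

const db = new Dataply('./stream.db', { wal: './stream.db.wal' })
await db.init()
const pks = [await db.insert('a'), await db.insert('b')]

// All yields run under a single read-transaction context.
for await (const row of db.withReadStreamTransaction(async function* (tx) {
  for (const pk of pks) {
    yield await db.select(pk, false, tx) // asRaw = false: return strings
  }
})) {
  console.log(row)
}

await db.close()
```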
Closes the file handles and shuts down safely.
Executes a callback-based global transaction across multiple Dataply instances, locking them sequentially to prevent deadlocks and providing true atomicity across the participating instances. Automatically commits on success and rolls back on failure.
If you want to extend Dataply's functionality, use the DataplyAPI class. Unlike the standard Dataply class, DataplyAPI provides direct access to internal components like PageFileSystem or RowTableEngine, offering much more flexibility for custom implementations.
For a detailed guide and examples on how to extend Dataply using Hooks, see the Extending Dataply Guide.

```ts
import { DataplyAPI } from 'dataply'

class CustomDataply extends DataplyAPI {
  // Leverage internal protected members (pfs, rowTableEngine, etc.)
  async getInternalStats() {
    return {
      pageSize: this.options.pageSize,
      // Custom internal logic here
    }
  }
}

const custom = new CustomDataply('./data.db')
await custom.init()
const stats = await custom.getInternalStats()
console.log(stats)
```

Dataply implements the core principles of high-performance storage systems in a lightweight and efficient manner.
For a detailed visual guide on Dataply's internal architecture, class diagrams, and transaction flow, please refer to the Architecture Guide.
- Layered Architecture: Clear separation of concerns between API, Engine, Page System, and I/O Strategy.
- MVCC & Snapshot Isolation: Separation of read/write paths using Undo Snapshots.
- WAL-based Durability: Sequential log writing for reliability and crash recovery.
- Fixed-size Pages: All data is managed in fixed-size units (default 8KB) called pages.
- Page Cache: Minimizes disk I/O by caching frequently accessed pages in memory (LRU Strategy).
- Dirty Page Tracking: Tracks modified (dirty) pages so they can be flushed to disk efficiently only at commit time.
- Free List Management: Efficiently tracks the allocation and deallocation of pages using a Free List (stack-like structure), facilitating fast space reclamation and reuse. (The older Bitmap-based mechanism is deprecated but remains for backward compatibility). For more details on this mechanism, see Page Reclamation and Reuse Guide.
- Detailed Structure: For technical details on the physical layout, see structure.md.
Dataply uses a Slotted Page architecture to manage records efficiently:
- Pages: Each page consists of a 100-byte header (containing `type`, `id`, `checksum`, etc.) and a body where rows are stored. Slot offsets are stored at the end of the page to track row positions.
- Rows: Each row has a 9-byte header (`flags`, `size`, `PK`) followed by the actual data. Large records automatically trigger Overflow Pages to handle data exceeding page capacity.
- Keys & Identifiers: Uses a 6-byte Primary Key (PK) for logical mapping and a 6-byte Record Identifier (RID) (Slot + Page ID) for direct physical addressing.
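To make the layout concrete, here is a sketch of the sizes above as constants. The totals come straight from this section; the exact byte split between the flags and size fields inside the row header is a guess.

```ts
// Documented sizes (field order and exact splits inside headers are assumptions).
const PAGE_HEADER_SIZE = 100  // type, id, checksum, etc.
const ROW_HEADER_SIZE  = 9    // flags + size + 6-byte PK
const PK_SIZE          = 6    // logical primary key
const RID_SIZE         = 6    // physical address: 4-byte Page ID + 2-byte Slot
const SLOT_OFFSET_SIZE = 2    // slot offsets stored at the end of each page

// With the default 8KB page, the space left for rows and slot offsets:
const DEFAULT_PAGE_SIZE = 8192
const BODY_SIZE = DEFAULT_PAGE_SIZE - PAGE_HEADER_SIZE // 8092 bytes
console.log(BODY_SIZE)
```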
- Non-blocking Reads: Read operations are not blocked by write operations.
- Undo Log: When a transaction modifies a page, it keeps the original data in an Undo Buffer. Other transactions trying to read the same page are served this snapshot to ensure consistent reads.
- Rollback Mechanism: Upon transaction failure, the Undo Buffer is used to instantly restore pages to their original state.
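A contrived sketch of the behavior described above, using the documented transaction helpers. It assumes `select` takes the transaction as its third argument; the key point is that the reader is served the undo snapshot rather than blocking.

```ts
import { Dataply } from 'dataply'

const db = new Dataply('./mvcc.db', { wal: './mvcc.db.wal' })
await db.init()
const pk = await db.insert('old value')

await db.withWriteTransaction(async (tx) => {
  await db.update(pk, 'new value', tx) // uncommitted change
  // A concurrent read is not blocked: it sees the pre-write snapshot.
  const snapshot = await db.withReadTransaction((rtx) => db.select(pk, false, rtx))
  console.log(snapshot) // 'old value'
})

await db.close()
```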
- Performance and Reliability: All changes are recorded in a sequential log file (WAL) before being written to the actual data file. This converts random writes into sequential writes for better performance and ensures data integrity.
- Crash Recovery: When restarting after an unexpected shutdown, Dataply reads the WAL to automatically replay (Redo) any changes that weren't yet reflected in the data file.
- Page-level Locking: Prevents data contention by controlling sequential access to pages through the `LockManager`.
- B+Tree Index: Uses a B+Tree structure guaranteeing $O(\log N)$ performance for maximized PK lookup efficiency.
Dataply is optimized for high-speed data processing. Automated benchmarks are executed on every push to the main branch to ensure consistent performance.
You can view the real-time performance trend and detailed metrics on our Performance Dashboard.
> [!TIP]
> Continuous Monitoring: We use github-action-benchmark to monitor performance changes. For every PR, a summary of the performance impact is automatically posted as a comment to help maintain high efficiency.
As Dataply is currently in Alpha, there are several limitations to keep in mind:
- PK-Only Access: Data can only be retrieved or modified using the Primary Key. No secondary indexes or complex query logic are available yet.
- No SQL Support: This is a low-level Record Store. It does not support SQL or any higher-level query language.
- Memory Usage: The page cache size is controlled by `pageCacheCapacity`, but excessive use of large records should be handled with care.
While JSON is simple, Dataply is designed for scalable and reliable data management:
| Feature | JSON File Approach | Dataply Record Store |
|---|---|---|
| Memory usage | Loads entire file into RAM | Constant memory via page-based I/O |
| Search speed | Linear scan (O(N)) | B+Tree Index lookups (O(log N)) |
| Integrity | High risk of corruption on crash | Protected by WAL and Transactions |
| Concurrency | Single-user only | Multi-user via MVCC & Locking |
Dataply is a low-level record store that provides high-performance ACID persistence. You can use it to build:
- Simple Websites: Create forums or blogs using local files without complex database setup.
- Post Identity Management: The Primary Key (PK) automatically generated and returned during `insert` can be directly used as a unique URL ID for posts (e.g., /post/1024).
- Custom Storage Engines: Implement domain-specific document databases, caching layers, or log collectors.
Absolutely! By leveraging DataplyAPI, you can implement custom indexing (like secondary indexes), query parsers, and complex data schemas. Dataply handles the difficult aspects of transaction management, crash recovery (WAL), and concurrency control, letting you focus on your database's unique features.
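For example, here is a minimal sketch of an in-memory secondary index layered on top of the standard API. The `Map`, the email field, and the persistence strategy are inventions for the example, not part of Dataply.

```ts
import { Dataply } from 'dataply'

const db = new Dataply('./users.db', { wal: './users.db.wal' })
await db.init()

// Hypothetical secondary index: application key (email) -> Dataply PK.
const emailIndex = new Map<string, number>()

async function createUser(email: string, name: string): Promise<number> {
  const pk = await db.insert(JSON.stringify({ email, name }))
  emailIndex.set(email, pk) // a real design would persist this index too
  return pk
}

async function findByEmail(email: string) {
  const pk = emailIndex.get(email)
  if (pk === undefined) return null
  const raw = await db.select(pk)
  return raw ? JSON.parse(raw.toString()) : null
}

await createUser('ada@example.com', 'Ada')
console.log(await findByEmail('ada@example.com')) // { email: ..., name: 'Ada' }
```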
Dataply uses 2-byte slot offsets for data positioning within a page. This allows a theoretical maximum of 65,536 ($2^{16}$) slots per page. Using 4-byte (unsigned int) Page IDs and the default 8KB page size, Dataply can manage up to 32TB of data ($2^{32} \times 8\text{KB}$).
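A quick sanity check of that arithmetic (plain math, no Dataply APIs involved):

```ts
const pageSize = 8192                 // default page size, 2^13 bytes
const maxPages = 2 ** 32              // 4-byte unsigned Page IDs
const maxBytes = maxPages * pageSize  // 2^45 bytes
console.log(maxBytes / 2 ** 40)       // 32 (TB, binary units)
```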
It is optional. While disabling WAL can improve write performance by reducing synchronous I/O, it is highly recommended for any production-like environment to ensure data integrity and automatic recovery after a system crash.
Dataply utilizes a combination of page-level locking and MVCC (Multi-Version Concurrency Control). This allows for Snapshot Isolation, meaning readers can access a consistent state of the data without being blocked by ongoing write operations.
Contributions are welcome! Since Dataply is currently in its Alpha stage, your feedback, bug reports, and feature suggestions are invaluable for shaping the future of this project.
- Report Bugs: If you find a bug, please open an issue with detailed steps to reproduce.
- Suggest Features: Have an idea for a new feature? We'd love to hear it!
- Submit PRs: Feel free to submit Pull Requests for bug fixes or improvements. Please ensure your code follows the existing style and includes appropriate tests.
MIT