> [!WARNING]
> Dataply is currently in Alpha version. It is experimental and not yet suitable for production use. Internal data structures and file formats are subject to change at any time.
Dataply is a lightweight, high-performance Record Store designed for Node.js. It focuses on storing arbitrary data and providing an auto-generated Primary Key (PK) for ultra-fast retrieval, while supporting core enterprise features like MVCC, WAL, and atomic transactions.
Dataply provides essential features for high-performance data management:
- Identity-Based Access: Manage records through auto-generated Primary Keys for ultra-fast retrieval.
- High-Performance B+Tree: Asynchronous B+Tree structure optimizes both lookups and insertions.
- MVCC & Isolation: Snapshot isolation via Multi-Version Concurrency Control (MVCC) enables non-blocking reads.
- Reliability (WAL): Write-Ahead Logging (WAL) ensures data integrity and automatic crash recovery.
- Atomic Transactions: Full support for ACID-compliant Commit and Rollback operations.
- Efficient Storage: Fixed-size page management with LRU-based page caching and Free List space optimization (Bitmap-based management is deprecated).
- Type Safety: Comprehensive TypeScript definitions for a seamless developer experience.
- Node.js: v18.0.0 or higher
```bash
npm install dataply
```

```ts
import { Dataply } from 'dataply'

// Open Dataply instance
const dataply = new Dataply('./data.db', {
  wal: './data.db.wal'
})

async function main() {
  // Initialization (Required)
  await dataply.init()

  // Insert data
  const pk = await dataply.insert('Hello, Dataply!')
  console.log(`Inserted row with PK: ${pk}`)

  // Update data
  await dataply.update(pk, 'Updated Data')
  console.log(`Updated row with PK: ${pk}`)

  // Select data
  const data = await dataply.select(pk)
  console.log(`Read data: ${data}`)

  // Delete data
  await dataply.delete(pk)
  console.log(`Deleted row with PK: ${pk}`)

  // Close dataply
  await dataply.close()
}

main()
```

Dataply's auto-generated Primary Key (PK) is perfect for use as a unique identifier in web applications.

```ts
import express from 'express'
import { Dataply } from 'dataply'

const app = express()
const db = new Dataply('./web.db')

app.use(express.json())

app.post('/posts', async (req, res) => {
  // Dataply returns a numeric PK immediately after insertion
  const pk = await db.insert(JSON.stringify(req.body))
  res.status(201).json({ id: pk, message: 'Post created!' })
})

app.get('/posts/:id', async (req, res) => {
  const data = await db.select(Number(req.params.id))
  if (!data) return res.status(404).send('Not Found')
  res.json(JSON.parse(data.toString()))
})

// Initialize DB before starting server
db.init().then(() => {
  app.listen(3000, () => console.log('Server running on http://localhost:3000'))
})
```

> [!TIP]
> For more advanced usage like search and optimization, check the Technical Structure Guide and Performance Tuning Guide.
You can group multiple operations into a single unit of work to ensure atomicity. The `withWriteTransaction` method handles the transaction lifecycle automatically, committing on success and rolling back on failure.

```ts
await dataply.withWriteTransaction(async (tx) => {
  const pk = await dataply.insert('Data 1', tx)
  await dataply.update(pk, 'Updated Data', tx)
}) // Persists changes automatically on success or rolls back on failure
```

You can perform atomic operations across multiple Dataply instances using the `GlobalTransaction` class. This safely acquires write locks on all instances sequentially and manages the transaction lifecycle to ensure either all instances commit successfully or all are rolled back.

```ts
import { Dataply, GlobalTransaction } from 'dataply'

const db1 = new Dataply('./db1.db', { wal: './db1.wal' })
const db2 = new Dataply('./db2.db', { wal: './db2.wal' })

await db1.init()
await db2.init()

try {
  await GlobalTransaction.Run([db1, db2], async ([tx1, tx2]) => {
    await db1.insert('Data for DB1', tx1)
    await db2.insert('Data for DB2', tx2)
  })
} catch (error) {
  console.error('Global transaction failed and rolled back.', error)
}
```

If you omit the `tx` argument, Dataply creates an internal transaction for each operation.
- Safety: Atomicity is guaranteed even for single operations.
- Optimization Tip: For bulk operations, use an explicit transaction to significantly reduce I/O overhead and improve performance, as in the sketch below.
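As a rough illustration of that tip, here is a minimal sketch that wraps a loop of inserts in one explicit transaction. It only uses the `insert` and `withWriteTransaction` calls shown earlier; the file paths and record contents are made up for the example.

```ts
import { Dataply } from 'dataply'

const db = new Dataply('./bulk.db', { wal: './bulk.db.wal' })
await db.init()

// One explicit transaction amortizes commit/I/O overhead across all inserts,
// instead of paying it once per row with implicit internal transactions.
const pks = await db.withWriteTransaction(async (tx) => {
  const keys: number[] = []
  for (let i = 0; i < 1000; i++) {
    keys.push(await db.insert(`record-${i}`, tx))
  }
  return keys
})

console.log(`Inserted ${pks.length} rows in a single transaction`)
await db.close()
```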
Opens a database file. If the file does not exist, it creates and initializes a new one.
- `options.pageSize`: Size of a page (Default: 8192, must be a power of 2)
- `options.pageCacheCapacity`: Maximum number of pages to keep in memory (Default: 10000)
- `options.wal`: Path to the WAL file. If omitted, WAL is disabled.
- `options.pagePreallocationCount`: The number of pages to preallocate when creating a new page (Default: 1000)
- `options.walCheckpointThreshold`: The total number of pages written to the WAL before automatically clearing it (Default: 1000)
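For illustration, a minimal sketch that opens a store with each of these options spelled out. The values are simply the documented defaults, and the file paths are arbitrary:

```ts
import { Dataply } from 'dataply'

const db = new Dataply('./data.db', {
  pageSize: 8192,               // must be a power of 2
  pageCacheCapacity: 10000,     // max pages held in the LRU page cache
  wal: './data.db.wal',         // omit this to disable write-ahead logging
  pagePreallocationCount: 1000, // pages preallocated on page creation
  walCheckpointThreshold: 1000  // WAL pages written before auto-clear
})
await db.init()
```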
Initializes the instance. Must be called before performing any CRUD operations.
Inserts new data. Returns the Primary Key (PK) of the created row.
Forcibly inserts data into an overflow page, even if it could fit within a standard data page. Returns the Primary Key (PK).
Inserts multiple rows at once. This is significantly faster than multiple individual inserts as it minimizes internal transaction overhead.
Retrieves data based on the PK. Returns Uint8Array if asRaw is true.
```ts
async selectMany(pks: number[] | Float64Array, asRaw?: boolean, tx?: Transaction): Promise<(string | Uint8Array | null)[]>
```
Retrieves multiple data records in batch based on the provided PKs. This is more efficient than individual select calls for multiple lookups.
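A hedged sketch of batch usage. `selectMany` follows the signature above; the batch-insert method is described in this README, but its exact name (`insertMany` here) is my assumption, so check the package's type definitions before relying on it.

```ts
import { Dataply } from 'dataply'

const db = new Dataply('./batch.db', { wal: './batch.db.wal' })
await db.init()

// Assumed name for the batch-insert method described above.
const pks = await db.insertMany(['row A', 'row B', 'row C'])

// Batch lookup using the documented selectMany signature.
const rows = await db.selectMany(pks)
rows.forEach((row, i) => {
  // string by default, Uint8Array when asRaw is true, null if not found
  console.log(pks[i], row)
})

await db.close()
```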
Updates existing data.
Marks data as deleted.
Deletes multiple rows at once. This is significantly faster than multiple individual deletions as it minimizes internal transaction overhead.
Returns the current metadata of the database, including `pageSize`, `pageCount`, and `rowCount`.
```ts
async withWriteTransaction<T>(callback: (tx: Transaction) => Promise<T>, tx?: Transaction): Promise<T>
```
Executes write operations within a serialized write-lock transaction. When it creates a new internal transaction, it commits automatically on success and rolls back on failure.
```ts
async withReadTransaction<T>(callback: (tx: Transaction) => Promise<T>, tx?: Transaction): Promise<T>
```
Executes read operations within a transaction context.
```ts
async *withReadStreamTransaction<T>(callback: (tx: Transaction) => AsyncGenerator<T>, tx?: Transaction): AsyncGenerator<T>
```
Executes streaming read operations using an async generator in a transaction.
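For instance, a minimal sketch of a streaming read built on the generator signature above. It assumes `select` accepts the transaction handle as its third argument, mirroring the `selectMany` signature; the keys are invented for the example.

```ts
import { Dataply } from 'dataply'

const db = new Dataply('./stream.db', { wal: './stream.db.wal' })
await db.init()
const pks = [await db.insert('a'), await db.insert('b')]

// All yields run under a single read-transaction context.
for await (const row of db.withReadStreamTransaction(async function* (tx) {
  for (const pk of pks) {
    yield await db.select(pk, false, tx) // asRaw = false: return strings
  }
})) {
  console.log(row)
}

await db.close()
```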
Closes the file handles and shuts down safely.
Executes a callback-based global transaction across multiple Dataply instances, locking them sequentially to prevent deadlocks and providing true atomicity across the participating instances. Automatically commits on success and rolls back on failure.
If you want to extend Dataply's functionality, use the DataplyAPI class. Unlike the standard Dataply class, DataplyAPI provides direct access to internal components like PageFileSystem or RowTableEngine, offering much more flexibility for custom implementations.
For a detailed guide and examples on how to extend Dataply using Hooks, see the Extending Dataply Guide.

```ts
import { DataplyAPI } from 'dataply'

class CustomDataply extends DataplyAPI {
  // Leverage internal protected members (pfs, rowTableEngine, etc.)
  async getInternalStats() {
    return {
      pageSize: this.options.pageSize,
      // Custom internal logic here
    }
  }
}

const custom = new CustomDataply('./data.db')
await custom.init()
const stats = await custom.getInternalStats()
console.log(stats)
```

Dataply implements the core principles of high-performance storage systems in a lightweight and efficient manner.
For a detailed visual guide on Dataply's internal architecture, class diagrams, and transaction flow, please refer to the Architecture Guide.
- Layered Architecture: Clear separation of concerns between API, Engine, Page System, and I/O Strategy.
- MVCC & Snapshot Isolation: Separation of read/write paths using Undo Snapshots.
- WAL-based Durability: Sequential log writing for reliability and crash recovery.
- Fixed-size Pages: All data is managed in fixed-size units (default 8KB) called pages.
- Page Cache: Minimizes disk I/O by caching frequently accessed pages in memory (LRU Strategy).
- Dirty Page Tracking: Tracks modified (dirty) pages so they can be flushed to disk efficiently only at commit time.
- Free List Management: Efficiently tracks the allocation and deallocation of pages using a Free List (stack-like structure), facilitating fast space reclamation and reuse. (The older Bitmap-based mechanism is deprecated but remains for backward compatibility). For more details on this mechanism, see Page Reclamation and Reuse Guide.
- Detailed Structure: For technical details on the physical layout, see structure.md.
Dataply uses a Slotted Page architecture to manage records efficiently:
- Pages: Each page consists of a 100-byte header (containing `type`, `id`, `checksum`, etc.) and a body where rows are stored. Slot offsets are stored at the end of the page to track row positions.
- Rows: Each row has a 9-byte header (`flags`, `size`, `PK`) followed by the actual data. Large records automatically trigger Overflow Pages to handle data exceeding page capacity.
- Keys & Identifiers: Uses a 6-byte Primary Key (PK) for logical mapping and a 6-byte Record Identifier (RID) (Slot + Page ID) for direct physical addressing.
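To make the layout concrete, here is a sketch of the sizes above as constants. The totals come straight from this section; the exact byte split between the flags and size fields inside the row header is a guess.

```ts
// Documented sizes (field order and exact splits inside headers are assumptions).
const PAGE_HEADER_SIZE = 100  // type, id, checksum, etc.
const ROW_HEADER_SIZE  = 9    // flags + size + 6-byte PK
const PK_SIZE          = 6    // logical primary key
const RID_SIZE         = 6    // physical address: 4-byte Page ID + 2-byte Slot
const SLOT_OFFSET_SIZE = 2    // slot offsets stored at the end of each page

// With the default 8KB page, the space left for rows and slot offsets:
const DEFAULT_PAGE_SIZE = 8192
const BODY_SIZE = DEFAULT_PAGE_SIZE - PAGE_HEADER_SIZE // 8092 bytes
console.log(BODY_SIZE)
```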
- Non-blocking Reads: Read operations are not blocked by write operations.
- Undo Log: When a transaction modifies a page, it keeps the original data in an Undo Buffer. Other transactions trying to read the same page are served this snapshot to ensure consistent reads.
- Rollback Mechanism: Upon transaction failure, the Undo Buffer is used to instantly restore pages to their original state.
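A contrived sketch of the behavior described above, using the documented transaction helpers. It assumes `select` takes the transaction as its third argument; the key point is that the reader is served the undo snapshot rather than blocking.

```ts
import { Dataply } from 'dataply'

const db = new Dataply('./mvcc.db', { wal: './mvcc.db.wal' })
await db.init()
const pk = await db.insert('old value')

await db.withWriteTransaction(async (tx) => {
  await db.update(pk, 'new value', tx) // uncommitted change
  // A concurrent read is not blocked: it sees the pre-write snapshot.
  const snapshot = await db.withReadTransaction((rtx) => db.select(pk, false, rtx))
  console.log(snapshot) // 'old value'
})

await db.close()
```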
- Performance and Reliability: All changes are recorded in a sequential log file (WAL) before being written to the actual data file. This converts random writes into sequential writes for better performance and ensures data integrity.
- Crash Recovery: When restarting after an unexpected shutdown, Dataply reads the WAL to automatically replay (Redo) any changes that weren't yet reflected in the data file.
- Page-level Locking: Prevents data contention by controlling sequential access to pages through the `LockManager`.
- B+Tree Index: Uses a B+Tree structure guaranteeing $O(\log N)$ performance for maximized PK lookup efficiency.
Dataply is optimized for high-speed data processing. Automated benchmarks are executed on every push to the main branch to ensure consistent performance.
You can view the real-time performance trend and detailed metrics on our Performance Dashboard.
> [!TIP]
> Continuous Monitoring: We use github-action-benchmark to monitor performance changes. For every PR, a summary of the performance impact is automatically posted as a comment to help maintain high efficiency.
As Dataply is currently in Alpha, there are several limitations to keep in mind:
- PK-Only Access: Data can only be retrieved or modified using the Primary Key. No secondary indexes or complex query logic are available yet.
- No SQL Support: This is a low-level Record Store. It does not support SQL or any higher-level query language.
- Memory Usage: The page cache size is controlled by `pageCacheCapacity`, but excessive use of large records should be handled with care.
While JSON is simple, Dataply is designed for scalable and reliable data management:
| Feature | JSON File Approach | Dataply Record Store |
|---|---|---|
| Memory usage | Loads entire file into RAM | Constant memory via page-based I/O |
| Search speed | Linear scan (O(N)) | B+Tree Index lookups (O(log N)) |
| Integrity | High risk of corruption on crash | Protected by WAL and Transactions |
| Concurrency | Single-user only | Multi-user via MVCC & Locking |
Dataply is a low-level record store that provides high-performance ACID persistence. You can use it to build:
- Simple Websites: Create forums or blogs using local files without complex database setup.
- Post Identity Management: The Primary Key (PK) automatically generated and returned during `insert` can be directly used as a unique URL ID for posts (e.g., /post/1024).
- Custom Storage Engines: Implement domain-specific document databases, caching layers, or log collectors.
Absolutely! By leveraging DataplyAPI, you can implement custom indexing (like secondary indexes), query parsers, and complex data schemas. Dataply handles the difficult aspects of transaction management, crash recovery (WAL), and concurrency control, letting you focus on your database's unique features.
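For example, here is a minimal sketch of an in-memory secondary index layered on top of the standard API. The `Map`, the email field, and the persistence strategy are inventions for the example, not part of Dataply.

```ts
import { Dataply } from 'dataply'

const db = new Dataply('./users.db', { wal: './users.db.wal' })
await db.init()

// Hypothetical secondary index: application key (email) -> Dataply PK.
const emailIndex = new Map<string, number>()

async function createUser(email: string, name: string): Promise<number> {
  const pk = await db.insert(JSON.stringify({ email, name }))
  emailIndex.set(email, pk) // a real design would persist this index too
  return pk
}

async function findByEmail(email: string) {
  const pk = emailIndex.get(email)
  if (pk === undefined) return null
  const raw = await db.select(pk)
  return raw ? JSON.parse(raw.toString()) : null
}

await createUser('ada@example.com', 'Ada')
console.log(await findByEmail('ada@example.com')) // { email: ..., name: 'Ada' }
```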
Dataply uses 2-byte slot offsets for data positioning within a page. This allows a theoretical maximum of 65,536 ($2^{16}$) slots per page. Using 4-byte (unsigned int) Page IDs and the default 8KB page size, Dataply can manage up to 32TB of data ($2^{32} \times 8\text{KB}$).
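A quick sanity check of that arithmetic (plain math, no Dataply APIs involved):

```ts
const pageSize = 8192                 // default page size, 2^13 bytes
const maxPages = 2 ** 32              // 4-byte unsigned Page IDs
const maxBytes = maxPages * pageSize  // 2^45 bytes
console.log(maxBytes / 2 ** 40)       // 32 (TB, binary units)
```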
It is optional. While disabling WAL can improve write performance by reducing synchronous I/O, it is highly recommended for any production-like environment to ensure data integrity and automatic recovery after a system crash.
Dataply utilizes a combination of page-level locking and MVCC (Multi-Version Concurrency Control). This allows for Snapshot Isolation, meaning readers can access a consistent state of the data without being blocked by ongoing write operations.
Contributions are welcome! Since Dataply is currently in its Alpha stage, your feedback, bug reports, and feature suggestions are invaluable for shaping the future of this project.
- Report Bugs: If you find a bug, please open an issue with detailed steps to reproduce.
- Suggest Features: Have an idea for a new feature? We'd love to hear it!
- Submit PRs: Feel free to submit Pull Requests for bug fixes or improvements. Please ensure your code follows the existing style and includes appropriate tests.
MIT