
fifo-tool-datasets

fifo-tool-datasets provides standardized adapters to convert plain .dat files into formats compatible with LLM training — including Hugging Face datasets.Dataset and JSON message arrays.

It supports both:

  • ✅ A Python SDK — for structured loading and conversion
  • ✅ A CLI — to upload/download .dat files to/from the Hugging Face Hub

.dat files are plain-text datasets designed for LLM fine-tuning. They come in three styles:

  • 💬 sqna (single-turn): prompt-response pairs
  • 🧠 conversation (multi-turn): role-tagged chat sessions
  • ⚙️ dsl (structured): system → input → DSL output triplets

These files are human-editable, diffable, and ideal for version control β€” especially during dataset development and iteration.

This tool enables a complete round-trip workflow:

  1. Create and edit a .dat file locally
  2. Convert and upload it as a training-ready Hugging Face datasets.Dataset
  3. Later, download and deserialize it back into .dat for further edits

This gives you the best of both worlds:

  • ✍️ Easy editing and version control via .dat
  • 🚀 Compatibility with HF pipelines using load_dataset()

See format examples below in each adapter section.




🎯 Project Status & Audience

🚧 Work in Progress — This project is in early development. 🚧

This is a personal project developed and maintained by a solo developer.
Contributions, ideas, and feedback are welcome, but development is driven by personal time and priorities.

Designed for individual users experimenting with LLM fine-tuning and creating their own fine-tuning datasets by hand.

No official release or pre-release has been published yet. The tool is provided for preview and experimentation.
Use at your own risk.


πŸ“ Dataset Formats

Format Description
.dat Editable plain-text format with tags (e.g. >, <, ---)
Dataset Hugging Face datasets.Dataset object β€” used for fine-tuning
wide_dataset Flattened Dataset with one row per message β€” format depends on adapter
json A list of messages dictionaries
hub A DatasetDict with train, validation, and test splits

All datasets uploaded to the Hub — if not already split — are automatically divided into train, validation, and test partitions using the wide format.
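The splitting logic itself is internal to the CLI, but the idea can be sketched in plain Python. This is a minimal sketch assuming a seeded shuffle followed by ratio slicing; the function name split_rows and the exact shuffle strategy are illustrative, not the library's API:

```python
import random

def split_rows(rows, ratios=(0.7, 0.15, 0.15), seed=42):
    """Shuffle rows with a fixed seed, then slice into train/validation/test."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # deterministic for a given seed
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }

splits = split_rows(list(range(100)))
print({k: len(v) for k, v in splits.items()})  # {'train': 70, 'validation': 15, 'test': 15}
```

A fixed seed (compare the CLI's --seed flag) makes the partition reproducible across runs.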


πŸ” Conversion Matrix

From \ To dataset wide_dataset dat hub json
dataset β€” βœ… 🧩 β€” β€”
wide_dataset β€” β€” βœ… β€” βœ…
dat 🧩 βœ… β€” 🧩 β€”
hub πŸ§©πŸ“¦ βœ…πŸ“¦ β€” β€” β€”
json β€” β€” β€” β€” β€”

Legend:

  • ✅ direct: single-step conversion. Hover to view the function name.
  • 🧩 indirect: composed of helper conversions. Hover to view the function name.
  • 📦 returns dict: result is a DatasetDict.

📦 Installation

Install both the CLI and SDK in one step:

git clone https://github.com/gh9869827/fifo-tool-datasets.git
cd fifo-tool-datasets
python3 -m pip install -e .

This enables the fifo-tool-datasets command.


🚀 CLI Usage

πŸ› οΈ Command Reference

fifo-tool-datasets <command> [options]

upload

Upload a local .dat file or directory to the Hugging Face Hub.

fifo-tool-datasets upload <src> <dst> [--adapter <adapter>] --commit-message <msg> [--seed <int>]

The source must be an existing local path and the destination must be in username/repo format. If --adapter is omitted when uploading a directory, it is read from .hf_meta.json.

download

Download a dataset from the Hugging Face Hub.

fifo-tool-datasets download <src> <dst> [--adapter <adapter>] [-y]

The source must be in username/repo format. The destination can be a .dat file (merged) or a directory (one .dat per split). When downloading to a directory and --adapter is omitted, the CLI tries to read the adapter from the local .hf_meta.json file (if present from a previous download).

push

fifo-tool-datasets push [<dir>] --commit-message <msg> [-y]

Push the dataset from the specified directory to the Hub using the repo_id and adapter from .hf_meta.json. Defaults to the current directory.

pull

fifo-tool-datasets pull [<dir>]

Pull the dataset into the specified directory using the repo_id and adapter from .hf_meta.json. Defaults to the current directory.

split

Split a single .dat file into train, validation, and test splits.

fifo-tool-datasets split <src> --adapter <adapter> [--to <dir>] [--split-ratio <train> <val> <test>] [-y]

Default split ratio is [0.7, 0.15, 0.15] if --split-ratio is omitted.

merge

Recombine split .dat files into a single dataset.

fifo-tool-datasets merge <dir> --adapter <adapter> [--to <file>] [-y]

sort

Sort the samples of a DSL .dat file by their full content: system prompt, user input, and assistant response. Sorting is done in place, meaning the original file is overwritten with the sorted result.

You can provide either a single file or a directory. If a directory is given, all .dat files within it will be sorted in place.

fifo-tool-datasets sort <path> [--adapter dsl]

Currently, only the dsl adapter is supported. If the --adapter flag is omitted, it defaults to dsl automatically.
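The sort amounts to ordering samples by the (system, input, output) triple. A minimal sketch in plain Python; sort_dsl_samples is an illustrative name, not the adapter's API:

```python
def sort_dsl_samples(samples):
    """Order DSL samples by full content: system prompt, then user input,
    then assistant response."""
    return sorted(samples, key=lambda s: (s["system"], s["in"], s["out"]))

samples = [
    {"system": "B", "in": "x", "out": "1"},
    {"system": "A", "in": "y", "out": "2"},
    {"system": "A", "in": "x", "out": "3"},
]
print([s["out"] for s in sort_dsl_samples(samples)])  # ['3', '2', '1']
```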

info

Show record counts and metadata for a .dat file or a split dataset directory. If <path> is omitted, the current directory is used by default.

fifo-tool-datasets info [<path>]

diff

Compare local files against the Hugging Face Hub.

fifo-tool-datasets diff [<dir>] [--type head|cache]

By default, --type head compares against the latest files on the Hub.
Use --type cache to compare against the commit recorded in .hf_meta.json.

Remote splits are converted to temporary .dat files using the adapter so the comparison aligns with the local format.
Downloaded files are saved temporarily under the target directory and removed afterward.

🔄 Documentation and Metadata Sync

When using upload or download with a directory source or target, the CLI automatically:

  • Uploads README.md and LICENSE files from the source directory if they exist
  • Downloads README.md and LICENSE files from the Hub if they are present in the remote repository
  • Creates a .hf_meta.json file when downloading, storing the adapter, repo ID, download timestamp, and commit hash
  • Updates .hf_meta.json with the new commit hash and timestamp after a successful upload
  • Uses that metadata to verify the remote commit before upload
  • Auto-detects the adapter on download if --adapter isn't provided
  • Blocks uploads if the remote has changed since download, unless -y is passed to override
  • Diffs local vs. remote documentation files and skips the upload if content has not changed

This ensures smooth syncing of documentation while minimizing the risk of overwriting others' changes.

⚠️ Note: While this provides lightweight safety for collaborative workflows, it does not offer Git-level guarantees. For strict versioning, conflict resolution, or rollback, consider using git clone and managing pushes manually.


💡 Command examples

# Upload
fifo-tool-datasets upload dsl.dat username/my-dataset --adapter dsl --commit-message "init"

# Download (auto-detected adapter)
fifo-tool-datasets download username/my-dataset ./dsl_dir

# Download (explicit adapter override)
fifo-tool-datasets download username/my-dataset ./dsl_dir --adapter dsl

# Push updated data
fifo-tool-datasets push ./dsl_dir --commit-message "update"

# Pull latest version
fifo-tool-datasets pull ./dsl_dir

# Split
fifo-tool-datasets split dsl.dat --adapter dsl --to split_dsl

# Merge
fifo-tool-datasets merge split_dsl --adapter dsl --to full.dsl.dat

# Sort
fifo-tool-datasets sort dsl.dat --adapter dsl

# Info
fifo-tool-datasets info split_dsl

# Diff against the hub
fifo-tool-datasets diff split_dsl --type head

📦 SDK Usage

from fifo_tool_datasets.sdk.hf_dataset_adapters.dsl import DSLAdapter

adapter = DSLAdapter()

# Upload to the Hugging Face Hub
adapter.from_dat_to_hub(
    "dsl.dat",
    "username/my-dataset",
    commit_message="initial upload"
)

# Download from the Hub as a DatasetDict (train/validation/test)
splits = adapter.from_hub_to_dataset_dict("username/my-dataset")

# Access splits for fine-tuning
train_dataset = splits["train"]
test_dataset = splits["test"]

# You can now use train_dataset / test_dataset to fine-tune your LLM
# e.g., with Hugging Face Transformers Trainer, SFTTrainer, etc.

# You can also directly load from a local .dat file
dataset = adapter.from_dat_to_dataset("dsl.dat")

# Convert to structured JSON format
json_records = adapter.from_wide_dataset_to_json(dataset)

🔌 Available Adapters

🧠 ConversationAdapter

.dat

---
$
You are a helpful assistant.
>
Hi
<
Hello!
---

Wide Format

[
  {"id_conversation": 0, "id_message": 0, "role": "system", "content": "You are a helpful assistant."},
  {"id_conversation": 0, "id_message": 1, "role": "user",   "content": "Hi"},
  {"id_conversation": 0, "id_message": 2, "role": "assistant", "content": "Hello!"}
]

JSON Format

[
  {
    "messages": [
      {"role": "system",    "content": "You are a helpful assistant."},
      {"role": "user",      "content": "Hi"},
      {"role": "assistant", "content": "Hello!"}
    ]
  }
]
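
The mapping from the wide format to the JSON format is essentially a group-by on id_conversation. A hedged sketch of that transformation; wide_to_json is an illustrative helper, not the adapter's actual method:

```python
from itertools import groupby

def wide_to_json(rows):
    """Group wide-format rows by id_conversation and emit one messages array
    per conversation, ordered by id_message."""
    ordered = sorted(rows, key=lambda r: (r["id_conversation"], r["id_message"]))
    return [
        {"messages": [{"role": r["role"], "content": r["content"]} for r in group]}
        for _, group in groupby(ordered, key=lambda r: r["id_conversation"])
    ]

wide = [
    {"id_conversation": 0, "id_message": 0, "role": "system", "content": "You are a helpful assistant."},
    {"id_conversation": 0, "id_message": 1, "role": "user", "content": "Hi"},
    {"id_conversation": 0, "id_message": 2, "role": "assistant", "content": "Hello!"},
]
print(wide_to_json(wide))
```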

💬 SQNAAdapter

.dat

>What is 2+2?
<4

Wide Format

[
  {"in": "What is 2+2?", "out": "4"}
]

JSON Format

[
  {
    "messages": [
      {"role": "user", "content": "What is 2+2?"},
      {"role": "assistant", "content": "4"}
    ]
  }
]
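
Parsing the sqna .dat format reduces to pairing each > line with the < line that follows it. A simplified sketch; the real SQNAAdapter enforces stricter validation, and parse_sqna is an illustrative name:

```python
def parse_sqna(text):
    """Pair alternating '>' (prompt) and '<' (response) lines into wide rows."""
    pairs = []
    prompt = None
    for line in text.splitlines():
        if line.startswith(">"):
            prompt = line[1:].strip()
        elif line.startswith("<"):
            if prompt is None:
                raise ValueError("'<' line without a preceding '>' line")
            pairs.append({"in": prompt, "out": line[1:].strip()})
            prompt = None
    return pairs

print(parse_sqna(">What is 2+2?\n<4"))  # [{'in': 'What is 2+2?', 'out': '4'}]
```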

βš™οΈ DSLAdapter

.dat

---
$ You are a precise DSL parser.
> today at 5:30PM
< SET_TIME(TODAY, 17, 30)
---

Reasoning Support

The DSL format optionally supports a ? (reasoning) section to capture chain-of-thought or intermediate reasoning steps:

---
$ You are a precise DSL parser.
> today at 5:30PM
?
base=TODAY
time.hour=17
time.minute=30
< SET_TIME(TODAY, 17, 30)
---

The ? section:

  • Is optional and appears between the > (input) and < (output) sections
  • Follows the same formatting rules as other sections (single-line or multi-line)
  • Is stored in the wide format under the reasoning field
  • Is stored inline in the assistant message's metadata.reasoning field in JSON format

Multi-line entries are supported and can be freely mixed with single-line ones. A space after the marker on single-line entries is optional:

---
$
multi-line system
prompt
> single-line input
? single-line reasoning
<single-line output
---

To reuse the previous system prompt across multiple samples, use ...:

---
$ first prompt
> q1
< a1
---
$ ...
> q2
< a2
---
$
...
> q3
< a3
---

Any $ block that contains only ... — either directly after the $ or on the following line — will inherit the most recent explicitly defined system prompt.

  • At least one non-... system prompt is required in the file.
  • When generating .dat files, consecutive identical system prompts are automatically collapsed into $ ....
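
The inheritance rule can be sketched as a single pass that remembers the last explicit prompt; resolve_system_prompts is a hypothetical helper, not the adapter's API:

```python
def resolve_system_prompts(prompts):
    """Replace '...' entries with the most recent explicit system prompt."""
    resolved = []
    last = None
    for p in prompts:
        if p.strip() == "...":
            if last is None:
                raise ValueError("'...' used before any explicit system prompt")
            p = last  # inherit the previous explicit prompt
        else:
            last = p
        resolved.append(p)
    return resolved

print(resolve_system_prompts(["first prompt", "...", "..."]))
# ['first prompt', 'first prompt', 'first prompt']
```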

Wide Format

[
  {"system": "You are a precise DSL parser.", "in": "today at 5:30PM", "out": "SET_TIME(TODAY, 17, 30)"}
]

With reasoning:

[
  {"system": "You are a precise DSL parser.", "in": "today at 5:30PM", "reasoning": "base=TODAY\ntime.hour=17\ntime.minute=30", "out": "SET_TIME(TODAY, 17, 30)"}
]

JSON Format

[
  {
    "messages": [
      {"role": "system", "content": "You are a precise DSL parser."},
      {"role": "user", "content": "today at 5:30PM"},
      {"role": "assistant", "content": "SET_TIME(TODAY, 17, 30)"}
    ]
  }
]

With reasoning:

[
  {
    "messages": [
      {"role": "system", "content": "You are a precise DSL parser."},
      {"role": "user", "content": "today at 5:30PM"},
      {
        "role": "assistant",
        "content": "SET_TIME(TODAY, 17, 30)",
        "metadata": {
          "reasoning": "base=TODAY\ntime.hour=17\ntime.minute=30"
        }
      }
    ]
  }
]
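
Producing the JSON form from a wide-format row is a small mapping. A sketch assuming the field names shown above; dsl_row_to_messages is illustrative, not the adapter's method:

```python
def dsl_row_to_messages(row):
    """Convert one wide-format DSL row into a chat-style messages record,
    attaching metadata.reasoning when the optional '?' section is present."""
    assistant = {"role": "assistant", "content": row["out"]}
    if row.get("reasoning"):
        assistant["metadata"] = {"reasoning": row["reasoning"]}
    return {
        "messages": [
            {"role": "system", "content": row["system"]},
            {"role": "user", "content": row["in"]},
            assistant,
        ]
    }

record = dsl_row_to_messages({
    "system": "You are a precise DSL parser.",
    "in": "today at 5:30PM",
    "reasoning": "base=TODAY\ntime.hour=17\ntime.minute=30",
    "out": "SET_TIME(TODAY, 17, 30)",
})
```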

✅ Validation Rules

Each adapter enforces its own parsing rules:

  • ConversationAdapter: tag order, message required after each tag, conversation structure
  • SQNAAdapter: strictly > then <, per pair
  • DSLAdapter: each block must contain $, >, < in this order. $ ... reuses the previous system prompt. Values may span multiple lines; single-line values are written with a space after the tag when generating .dat files. When writing .dat files, consecutive identical system prompts are replaced by $ ... automatically.
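
For the DSL adapter, the tag-order rule can be checked mechanically. A toy validator under the assumptions above; validate_dsl_block is an illustrative name, not part of the library:

```python
def validate_dsl_block(tags):
    """Return True if a block's tags appear as $, >, < in order,
    with an optional '?' reasoning tag between input and output."""
    return tuple(tags) in (("$", ">", "<"), ("$", ">", "?", "<"))

print(validate_dsl_block(["$", ">", "<"]))       # True
print(validate_dsl_block(["$", ">", "?", "<"]))  # True
print(validate_dsl_block(["$", "<", ">"]))       # False
```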

🧪 Tests

pytest tests/

✅ License

MIT — see LICENSE


📄 Third-Party Disclaimer

This project is not affiliated with or endorsed by Hugging Face or the Python Software Foundation.
It builds on their open-source technologies under their respective licenses.
