`fifo-tool-datasets` provides standardized adapters to convert plain `.dat` files into formats compatible with LLM training, including Hugging Face `datasets.Dataset` objects and JSON message arrays.

It supports both:

- A **Python SDK** for structured loading and conversion
- A **CLI** to upload/download `.dat` files to/from the Hugging Face Hub

`.dat` files are plain-text datasets designed for LLM fine-tuning. They come in three styles:

- `sqna` (single-turn): prompt-response pairs
- `conversation` (multi-turn): role-tagged chat sessions
- `dsl` (structured): system → input → DSL output triplets

These files are human-editable, diffable, and ideal for version control, especially during dataset development and iteration.

This tool enables a complete round-trip workflow:

- Create and edit a `.dat` file locally
- Convert and upload it as a training-ready Hugging Face `datasets.Dataset`
- Later, download and deserialize it back into `.dat` for further edits

This gives you the best of both worlds:

- Easy editing and version control via `.dat`
- Compatibility with HF pipelines using `load_dataset()`

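Once uploaded, the dataset behaves like any other Hub dataset. A minimal sketch, assuming a placeholder repo ID `username/my-dataset`:

```python
from datasets import load_dataset

# Load all splits as a DatasetDict (repo ID is a placeholder)
splits = load_dataset("username/my-dataset")

print(splits)              # train / validation / test splits
print(splits["train"][0])  # first training record
```
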
See format examples below in each adapter section.
## Table of Contents

- [Project Status & Audience](#project-status--audience)
- [Dataset Formats](#dataset-formats)
- [Conversion Matrix](#conversion-matrix)
- [Installation](#installation)
- [CLI Usage](#cli-usage)
- [SDK Usage](#sdk-usage)
- [Available Adapters](#available-adapters)
- [Validation Rules](#validation-rules)
- [Tests](#tests)
- [License](#license)
- [Third-Party Disclaimer](#third-party-disclaimer)

## Project Status & Audience

🚧 **Work in Progress**: This project is in early development. 🚧

This is a personal project developed and maintained by a solo developer.
Contributions, ideas, and feedback are welcome, but development is driven by personal time and priorities.
Designed for individual users experimenting with LLM fine-tuning and creating their own fine-tuning datasets by hand.
No official release or pre-release has been published yet. The tool is provided for preview and experimentation.
Use at your own risk.
## Dataset Formats

| Format | Description |
|---|---|
| `.dat` | Editable plain-text format with tags (e.g. `>`, `<`, `---`) |
| `dataset` | Hugging Face `datasets.Dataset` object, used for fine-tuning |
| `wide_dataset` | Flattened `Dataset` with one row per message; format depends on adapter |
| `json` | A list of message dictionaries |
| `hub` | A `DatasetDict` with `train`, `validation`, and `test` splits |

All datasets uploaded to the Hub, if not already split, are automatically divided into `train`, `validation`, and `test` partitions using the wide format.

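A quick way to check the resulting splits from Python, using the SDK call shown in the SDK Usage section below (the repo ID is a placeholder):

```python
from fifo_tool_datasets.sdk.hf_dataset_adapters.dsl import DSLAdapter

# Download the auto-generated train/validation/test splits
splits = DSLAdapter().from_hub_to_dataset_dict("username/my-dataset")

for name in ("train", "validation", "test"):
    print(name, len(splits[name]))
```
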
## Conversion Matrix

| From \ To | dataset | wide_dataset | dat | hub | json |
|---|---|---|---|---|---|
| dataset | ✅ | ✅ | 🧩 | ✅ | ✅ |
| wide_dataset | ✅ | ✅ | ✅ | ✅ | ✅ |
| dat | 🧩 | ✅ | ✅ | 🧩 | ✅ |
| hub | 🧩📦 | ✅📦 | ✅ | ✅ | ✅ |
| json | ✅ | ✅ | ✅ | ✅ | ✅ |

Legend:

- ✅ direct: single-step conversion. Hover to view the function name.
- 🧩 indirect: composed of helper conversions. Hover to view the function name.
- 📦 returns dict: result is a `DatasetDict`.

## Installation

Install both the CLI and SDK in one step:

```bash
git clone https://github.com/gh9869827/fifo-tool-datasets.git
cd fifo-tool-datasets
python3 -m pip install -e .
```

This enables the `fifo-tool-datasets` command.

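A quick way to confirm the editable install succeeded is to import the SDK (the import path is the one used in the SDK Usage section below):

```python
# Smoke test: should print the class without raising ImportError
from fifo_tool_datasets.sdk.hf_dataset_adapters.dsl import DSLAdapter

print(DSLAdapter)
```
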
## CLI Usage

```bash
fifo-tool-datasets <command> [options]
```

### upload

Upload a local `.dat` file or directory to the Hugging Face Hub.

```bash
fifo-tool-datasets upload <src> <dst> [--adapter <adapter>] --commit-message <msg> [--seed <int>]
```

The source must be an existing local path and the destination must be in `username/repo` format. If `--adapter` is omitted when uploading a directory, it is read from `.hf_meta.json`.

### download

Download a dataset from the Hugging Face Hub.

```bash
fifo-tool-datasets download <src> <dst> [--adapter <adapter>] [-y]
```

The source must be in `username/repo` format. The destination can be a `.dat` file (merged) or a directory (one `.dat` per split). When downloading to a directory and `--adapter` is omitted, the CLI tries to read the adapter from the local `.hf_meta.json` file (if present from a previous download).

### push

```bash
fifo-tool-datasets push [<dir>] --commit-message <msg> [-y]
```

Push the dataset from the specified directory to the Hub using the `repo_id` and adapter from `.hf_meta.json`. Defaults to the current directory.

### pull

```bash
fifo-tool-datasets pull [<dir>]
```

Pull the dataset into the specified directory using the `repo_id` and adapter from `.hf_meta.json`. Defaults to the current directory.

### split

Split a single `.dat` file into separate train, validation, and test `.dat` files.

```bash
fifo-tool-datasets split <src> --adapter <adapter> [--to <dir>] [--split-ratio <train> <val> <test>] [-y]
```

The default split ratio is `[0.7, 0.15, 0.15]` if `--split-ratio` is omitted.

### merge

Recombine split `.dat` files into a single dataset.

```bash
fifo-tool-datasets merge <dir> --adapter <adapter> [--to <file>] [-y]
```

### sort

Sort the samples of a DSL `.dat` file by their full content: system prompt, user input, and assistant response. Sorting is done in place, meaning the original file is overwritten with the sorted result.

You can provide either a single file or a directory. If a directory is given, all `.dat` files within it will be sorted in place.

```bash
fifo-tool-datasets sort <path> [--adapter dsl]
```

Currently, only the `dsl` adapter is supported. If the `--adapter` flag is omitted, it defaults to `dsl`.

### info

Show record counts and metadata for a `.dat` file or a split dataset directory. If `<path>` is omitted, the current directory is used by default.

```bash
fifo-tool-datasets info [<path>]
```

### diff

Compare local files against the Hugging Face Hub.

```bash
fifo-tool-datasets diff [<dir>] [--type head|cache]
```

By default, `--type head` compares against the latest files on the Hub.
Use `--type cache` to compare against the commit recorded in `.hf_meta.json`.
Remote splits are converted to temporary `.dat` files using the adapter so the comparison aligns with the local format.
Downloaded files are saved temporarily under the target directory and removed afterward.

When using `upload` or `download` with a directory source or target, the CLI automatically:

- Uploads `README.md` and `LICENSE` files from the source directory if they exist
- Downloads `README.md` and `LICENSE` files from the Hub if they are present in the remote repository
- Creates a `.hf_meta.json` file when downloading, storing the adapter, repo ID, download timestamp, and commit hash
- Updates `.hf_meta.json` with the new commit hash and timestamp after a successful upload
- Uses that metadata to verify the remote commit before upload
- Auto-detects the adapter on download if `--adapter` isn't provided
- Blocks uploads if the remote has changed since download, unless `-y` is passed to override
- Diffs local vs. remote documentation files and skips the upload if content has not changed

This ensures smooth syncing of documentation while minimizing the risk of overwriting others' changes.

⚠️ Note: While this provides lightweight safety for collaborative workflows, it does not offer Git-level guarantees. For strict versioning, conflict resolution, or rollback, consider using `git clone` and managing pushes manually.

### Examples

```bash
# Upload
fifo-tool-datasets upload dsl.dat username/my-dataset --adapter dsl --commit-message "init"
# Download (auto-detected adapter)
fifo-tool-datasets download username/my-dataset ./dsl_dir
# Download (explicit adapter override)
fifo-tool-datasets download username/my-dataset ./dsl_dir --adapter dsl
# Push updated data
fifo-tool-datasets push ./dsl_dir --commit-message "update"
# Pull latest version
fifo-tool-datasets pull ./dsl_dir
# Split
fifo-tool-datasets split dsl.dat --adapter dsl --to split_dsl
# Merge
fifo-tool-datasets merge split_dsl --adapter dsl --to full.dsl.dat
# Sort
fifo-tool-datasets sort dsl.dat --adapter dsl
# Info
fifo-tool-datasets info split_dsl
# Diff against the hub
fifo-tool-datasets diff split_dsl --type head
```

## SDK Usage

```python
from fifo_tool_datasets.sdk.hf_dataset_adapters.dsl import DSLAdapter

adapter = DSLAdapter()
# Upload to the Hugging Face Hub
adapter.from_dat_to_hub(
"dsl.dat",
"username/my-dataset",
commit_message="initial upload"
)
# Download from the Hub as a DatasetDict (train/validation/test)
splits = adapter.from_hub_to_dataset_dict("username/my-dataset")
# Access splits for fine-tuning
train_dataset = splits["train"]
test_dataset = splits["test"]
# You can now use train_dataset / test_dataset to fine-tune your LLM
# e.g., with Hugging Face Transformers Trainer, SFTTrainer, etc.
# You can also directly load from a local .dat file
dataset = adapter.from_dat_to_dataset("dsl.dat")
# Convert to structured JSON format
json_records = adapter.from_wide_dataset_to_json(dataset)
```

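Continuing from the snippet above, the loaded `Dataset` can be inspected directly. A minimal sketch, assuming the file parses into the wide-format columns shown in the `dsl` adapter section (`system`, `in`, `out`) and contains at least two records:

```python
# Peek at the wide-format rows produced by from_dat_to_dataset
print(dataset.column_names)           # e.g. ['system', 'in', 'out']

for row in dataset.select(range(2)):  # first two records
    print(row["in"], "->", row["out"])
```
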
## Available Adapters

### conversation

Example `.dat` file:

```
---
$
You are a helpful assistant.
>
Hi
<
Hello!
---
```

Wide format:

```json
[
{"id_conversation": 0, "id_message": 0, "role": "system", "content": "You are a helpful assistant."},
{"id_conversation": 0, "id_message": 1, "role": "user", "content": "Hi"},
{"id_conversation": 0, "id_message": 2, "role": "assistant", "content": "Hello!"}
]
```

JSON format:

```json
[
{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hi"},
{"role": "assistant", "content": "Hello!"}
]
}
]
```

### sqna

Example `.dat` file:

```
>What is 2+2?
<4
```

Wide format:

```json
[
{"in": "What is 2+2?", "out": "4"}
]
```

JSON format:

```json
[
{
"messages": [
{"role": "user", "content": "What is 2+2?"},
{"role": "assistant", "content": "4"}
]
}
]
```

### dsl

Example `.dat` file:

```
---
$ You are a precise DSL parser.
> today at 5:30PM
< SET_TIME(TODAY, 17, 30)
---
```

The DSL format optionally supports a `?` (reasoning) section to capture chain-of-thought or intermediate reasoning steps:

```
---
$ You are a precise DSL parser.
> today at 5:30PM
?
base=TODAY
time.hour=17
time.minute=30
< SET_TIME(TODAY, 17, 30)
---
```

The `?` section:

- Is optional and appears between the `>` (input) and `<` (output) sections
- Follows the same formatting rules as other sections (single-line or multi-line)
- Is stored in the wide format under the `reasoning` field
- Is stored inline in the assistant message's `metadata.reasoning` field in JSON format

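For illustration, here is how the reasoning trace can be read back from a JSON record; the record structure mirrors the "With reasoning" JSON example further down:

```python
# A record in the JSON (messages) format, with an optional reasoning trace
record = {
    "messages": [
        {"role": "system", "content": "You are a precise DSL parser."},
        {"role": "user", "content": "today at 5:30PM"},
        {
            "role": "assistant",
            "content": "SET_TIME(TODAY, 17, 30)",
            "metadata": {"reasoning": "base=TODAY\ntime.hour=17\ntime.minute=30"},
        },
    ]
}

# The reasoning lives on the assistant message, under metadata.reasoning
assistant = record["messages"][-1]
print(assistant["metadata"]["reasoning"])
```
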
Multi-line entries are supported and can be freely mixed with single-line ones. A space after the marker on single-line entries is optional:

```
---
$
multi-line system
prompt
> single-line input
? single-line reasoning
<single-line output
---
```

To reuse the previous system prompt across multiple samples, use `...`:

```
---
$ first prompt
> q1
< a1
---
$ ...
> q2
< a2
---
$
...
> q3
< a3
---
```

Any `$` block that contains only `...`, either directly after the `$` or on the following line, will inherit the most recent explicitly defined system prompt.

- At least one non-`...` system prompt is required in the file.
- When generating `.dat` files, consecutive identical system prompts are automatically collapsed into `$ ...`.

Wide format:

```json
[
{"system": "You are a precise DSL parser.", "in": "today at 5:30PM", "out": "SET_TIME(TODAY, 17, 30)"}
]
```

With reasoning:

```json
[
{"system": "You are a precise DSL parser.", "in": "today at 5:30PM", "reasoning": "base=TODAY\ntime.hour=17\ntime.minute=30", "out": "SET_TIME(TODAY, 17, 30)"}
]
```

JSON format:

```json
[
{
"messages": [
{"role": "system", "content": "You are a precise DSL parser."},
{"role": "user", "content": "today at 5:30PM"},
{"role": "assistant", "content": "SET_TIME(TODAY, 17, 30)"}
]
}
]
```

With reasoning:

```json
[
{
"messages": [
{"role": "system", "content": "You are a precise DSL parser."},
{"role": "user", "content": "today at 5:30PM"},
{
"role": "assistant",
"content": "SET_TIME(TODAY, 17, 30)",
"metadata": {
"reasoning": "base=TODAY\ntime.hour=17\ntime.minute=30"
}
}
]
}
]
```

## Validation Rules

Each adapter enforces its own parsing rules:

- `ConversationAdapter`: tag order, message required after each tag, conversation structure
- `SQNAAdapter`: strictly `>` then `<`, per pair
- `DSLAdapter`: each block must contain `$`, `>`, `<` in this order. `$ ...` reuses the previous system prompt. Values may span multiple lines. When generating `.dat` files, single-line values are written with a space after the tag, and consecutive identical system prompts are replaced by `$ ...` automatically.

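As an illustration of these rules, the sketch below feeds a malformed `sqna` file to its adapter. The `sqna` import path and the exact exception type are assumptions (only the `dsl` import path appears in this README):

```python
from pathlib import Path

# Assumed import path, by analogy with ...hf_dataset_adapters.dsl
from fifo_tool_datasets.sdk.hf_dataset_adapters.sqna import SQNAAdapter

# A '<' response with no preceding '>' prompt violates the
# "strictly > then <" rule enforced by SQNAAdapter
Path("bad.dat").write_text("<4\n")

try:
    SQNAAdapter().from_dat_to_dataset("bad.dat")
except Exception as err:  # exact exception type not documented here
    print("rejected:", err)
```
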
## Tests

```bash
pytest tests/
```

## License

MIT. See the `LICENSE` file.

## Third-Party Disclaimer

This project is not affiliated with or endorsed by Hugging Face or the Python Software Foundation.
It builds on their open-source technologies under their respective licenses.