Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
121 changes: 121 additions & 0 deletions benchmarks/specfs_bench/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,121 @@
# SpecFS Benchmark

## Scenario Description

Formal specifications are a cornerstone of building trustworthy systems software — particularly file systems, where subtle correctness bugs can cause data loss or security breaches. SpecFS Bench evaluates whether LLMs can translate structured, Hoare-logic-style specifications of file system modules into correct, modular C implementations that match expert-written ground truth.

### Task Details

- **Input**
- Specification files (`.spec`) from `data/spec/`
- Generation / Judge prompt template
- Generation / Judge Model
- **Output**
- Generated C source file per spec under `outputs/<run>/generated/<relative_path>.c`
- Per-case generation records (`generation_results.jsonl`)
- Per-case judge verdicts (`judge_results.jsonl`)
- Aggregated summary (`summary.json`)
- **Evaluation**
- `LLM-as-a-judge`: compares generated code against ground-truth C code under the same spec, scoring functional equivalence on a 0–10 scale with verdict (`pass` / `partial` / `fail`), a short rationale, and a list of differences.

### Key Features

- Structured spec format combining natural-language intent with formal pre/post-conditions
- Modular generation: the model produces a single module while reusing pre-defined APIs from other modules listed in `[RELY]`
- Fully automated two-phase pipeline: code generation → LLM-based judging
- Configurable generation and judge models

## Benchmark Setup

### Prerequisites

- Python 3.9+
- LLM endpoint configured in `env.toml`

### Configuration

Copy the example config and fill in credentials:

```bash
cp env.toml.example env.toml
```

Edit `benchmarks/specfs/env.toml`:

```toml
[llm]
OPENAI_API_KEY = "your_key_here"
OPENAI_API_BASE = "your_url_here"
```

If no local `env.toml` is found, the benchmark falls back to the workspace-level `env.toml` one directory above, and finally to environment variables.

### Install Dependencies

```bash
cd benchmarks/specfs
./install.sh
```

This script creates `.venv/` and installs `benchmarks/specfs/requirements.txt`.

### Run the Benchmark

```bash
./run.sh <model_name>
```

The wrapper invokes `src/main.py`, which:

1. Discovers all `.spec` files under `data/spec/` recursively.
2. Generates a C implementation for each spec using the generation prompt.
3. Judges each generated file against the corresponding ground truth under `data/code/` using the judge prompt.
4. Writes results under:

```
outputs/specfsbench__<model>__judge_<judge_model>__<timestamp>/
├── generated/ # Generated C files mirroring data/spec/ layout
├── generation_results.jsonl
├── judge_results.jsonl
└── summary.json # Aggregated stats and average score
```

## Evaluation Metrics

SpecFS Bench uses an LLM-as-a-judge protocol to score each generated module against the reference implementation under the same spec.

For every case, the judge receives the spec, the generated code, and the ground-truth code, and returns a structured JSON verdict:

```json
{
"score": 0-10,
"verdict": "pass" | "partial" | "fail",
"summary": "short rationale",
"differences": ["difference 1", "difference 2"]
}
```

Aggregated metrics can be found in `summary.json`.


## Project Structure

```
benchmarks/specfs/
├── data/
│ ├── spec/ # Spec files (.spec) organized by category
│ └── code/ # Ground-truth C implementations (mirrors data/spec/)
├── src/
│ ├── main.py # Entry point: discover → generate → judge → summarize
│ ├── evaluator.py # LLM-as-a-judge evaluator
│ ├── genspec_prompt.md # Prompt template for code generation
│ └── judge_prompt.md # Prompt template for the LLM judge
├── tests/
├── logs/ # Run and install logs
├── outputs/ # Per-run outputs (generated code, jsonl, summary)
├── env.toml.example # Example configuration
├── install.sh # Environment setup script
├── run.sh # Benchmark entry wrapper
├── requirements.txt
└── README.md
```
21 changes: 21 additions & 0 deletions benchmarks/specfs_bench/data/code/file/file_allocate.c
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
#include "file.h"

void file_allocate(struct inode *node, unsigned offset, unsigned len) {
unsigned start_page = offset / PG_SIZE;
unsigned end_page = (offset + len - 1) / PG_SIZE;

if (end_page >= INDEXTB_NUM) {
end_page = INDEXTB_NUM - 1;
}

for (unsigned i = start_page; i <= end_page; i++) {
if (node->file->index[i] == NULL) {
node->file->index[i] = malloc_page();
}
}

unsigned new_size = offset + len;
if (new_size > node->size) {
node->size = new_size;
}
}
7 changes: 7 additions & 0 deletions benchmarks/specfs_bench/data/code/file/file_clear.c
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
#include "file.h"

void file_clear(struct inode *node, unsigned start, unsigned len) {
// According to [RELY], file_write writes zeroes if data is NULL.
// We utilize this behavior to clear the specified range.
file_write(node, start, len, NULL);
}
22 changes: 22 additions & 0 deletions benchmarks/specfs_bench/data/code/file/file_read.c
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
#include "file.h"

void file_read(struct inode *node, unsigned offset, unsigned len, char *data) {
unsigned current_offset = offset;
unsigned remaining = len;
char *dst = data;

while (remaining > 0) {
unsigned page_idx = current_offset >> 12;
unsigned page_offset = current_offset & 0xFFF;
unsigned char *page_ptr = node->file->index[page_idx];

unsigned bytes_in_page = PG_SIZE - page_offset;
unsigned to_copy = (remaining < bytes_in_page) ? remaining : bytes_in_page;

memcpy(dst, page_ptr + page_offset, to_copy);

dst += to_copy;
current_offset += to_copy;
remaining -= to_copy;
}
}
35 changes: 35 additions & 0 deletions benchmarks/specfs_bench/data/code/file/file_write.c
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
#include "file.h"

void file_write(struct inode *node, unsigned offset, unsigned len, const char *data) {
if (len == 0 || node == NULL || node->file == NULL) {
return;
}

struct indextb *tb = node->file;
unsigned current_offset = offset;
unsigned remaining = len;
unsigned data_idx = 0;

while (remaining > 0) {
unsigned page_idx = current_offset / PAGE_SIZE;
unsigned page_offset = current_offset % PAGE_SIZE;
unsigned bytes_in_page = PAGE_SIZE - page_offset;
unsigned bytes_to_write = (remaining < bytes_in_page) ? remaining : bytes_in_page;

unsigned char *page = tb->index[page_idx];

if (data != NULL) {
for (unsigned i = 0; i < bytes_to_write; i++) {
page[page_offset + i] = data[data_idx + i];
}
} else {
for (unsigned i = 0; i < bytes_to_write; i++) {
page[page_offset + i] = 0;
}
}

current_offset += bytes_to_write;
remaining -= bytes_to_write;
data_idx += bytes_to_write;
}
}
28 changes: 28 additions & 0 deletions benchmarks/specfs_bench/data/code/inode/inode_delete.c
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
#include "inode.h"

struct inode* inode_delete(struct inode* inum, char* name) {
unsigned int n = hash_name(name);
struct entry *prev = NULL;
struct entry *curr = inum->dir->tb[n];

while (curr != NULL) {
if (strcmp(curr->name, name) == 0) {
struct inode* ret_inum = curr->inum;

if (prev == NULL) {
inum->dir->tb[n] = curr->next;
} else {
prev->next = curr->next;
}

free_entry(curr);
inum->size--;

return ret_inum;
}
prev = curr;
curr = curr->next;
}

return NULL;
}
15 changes: 15 additions & 0 deletions benchmarks/specfs_bench/data/code/inode/inode_find.c
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
#include "inode.h"

struct inode *inode_find(struct inode *node, char *name) {
unsigned int n = hash_name(name);
struct entry *curr = node->dir->tb[n];

while (curr != NULL) {
if (strcmp(curr->name, name) == 0) {
return (struct inode *)curr->inum;
}
curr = curr->next;
}

return NULL;
}
25 changes: 25 additions & 0 deletions benchmarks/specfs_bench/data/code/inode/inode_insert.c
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
#include "inode.h"

int inode_insert(struct inode* cur, struct inode* inum, char* name) {
unsigned int n = hash_name(name);
struct entry *new_entry = malloc_entry();

if (new_entry == NULL) {
return 1;
}

char *new_name = malloc_string(name);
if (new_name == NULL) {
free(new_entry);
return 1;
}

new_entry->name = new_name;
new_entry->inum = inum;
new_entry->next = cur->dir->tb[n];

cur->dir->tb[n] = new_entry;
cur->size++;

return 0;
}
22 changes: 22 additions & 0 deletions benchmarks/specfs_bench/data/code/inode/inode_read.c
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
#include "inode.h"

struct read_ret* inode_read(struct inode* node, unsigned len, unsigned offset) {
struct read_ret* ret = malloc_readret();

if (offset >= node->size || len == 0) {
ret->num = 0;
ret->buf = NULL;
return ret;
}

unsigned remaining = node->size - offset;
unsigned actual_len = (len < remaining) ? len : remaining;

char* buf = malloc_buffer(actual_len);
file_read(node, offset, actual_len, buf);

ret->buf = buf;
ret->num = actual_len;

return ret;
}
13 changes: 13 additions & 0 deletions benchmarks/specfs_bench/data/code/inode/inode_truncate.c
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
#include "inode.h"

void inode_truncate(struct inode* node, unsigned size) {
if (size < node->size) {
file_clear(node, size, node->size - size);
node->size = size;
} else if (size > node->size) {
unsigned diff = size - node->size;
file_allocate(node, node->size, diff);
file_clear(node, node->size, diff);
node->size = size;
}
}
34 changes: 34 additions & 0 deletions benchmarks/specfs_bench/data/code/inode/inode_write.c
Original file line number Diff line number Diff line change
@@ -0,0 +1,34 @@
#include "inode.h"

unsigned inode_write(struct inode* node, const char* buffer, unsigned len, unsigned offset) {
unsigned end = offset + len;

// Calculate the new size, capped at MAX_FILE_SIZE
if (end > MAX_FILE_SIZE) {
end = MAX_FILE_SIZE;
}

// If the offset is beyond the calculated end, no bytes can be written
if (offset >= end) {
return 0;
}

unsigned new_size = end;
unsigned written = new_size - offset;

// If the write extends beyond the current file size, allocate and clear new space
if (new_size > node->size) {
unsigned alloc_start = node->size;
unsigned alloc_len = new_size - node->size;

file_allocate(node, alloc_start, alloc_len);
file_clear(node, alloc_start, alloc_len);

node->size = new_size;
}

// Write the data to the file
file_write(node, offset, written, buffer);

return written;
}
29 changes: 29 additions & 0 deletions benchmarks/specfs_bench/data/code/interface-util/check_del.c
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
#include "interface-util.h"

int check_del(struct inode *cur, char *name) {
struct inode *target;

// Pre-condition: cur lock is already held

if (cur == NULL) {
return 1;
}

target = inode_find(cur, name);

if (target == NULL) {
unlock(cur);
return 1;
}

// Check if deletion is permissible (file or empty directory)
if (target->mode != DIR_MODE || target->size == 0) {
// Success: lock target, return with both locks held
lock(target);
return 0;
} else {
// Failure: non-empty directory, release cur lock
unlock(cur);
return 1;
}
}
Loading
Loading