This homework explores three complementary approaches to controlling the output of Large Language Models, progressing from token-level constraints to multi-sample selection to iterative self-correction with formal verification feedback.
The homework is organized in three parts, each building on concepts from the previous one:
| Part | Folder | Topic | Key idea |
|---|---|---|---|
| 1 | structured_generation/ |
Structured Generation | Constrain each token at generation time using regex-compiled FSMs |
| 2 | meta_generation/ |
Meta-Generation | Generate multiple samples and select the best using ranking strategies |
| 3 | self_correction/ |
Self-Correction | Iteratively refine code using verifier error feedback and tree search |
📄 Complete instructions: llm_controlled_generation.pdf
LLMs generate text token by token, sampling from the full vocabulary at each step. Structured generation guarantees that every generated token is part of a valid output by constraining the sampling at each step. You will build a pipeline that compiles a regex into a token-level finite state machine, then use it to mask logits during generation with a pure-NumPy GPT-2 implementation.
Modules: naive vs. DFA masking, character-to-token FSM compilation, coalescence & forced-token optimization, end-to-end constrained generation with picoGPT.
When a single LLM sample may or may not be correct, a natural idea is to generate multiple samples and select the best one. You will implement and compare four selection strategies — pass@1, best-of-n by log-probability, and two Minimum Bayes Risk (MBR) decoders using edit similarity and execution similarity — evaluated on the MBPP coding benchmark.
Modules: parallel sample generation, ranking by model confidence, MBR with Levenshtein distance, MBR with execution agreement.
Some tasks require provably correct code. You will build a tree-search refinement pipeline that generates formally verified Rust code using Verus. The LLM generates candidates, Verus checks them, and compiler error messages drive iterative refinement. A REBASE sampling strategy focuses search effort on the most promising branches.
Modules: parallel initial generation, Verus integration and scoring, code validation and trivial-proof detection, iterative refinement with tree search and REBASE.
Clone the repo and then run:
cd llm_controlled-generation
uv syncInstalled automatically via uv sync:
numpy,scipy— numerical computationregex,interegular— regex parsing and DFA construction (Part 1)transformers— tokenizer and model weights (Part 1)datasets— MBPP benchmark (Part 2)jellyfish— Levenshtein distance (Part 2)openai— OpenAI-compatible API client for Ollama (Parts 2 and 3)
ollama pull qwen2.5-coder:7b
ollama serveEach module can be run independently:
uv run python structured_generation/deterministic_finite_automaton.py # Naive vs. DFA masking benchmark
uv run python structured_generation/fsm_token.py # Character-to-token FSM demo
uv run python structured_generation/coalescence.py # Coalescence benchmark on GPT-2 vocab
uv run python structured_generation/picoGPT_generate.py # End-to-end constrained generationuv run python meta_generation/meta_generation.py # Evaluate selection strategies on MBPPuv run python self_correction/self_correction.py --verus-path /path/to/verusRun all required tests:
uv run pytest tests/test_structured_generation_required.py # Part 1
uv run pytest tests/test_meta_generation_required.py # Part 2
uv run pytest tests/test_self_correction_required.py # Part 3Or run all benchmarks and tests at once:
./test_and_submit.shllm_controlled-generation/
├── structured_generation/
│ ├── structured_generation.md # Part 1 description
│ ├── deterministic_finite_automaton.py # Naive vs. DFA masking
│ ├── fsm_token.py # Character-level DFA to token-level FSM
│ ├── coalescence.py # Mask deduplication and forced tokens
│ ├── picoGPT_generate.py # GPT-2 in NumPy + constrained generation
│ └── utils.py # GPT-2 weight loading
├── meta_generation/
│ ├── meta_generation.md # Part 2 description
│ ├── meta_generation.py # Sample generation and ranking strategies
│ └── utils.py # Prompt construction and safe execution
├── self_correction/
│ ├── self_correction.md # Part 3 description
│ ├── self_correction.py # Tree-search refinement pipeline
│ ├── utils.py # Verus integration, validation, scoring
│ └── examples/
│ └── incr_list.rs # Example Verus specification
├── tests/ # Test suite
├── pyproject.toml
└── README.md # This file
Apache 2.0