DeAI — Decision AI for Turkish Legal Question Answering

An end-to-end retrieval-augmented generation (RAG) system for Turkish legal QA, with domain-adapted dense embeddings, cross-encoder reranking, and QLoRA-tuned generation.

🎯 Ablation Results · 🛠️ CLI Usage · 📊 Evaluation Rubric

Highlights

The three independently fine-tuned components are evaluated under the rubric's Scenario 1 formula, F = 0.35·R + 0.4·(A·R) + 0.25·(G·R), on a 240-question instructor gold benchmark:

Metric	Baseline	Full Pipeline	Δ
Retrieval R = (Recall@10 + MRR@10) / 2	0.7352	0.8483	+11.3 pts
Reranker held-out F1	0.331	0.911	+57.9 pts
Answer F1 (vs. gold)	0.406	0.452	+11.3% relative
Citation compliance	68%	78%	+10 pp
Scenario 1 Final F	0.6078	0.6130	+0.9%
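
The Δ column mixes two units: absolute differences in points and relative changes in percent. As a quick check, the sketch below recomputes three of the deltas from the rounded values in the table. The helper names are illustrative and not part of the repository.

```python
def points(baseline: float, full: float) -> float:
    """Absolute change on a 0-1 metric, expressed in points (x100)."""
    return round((full - baseline) * 100, 1)


def relative(baseline: float, full: float) -> float:
    """Relative change, in percent of the baseline value."""
    return round((full - baseline) / baseline * 100, 1)


print(points(0.7352, 0.8483))    # retrieval R: 11.3
print(relative(0.406, 0.452))    # answer F1: 11.3
print(relative(0.6078, 0.6130))  # Scenario 1 final F: 0.9
```

Note that the two 11.3 figures are unrelated: the first is an absolute gain in R, the second a relative gain in answer F1.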

System Overview

Question
  │
  ▼
┌─────────────────┐
│  Embedding Layer │  ← BAAI/bge-m3 (fine-tuned)
│  (Hybrid Search) │     Dense + Sparse + ColBERT
└────────┬────────┘
         │  top-k passages
         ▼
┌─────────────────┐
│    Reranker      │  ← Jina Reranker v2 (fine-tuned)
│  (Cross-Encoder) │     Token-level query-doc interaction
└────────┬────────┘
         │  top-n reranked
         ▼
┌─────────────────┐
│   LLM Generator  │  ← Qwen3-4B (QLoRA fine-tuned)
│  (RAG Prompting) │     Citation-aware answer generation
└────────┬────────┘
         │
         ▼
   Grounded Answer
   with [1][2] citations

The query is encoded by bge-m3, which produces both dense and sparse representations for hybrid retrieval. FAISS then returns the top-k candidate chunks ranked by inner product over L2-normalized vectors (equivalent to cosine similarity). A cross-encoder reranker re-scores the candidates using full token-level attention between the query and each passage, which is more accurate than the bi-encoder dot product but too expensive to apply at corpus scale. Finally, the LLM generates an answer grounded in the reranked passages and emits inline citation markers (e.g., [1][2]) that point back to the source chunks.
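
The equivalence mentioned above (inner product over L2-normalized vectors equals cosine similarity) and the retrieve-then-rerank split can be illustrated with a dependency-free sketch. It stands in for FAISS and the cross-encoder with plain Python; the function names, the toy vectors, and the word-overlap `score_fn` are assumptions for illustration and are not taken from the repository.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def inner_product(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: Vector) -> list[float]:
    norm = math.sqrt(inner_product(v, v))
    return [x / norm for x in v]


def cosine(a: Vector, b: Vector) -> float:
    return inner_product(a, b) / math.sqrt(inner_product(a, a) * inner_product(b, b))


def retrieve(query_vec: Vector, chunk_vecs: Sequence[Vector], k: int) -> list[int]:
    """Stage 1: top-k chunk indices by inner product over normalized vectors."""
    q = l2_normalize(query_vec)
    scored = [(inner_product(q, l2_normalize(c)), i) for i, c in enumerate(chunk_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]


def rerank(query: str, candidates: list[int], texts: Sequence[str],
           score_fn: Callable[[str, str], float], n: int) -> list[int]:
    """Stage 2: re-score only the k candidates with a costlier pairwise scorer."""
    return sorted(candidates, key=lambda i: score_fn(query, texts[i]), reverse=True)[:n]


def word_overlap(query: str, passage: str) -> float:
    """Toy stand-in for a cross-encoder score."""
    return float(len(set(query.split()) & set(passage.split())))


texts = ["vergi usul kanunu", "iş sözleşmesi feshi", "kıdem tazminatı koşulları"]
chunk_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 2.0]]  # toy 2-d embeddings
query_vec = [0.5, 0.9]

candidates = retrieve(query_vec, chunk_vecs, k=2)
print(candidates)                                                  # [1, 2]
print(rerank("kıdem tazminatı", candidates, texts, word_overlap, n=2))  # [2, 1]
```

The first stage touches every chunk but only with a cheap dot product; the second stage runs the expensive scorer on k items only, which is why the cross-encoder is reserved for the shortlist.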

Three stages, three independently fine-tuned components:

Dense retrieval — bge-m3 fine-tuned with hard-negative MNRL on 1,853 instructor-supplied (query, positive, hard-negative) triples.
Cross-encoder reranking — jina-reranker-v2-base-multilingual fine-tuned with binary cross-entropy on 6,076 (query, candidate, relevance) triples.
Grounded generation — Qwen3-4B-Instruct-2507 adapted via QLoRA (r=16, all attention + MLP modules, 33M trainable params) on 4,900 supervised QA examples with Kaynak: citation format.
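
For readers unfamiliar with hard-negative MNRL (multiple negatives ranking loss), the following is a minimal pure-Python sketch of the objective, not the repository's training code. Each query must pick its own positive out of all positives and hard negatives in the batch. The cosine similarity and `scale=20.0` mirror the defaults of sentence-transformers' `MultipleNegativesRankingLoss`; whether the project overrides them is not stated here, so treat both as assumptions.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cos_sim(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mnrl_loss(queries: Sequence[Vector], positives: Sequence[Vector],
              hard_negatives: Sequence[Vector], scale: float = 20.0) -> float:
    """Mean cross-entropy where query i must rank positive i above every
    other positive and every hard negative in the batch."""
    candidates = list(positives) + list(hard_negatives)
    total = 0.0
    for i, q in enumerate(queries):
        logits = [scale * cos_sim(q, c) for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]  # -log softmax at the correct index
    return total / len(queries)


queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
hard_negatives = [[-1.0, 0.0], [0.0, -1.0]]

well_aligned = mnrl_loss(queries, positives, hard_negatives)
mismatched = mnrl_loss(queries, positives[::-1], hard_negatives)
print(well_aligned < mismatched)  # True
```

Because the other queries' positives act as extra negatives, a batch of B triples gives each query 2B - 1 negatives at no additional encoding cost.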

All training fits on a single Tesla T4 (15.6 GB VRAM); total training time ~2 hours.

Quick Start

# Install
pip install -r requirements.txt

# Index a corpus (use instructor data or your own JSONL)
python scripts/deai_cli.py index --corpus data/instructor/corpus.jsonl \
  --output-dir data/processed/faiss_index

# Ask a question
python scripts/deai_cli.py query "Kıdem tazminatı hakkı ne zaman doğar?"

# Run the full ablation evaluation
python scripts/deai_cli.py eval --benchmark data/instructor/gold_benchmark.json

# Launch the interactive Gradio demo
python scripts/demo.py

The CLI accepts custom corpora and custom benchmarks (rubric requirements 3 and 4) — see docs/cli_usage.md.

Running with Your Own Data

The system loads all three fine-tuned models from the Hugging Face Hub (no local weights needed):

Embedder: anilkaracay/bge-m3-legal-tr
Reranker: anilkaracay/jina-reranker-legal-tr
LLM adapter: anilkaracay/qwen3-4b-legal-tr-qlora (on Qwen/Qwen3-4B-Instruct-2507)

Install

pip install -r requirements.txt
pip install -r requirements-inference.txt   # torch, peft, accelerate (GPU recommended)

1. Index your own corpus

Corpus format (JSONL, one document per line):

{"id": "doc_1", "text": "...", "title": "...", "metadata": {}}

python scripts/deai_cli.py index \
  --corpus your_corpus.jsonl \
  --output-dir your_index \
  --embedding-model anilkaracay/bge-m3-legal-tr
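
Before indexing a large corpus, it can help to validate the JSONL file first. The sketch below is a hypothetical pre-flight check, not a script shipped with the repository; it assumes that `id` and `text` are required and that `title` and `metadata` are optional, which the format line above does not state explicitly.

```python
import json
from typing import Iterable

REQUIRED_FIELDS = ("id", "text")  # assumption: "title" and "metadata" are optional


def validate_corpus_lines(lines: Iterable[str]) -> int:
    """Check JSONL corpus lines and return the number of documents found.

    Raises ValueError on the first malformed line, missing field, or duplicate id.
    """
    seen_ids = set()
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            doc = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_no}: invalid JSON ({exc.msg})") from exc
        if not isinstance(doc, dict):
            raise ValueError(f"line {line_no}: expected a JSON object")
        missing = [f for f in REQUIRED_FIELDS if not doc.get(f)]
        if missing:
            raise ValueError(f"line {line_no}: missing or empty {missing}")
        if doc["id"] in seen_ids:
            raise ValueError(f"line {line_no}: duplicate id {doc['id']!r}")
        seen_ids.add(doc["id"])
    return len(seen_ids)


sample = [
    '{"id": "doc_1", "text": "first document", "title": "A", "metadata": {}}',
    '{"id": "doc_2", "text": "second document"}',
]
print(validate_corpus_lines(sample))  # 2
```

To check a file on disk, pass the open handle instead of a list, for example `validate_corpus_lines(open("your_corpus.jsonl", encoding="utf-8"))`.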

2. Query (with full generation)

python scripts/deai_cli.py query "your question" \
  --index-dir your_index --generate

3. Evaluate on your own benchmark

Benchmark format (JSON array). Supported schema:

[{"question_id": "q1", "question": "...", "verified_answer": "...",
  "gold_sources": [{"corpus_row_id": "doc_1"}]}]

python scripts/deai_cli.py eval \
  --benchmark your_benchmark.json \
  --index-dir your_index \
  --output-dir eval_results \
  --top-k 5 --use-judge

This runs the 5-config ablation and writes a JSON + Markdown report with Scenario 1 scores (R, A, G, final) per config.
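
A common failure mode with custom benchmarks is gold sources that point at documents missing from the indexed corpus, which silently lowers retrieval scores. The following sketch is a hypothetical consistency check, assuming that each `corpus_row_id` is meant to match an `id` in the corpus file (as the `doc_1` examples above suggest); it is not part of the CLI.

```python
import json


def check_benchmark(benchmark: list[dict], corpus_ids: set[str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for item in benchmark:
        qid = item.get("question_id", "<missing question_id>")
        for field in ("question", "verified_answer"):
            if not item.get(field):
                problems.append(f"{qid}: empty or missing '{field}'")
        sources = item.get("gold_sources") or []
        if not sources:
            problems.append(f"{qid}: no gold_sources")
        for src in sources:
            row_id = src.get("corpus_row_id")
            if row_id not in corpus_ids:
                problems.append(f"{qid}: gold source {row_id!r} not in corpus")
    return problems


benchmark = json.loads("""
[{"question_id": "q1", "question": "...", "verified_answer": "...",
  "gold_sources": [{"corpus_row_id": "doc_1"}]},
 {"question_id": "q2", "question": "...", "verified_answer": "...",
  "gold_sources": [{"corpus_row_id": "doc_404"}]}]
""")

for problem in check_benchmark(benchmark, corpus_ids={"doc_1", "doc_2"}):
    print(problem)  # q2: gold source 'doc_404' not in corpus
```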

GPU note: the LLM and reranker stages require a CUDA GPU. On CPU-only machines, retrieval-only configs still produce metrics; LLM configs degrade gracefully.

Repository Structure

DeAI/
├── configs/                  # OmegaConf YAML configs for each training stage
│   ├── embedding_ft.yaml
│   ├── reranker_ft.yaml
│   └── llm_ft.yaml
├── data/
│   ├── instructor/           # Provided benchmark (corpus + train + gold)
│   └── gold_qa/              # Internal evaluation sets
├── docs/                     # All user-facing documentation
│   ├── cli_usage.md          # CLI reference
│   ├── evaluation_rubric.md  # Scenario 1/2/3 formulae and metrics
│   ├── training.md           # Embedding + reranker fine-tuning guide
│   ├── qlora_training.md     # LLM QLoRA workflow + memory estimates
│   └── kaggle_setup.md       # Reproducing the training runs on Kaggle
├── results/
│   └── ablation/             # All evaluation outputs (JSON + Markdown)
├── scripts/
│   ├── deai_cli.py           # Unified CLI (index/query/eval/demo)
│   ├── demo.py               # Gradio interactive demo
│   └── run_baseline.py       # Reference baseline run
├── src/
│   ├── data/                 # Loaders, chunking, Turkish-aware cleaning
│   ├── retrieval/            # FAISS + BM25 hybrid index
│   ├── training/             # Trainers for embedder/reranker/LLM
│   ├── evaluation/           # Metrics, LLM-as-judge, rubric, ablation runner
│   └── pipeline.py           # End-to-end RAG composition
└── tests/                    # 88 passing tests

Datasets

The project is built on the ORICON Turkish legal corpus, a curated dataset package provided by a third-party data provider. It bundles a pre-chunked document collection together with matching training and evaluation splits for every stage of the RAG pipeline.

The files live under data/instructor/ and are excluded from version control (see .gitignore); a regenerated statistics summary is checked in at data/instructor/instructor_data_report.md.

File	Records	Purpose
`corpus.jsonl`	~7,579 chunks	Document collection (the indexed corpus)
`embedding.jsonl`	~2,059 triples	Query / positive / hard-negative triples for contrastive fine-tuning of the embedder
`reranker.jsonl`	~6,752 triples	Query / candidate / label data for cross-encoder reranker fine-tuning
`llm.jsonl`	~13,758 examples	Chat-format supervised fine-tuning data for the generator LLM
`gold_benchmark.json`	~240 questions	Manually verified Q-A-relevant_docs gold benchmark
`rag_eval.json`	~1,000 queries	Broader retrieval evaluation set with gold chunk ids and citation labels

Run python scripts/explore_instructor_data.py to regenerate the statistics report and verify that the files load with the expected schema.

Custom corpus and benchmark inputs are supported via scripts/deai_cli.py — the pipeline can be pointed at any user-provided dataset (JSONL, JSON, CSV, or plain text). See docs/cli_usage.md for the full guide.

Training

Each component is trained independently. Hyperparameters and detailed walkthroughs:

Embedder (bge-m3, MNRL): docs/training.md — 17 min on T4
Reranker (jina-v2, BCE): docs/training.md — 8 min on T4
LLM (Qwen3-4B, QLoRA): docs/qlora_training.md — 99 min on T4
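
The 33M trainable-parameter figure quoted for the QLoRA adapter (r=16 on all attention and MLP projections) can be sanity-checked with a short calculation. A LoRA adapter on a linear layer with `in` inputs and `out` outputs adds `r * (in + out)` parameters. The layer dimensions below are assumptions taken from the published Qwen3-4B configuration, not from this repository, and should be verified against the model's `config.json`.

```python
RANK = 16

# Assumed Qwen3-4B dimensions (verify against the model's config.json).
NUM_LAYERS = 36
HIDDEN = 2560
Q_OUT = 32 * 128    # 32 query heads x head_dim 128
KV_OUT = 8 * 128    # 8 key/value heads (grouped-query attention)
INTERMEDIATE = 9728

# (in_features, out_features) of each adapted projection in one decoder layer.
MODULES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, r: int = RANK) -> int:
    """LoRA adds A (r x in) and B (out x r) next to the frozen weight."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o) for i, o in MODULES.values())
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} per layer, {total:,} total")  # 917,504 per layer, 33,030,144 total
```

Under these assumptions the count comes to roughly 33.0M, consistent with the figure reported above.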

To reproduce on Kaggle, follow docs/kaggle_setup.md.

Evaluation

The instructor's rubric defines three scenarios depending on available gold data; with our (question, answer, relevant-docs) gold benchmark we score under Scenario 1:

F = 0.35·R + 0.4·(A·R) + 0.25·(G·R)

where R is the retrieval score, (Recall@10 + MRR@10) / 2; A is the token-level F1 against the verified gold answer; and G is the LLM-as-judge faithfulness score, normalized to [0, 1]. A and G are both multiplied by R, so a confident answer built on poorly retrieved evidence earns little credit.
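
The pieces of the formula can be sketched in a few lines of Python. This is an illustration of the definitions above, not the repository's evaluation code: whitespace tokenization with lowercasing for the F1, and scoring from already-aggregated R, A, and G values, are assumptions. The rubric document is the authority on tokenization and on whether F is averaged per question.

```python
from collections import Counter
from typing import Collection, Sequence


def recall_at_k(ranked_ids: Sequence[str], gold_ids: Collection[str], k: int = 10) -> float:
    """Fraction of gold documents found in the top k (gold_ids must be non-empty)."""
    return len(set(ranked_ids[:k]) & set(gold_ids)) / len(set(gold_ids))


def mrr_at_k(ranked_ids: Sequence[str], gold_ids: Collection[str], k: int = 10) -> float:
    """Reciprocal rank of the first gold document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in gold_ids:
            return 1.0 / rank
    return 0.0


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 over whitespace tokens (assumed tokenization)."""
    pred, ref = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def scenario1(r: float, a: float, g: float) -> float:
    """F = 0.35*R + 0.4*(A*R) + 0.25*(G*R)."""
    return 0.35 * r + 0.40 * (a * r) + 0.25 * (g * r)


ranked, gold = ["d3", "d1", "d7"], {"d1", "d9"}
r = (recall_at_k(ranked, gold) + mrr_at_k(ranked, gold)) / 2
print(r)                                    # 0.5
print(round(scenario1(0.8, 0.5, 0.75), 4))  # 0.59
print(scenario1(0.0, 1.0, 1.0))             # 0.0: perfect A and G earn nothing when R is 0
```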

Full ablation across 5 cumulative configurations (baseline → +embed → +rerank → +base LLM → +QLoRA LLM):

→ results/ablation/ablation_report_final.md

Methodology details, rubric mathematics, and judge-model rationale: docs/evaluation_rubric.md.

Reproducibility

Hardware: NVIDIA Tesla T4 (15.6 GB VRAM), single GPU, Kaggle free tier
Software: PyTorch 2.10.0, Transformers 4.57.3, Unsloth 2025.12.10, sentence-transformers, bitsandbytes (4-bit NF4 throughout)
Random seed: 42 (data shuffle, train split, evaluation subset)
Total compute: ~2 hours training + ~1 hour evaluation
Frozen artifacts: trained adapter weights are available on request

All training was run on Kaggle public notebooks; the configurations and step-by-step setup are described in docs/kaggle_setup.md.

Documentation

Document	Purpose
docs/cli_usage.md	CLI reference: index, query, eval, demo
docs/evaluation_rubric.md	Rubric formulae, metrics, judge-model choice
docs/training.md	Embedder + reranker fine-tuning guide
docs/qlora_training.md	LLM QLoRA workflow + memory notes
docs/kaggle_setup.md	Reproducing the runs on Kaggle
results/ablation/ablation_report_final.md	Headline ablation results

Citation

If you reference this work, please cite:

@misc{karacay2026deai,
  author = {Karaçay, Anıl},
  title  = {DeAI: Decision AI for Turkish Legal Question Answering},
  year   = {2026},
  url    = {https://github.com/anilkaracay/DeAI}
}

License

Released under an Academic Use license. The corpus is provided by a third-party data provider and is subject to their terms; do not redistribute the corpus separately. Code and trained adapter weights may be used for academic and research purposes with attribution. Base model weights remain under their original licenses (Qwen3-4B-Instruct: Apache 2.0; bge-m3: MIT; jina-reranker-v2-base-multilingual: CC BY-NC 4.0).

Acknowledgments

Dataset providers: For the gold benchmark, training data, and the rubric design that drove the ablation methodology.
Base models: BAAI (bge-m3), Jina AI (jina-reranker-v2), Alibaba Qwen Team (Qwen3-4B-Instruct).
Training infrastructure: Unsloth (memory-efficient QLoRA), sentence-transformers, Hugging Face Transformers, Kaggle (free T4 access).
