An end-to-end retrieval-augmented generation (RAG) system for Turkish legal QA, with domain-adapted dense embeddings, cross-encoder reranking, and QLoRA-tuned generation.
π― Ablation Results Β· π οΈ CLI Usage Β· π Evaluation Rubric
Three independently fine-tuned components evaluated under the rubric's
Scenario 1 formula F = 0.35Β·R + 0.4Β·(AΒ·R) + 0.25Β·(GΒ·R) on a
240-question instructor gold benchmark:
| Metric | Baseline | Full Pipeline | Ξ |
|---|---|---|---|
| Retrieval R = (Recall@10 + MRR@10) / 2 | 0.7352 | 0.8483 | +11.3 pts |
| Reranker held-out F1 | 0.331 | 0.911 | +57.9 pts |
| Answer F1 (vs. gold) | 0.406 | 0.452 | +11.3% relative |
| Citation compliance | 68% | 78% | +10 pp |
| Scenario 1 Final F | 0.6078 | 0.6130 | +0.9% |
Question
β
βΌ
βββββββββββββββββββ
β Embedding Layer β β BAAI/bge-m3 (fine-tuned)
β (Hybrid Search) β Dense + Sparse + ColBERT
ββββββββββ¬βββββββββ
β top-k passages
βΌ
βββββββββββββββββββ
β Reranker β β Jina Reranker v2 (fine-tuned)
β (Cross-Encoder) β Token-level query-doc interaction
ββββββββββ¬βββββββββ
β top-n reranked
βΌ
βββββββββββββββββββ
β LLM Generator β β Qwen3-4B (QLoRA fine-tuned)
β (RAG Prompting) β Citation-aware answer generation
ββββββββββ¬βββββββββ
β
βΌ
Grounded Answer
with [1][2] citations
The query is encoded by bge-m3, which produces both dense and sparse
representations for hybrid retrieval. FAISS then returns the top-k
candidate chunks ranked by inner product over L2-normalized vectors
(equivalent to cosine similarity). A cross-encoder reranker re-scores the
candidates using full token-level attention between the query and each
passage, which is more accurate than the bi-encoder dot product but too
expensive to apply at corpus scale. Finally, the LLM generates an answer
grounded in the reranked passages and emits inline citation markers
(e.g., [1][2]) that point back to the source chunks.
Three stages, three independently fine-tuned components:
- Dense retrieval β
bge-m3fine-tuned with hard-negative MNRL on 1,853 instructor-supplied (query, positive, hard-negative) triples. - Cross-encoder reranking β
jina-reranker-v2-base-multilingualfine-tuned with binary cross-entropy on 6,076 (query, candidate, relevance) triples. - Grounded generation β
Qwen3-4B-Instruct-2507adapted via QLoRA (r=16, all attention + MLP modules, 33M trainable params) on 4,900 supervised QA examples withKaynak:citation format.
All training fits on a single Tesla T4 (15.6 GB VRAM); total training time ~2 hours.
# Install
pip install -r requirements.txt
# Index a corpus (use instructor data or your own JSONL)
python scripts/deai_cli.py index --corpus data/instructor/corpus.jsonl \
--output-dir data/processed/faiss_index
# Ask a question
python scripts/deai_cli.py query "KΔ±dem tazminatΔ± hakkΔ± ne zaman doΔar?"
# Run the full ablation evaluation
python scripts/deai_cli.py eval --benchmark data/instructor/gold_benchmark.json
# Launch the interactive Gradio demo
python scripts/demo.pyThe CLI accepts custom corpora and custom benchmarks (rubric requirements 3 and 4) β see docs/cli_usage.md.
The system loads all three fine-tuned models from the Hugging Face Hub (no local weights needed):
- Embedder:
anilkaracay/bge-m3-legal-tr - Reranker:
anilkaracay/jina-reranker-legal-tr - LLM adapter:
anilkaracay/qwen3-4b-legal-tr-qlora(onQwen/Qwen3-4B-Instruct-2507)
pip install -r requirements.txt
pip install -r requirements-inference.txt # torch, peft, accelerate (GPU recommended)Corpus format (JSONL, one document per line):
{"id": "doc_1", "text": "...", "title": "...", "metadata": {}}python scripts/deai_cli.py index \
--corpus your_corpus.jsonl \
--output-dir your_index \
--embedding-model anilkaracay/bge-m3-legal-trpython scripts/deai_cli.py query "your question" \
--index-dir your_index --generateBenchmark format (JSON array). Supported schema:
[{"question_id": "q1", "question": "...", "verified_answer": "...",
"gold_sources": [{"corpus_row_id": "doc_1"}]}]python scripts/deai_cli.py eval \
--benchmark your_benchmark.json \
--index-dir your_index \
--output-dir eval_results \
--top-k 5 --use-judgeThis runs the 5-config ablation and writes a JSON + Markdown report with Scenario 1 scores (R, A, G, final) per config.
GPU note: the LLM and reranker stages require a CUDA GPU. On CPU-only machines, retrieval-only configs still produce metrics; LLM configs degrade gracefully.
DeAI/
βββ configs/ # OmegaConf YAML configs for each training stage
β βββ embedding_ft.yaml
β βββ reranker_ft.yaml
β βββ llm_ft.yaml
βββ data/
β βββ instructor/ # Provided benchmark (corpus + train + gold)
β βββ gold_qa/ # Internal evaluation sets
βββ docs/ # All user-facing documentation
β βββ cli_usage.md # CLI reference
β βββ evaluation_rubric.md # Scenario 1/2/3 formulae and metrics
β βββ training.md # Embedding + reranker fine-tuning guide
β βββ qlora_training.md # LLM QLoRA workflow + memory estimates
β βββ kaggle_setup.md # Reproducing the training runs on Kaggle
βββ results/
β βββ ablation/ # All evaluation outputs (JSON + Markdown)
βββ scripts/
β βββ deai_cli.py # Unified CLI (index/query/eval/demo)
β βββ demo.py # Gradio interactive demo
β βββ run_baseline.py # Reference baseline run
βββ src/
β βββ data/ # Loaders, chunking, Turkish-aware cleaning
β βββ retrieval/ # FAISS + BM25 hybrid index
β βββ training/ # Trainers for embedder/reranker/LLM
β βββ evaluation/ # Metrics, LLM-as-judge, rubric, ablation runner
β βββ pipeline.py # End-to-end RAG composition
βββ tests/ # 88 passing tests
The project is built on the ORICON Turkish legal corpus, a curated dataset package provided by a third-party data provider. It bundles a pre-chunked document collection together with matching training and evaluation splits for every stage of the RAG pipeline.
The files live under data/instructor/ and are excluded from version
control (see .gitignore); a regenerated statistics summary is checked
in at data/instructor/instructor_data_report.md.
| File | Records | Purpose |
|---|---|---|
corpus.jsonl |
~7,579 chunks | Document collection (the indexed corpus) |
embedding.jsonl |
~2,059 pairs | Query / positive / hard-negative triples for embedding contrastive fine-tuning |
reranker.jsonl |
~6,752 triples | Query / candidate / label data for cross-encoder reranker fine-tuning |
llm.jsonl |
~13,758 examples | Chat-format supervised fine-tuning data for the generator LLM |
gold_benchmark.json |
~240 questions | Manually verified Q-A-relevant_docs gold benchmark |
rag_eval.json |
~1,000 queries | Broader retrieval evaluation set with gold chunk ids and citation labels |
Run python scripts/explore_instructor_data.py to regenerate the
statistics report and verify that the files load with the expected schema.
Custom corpus and benchmark inputs are supported via
scripts/deai_cli.py β the pipeline can be
pointed at any user-provided dataset (JSONL, JSON, CSV, or plain text).
See docs/cli_usage.md for the full guide.
Each component is trained independently. Hyperparameters and detailed walkthroughs:
- Embedder (
bge-m3, MNRL): docs/training.md β 17 min on T4 - Reranker (
jina-v2, BCE): docs/training.md β 8 min on T4 - LLM (
Qwen3-4B, QLoRA): docs/qlora_training.md β 99 min on T4
To reproduce on Kaggle, follow docs/kaggle_setup.md.
The instructor's rubric defines three scenarios depending on available gold data; with our (question, answer, relevant-docs) gold benchmark we score under Scenario 1:
F = 0.35Β·R + 0.4Β·(AΒ·R) + 0.25Β·(GΒ·R)
where R = retrieval (Recall@10 + MRR@10) / 2, A = token-level F1 against the verified gold answer, G = LLM-as-judge faithfulness (normalized to [0, 1]). A and G are gated by R so that confident wrong answers over wrong evidence are penalized.
Full ablation across 5 cumulative configurations (baseline β +embed β +rerank β +base LLM β +QLoRA LLM):
β results/ablation/ablation_report_final.md
Methodology details, rubric mathematics, and judge-model rationale: docs/evaluation_rubric.md.
- Hardware: NVIDIA Tesla T4 (15.6 GB VRAM), single GPU, Kaggle free tier
- Software: PyTorch 2.10.0, Transformers 4.57.3, Unsloth 2025.12.10, sentence-transformers, bitsandbytes (4-bit NF4 throughout)
- Random seed: 42 (data shuffle, train split, evaluation subset)
- Total compute: ~2 hours training + ~1 hour evaluation
- Frozen artifacts: trained adapter weights are available on request
All training was run on Kaggle public notebooks; configurations and step-by-step setup at docs/kaggle_setup.md.
| Document | Purpose |
|---|---|
| docs/cli_usage.md | CLI reference: index, query, eval, demo |
| docs/evaluation_rubric.md | Rubric formulae, metrics, judge-model choice |
| docs/training.md | Embedder + reranker fine-tuning guide |
| docs/qlora_training.md | LLM QLoRA workflow + memory notes |
| docs/kaggle_setup.md | Reproducing the runs on Kaggle |
| results/ablation/ablation_report_final.md | Headline ablation results |
If you reference this work, please cite:
@misc{karacay2026deai,
author = {KaraΓ§ay, AnΔ±l},
title = {DeAI: Decision AI for Turkish Legal Question Answering},
year = {2026},
url = {https://github.com/anilkaracay/DeAI}
}Released under an Academic Use license. The corpus is provided by a third-party data provider and is subject to their terms; do not redistribute the corpus separately. Code and trained adapter weights may be used for academic and research purposes with attribution. Base model weights remain under their original licenses (Qwen3-4B-Instruct: Apache 2.0; bge-m3: MIT; jina-reranker-v2: Apache 2.0).
- Dataset providers: For the gold benchmark, training data, and the rubric design that drove the ablation methodology.
- Base models: BAAI (bge-m3), Jina AI (jina-reranker-v2), Alibaba Qwen Team (Qwen3-4B-Instruct).
- Training infrastructure: Unsloth (memory-efficient QLoRA), sentence-transformers, Hugging Face Transformers, Kaggle (free T4 access).