
DeAI: Decision AI for Turkish Legal Question Answering

An end-to-end retrieval-augmented generation (RAG) system for Turkish legal QA, with domain-adapted dense embeddings, cross-encoder reranking, and QLoRA-tuned generation.


🎯 Ablation Results · 🛠️ CLI Usage · 📊 Evaluation Rubric

Highlights

The three independently fine-tuned components are evaluated under the rubric's Scenario 1 formula, F = 0.35·R + 0.4·(A·R) + 0.25·(G·R), on a 240-question instructor gold benchmark:

| Metric | Baseline | Full Pipeline | Δ |
| --- | --- | --- | --- |
| Retrieval R = (Recall@10 + MRR@10) / 2 | 0.7352 | 0.8483 | +11.3 pts |
| Reranker held-out F1 | 0.331 | 0.911 | +57.9 pts |
| Answer F1 (vs. gold) | 0.406 | 0.452 | +11.3% relative |
| Citation compliance | 68% | 78% | +10 pp |
| Scenario 1 Final F | 0.6078 | 0.6130 | +0.9% |

System Overview

Question
  │
  ▼
┌──────────────────┐
│  Embedding Layer │  ← BAAI/bge-m3 (fine-tuned)
│  (Hybrid Search) │     Dense + Sparse + ColBERT
└────────┬─────────┘
         │  top-k passages
         ▼
┌──────────────────┐
│    Reranker      │  ← Jina Reranker v2 (fine-tuned)
│  (Cross-Encoder) │     Token-level query-doc interaction
└────────┬─────────┘
         │  top-n reranked
         ▼
┌──────────────────┐
│   LLM Generator  │  ← Qwen3-4B (QLoRA fine-tuned)
│  (RAG Prompting) │     Citation-aware answer generation
└────────┬─────────┘
         │
         ▼
   Grounded Answer
   with [1][2] citations

The query is encoded by bge-m3, which produces both dense and sparse representations for hybrid retrieval. FAISS then returns the top-k candidate chunks ranked by inner product over L2-normalized vectors (equivalent to cosine similarity). A cross-encoder reranker re-scores the candidates using full token-level attention between the query and each passage, which is more accurate than the bi-encoder dot product but too expensive to apply at corpus scale. Finally, the LLM generates an answer grounded in the reranked passages and emits inline citation markers (e.g., [1][2]) that point back to the source chunks.
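The two-stage flow described above can be illustrated with a minimal, dependency-free sketch. It is not the repository's implementation: the function names, the toy vectors, and the lookup table standing in for the cross-encoder are all assumptions made for illustration. It shows why inner product over L2-normalized vectors equals cosine similarity, and how the expensive scorer is applied only to the top-k survivors of the cheap search.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve_then_rerank(query_vec, chunk_vecs, rerank_score, top_k=3, top_n=2):
    """Stage 1: cheap inner-product search over every chunk.
    Stage 2: call the expensive rerank_score(chunk_index) on the top_k only."""
    q = l2_normalize(query_vec)
    scored = [(inner_product(q, l2_normalize(v)), i) for i, v in enumerate(chunk_vecs)]
    candidates = [i for _, i in sorted(scored, reverse=True)[:top_k]]
    reranked = sorted(candidates, key=rerank_score, reverse=True)
    return reranked[:top_n]

# Toy data: four 2-d "chunk embeddings" and an unnormalized query vector.
chunks = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
query = [2.0, 0.0]

# Hypothetical cross-encoder scores, keyed by chunk index.
toy_cross_encoder = {0: 0.2, 1: 0.9, 2: 0.1, 3: 0.5}

result = retrieve_then_rerank(query, chunks, toy_cross_encoder.get)
print(result)
```

In this toy run the first stage keeps chunks 0, 1, and 3 and discards chunk 2, so the stand-in cross-encoder is called three times instead of four. At corpus scale the same pattern keeps the reranker's cost proportional to top-k rather than to the number of chunks.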

Three stages, three independently fine-tuned components:

  1. Dense retrieval: bge-m3 fine-tuned with hard-negative MNRL (MultipleNegativesRankingLoss) on 1,853 instructor-supplied (query, positive, hard-negative) triples.
  2. Cross-encoder reranking: jina-reranker-v2-base-multilingual fine-tuned with binary cross-entropy on 6,076 (query, candidate, relevance) triples.
  3. Grounded generation: Qwen3-4B-Instruct-2507 adapted via QLoRA (r=16, all attention + MLP modules, 33M trainable params) on 4,900 supervised QA examples that use the Kaynak: ("Source:") citation format.
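For readers unfamiliar with the objective in step 1, the following is a minimal sketch of a multiple-negatives ranking loss with hard negatives, written in plain Python. It is an illustration of the general technique, not the project's trainer: the function name `mnrl_loss` and the toy vectors are assumptions, and embeddings are assumed to be unit-normalized so that the dot product is a cosine similarity. The `scale=20.0` default mirrors the default used by sentence-transformers.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mnrl_loss(queries, positives, hard_negatives, scale=20.0):
    """Mean cross-entropy in which query i must pick positive i out of
    every positive in the batch plus every hard negative."""
    candidates = positives + hard_negatives
    total = 0.0
    for i, q in enumerate(queries):
        logits = [scale * dot(q, c) for c in candidates]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum_exp - logits[i]  # -log softmax probability of the true positive
    return total / len(queries)

# Toy batch of two unit-length queries.
queries = [[1.0, 0.0], [0.0, 1.0]]

# Well-separated case: each positive matches its query, negatives point away.
good = mnrl_loss(queries, [[1.0, 0.0], [0.0, 1.0]], [[-1.0, 0.0], [0.0, -1.0]])

# Confused case: positives are swapped and the hard negatives look like the queries.
bad = mnrl_loss(queries, [[0.0, 1.0], [1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])

print(f"good={good:.6f} bad={bad:.3f}")
```

The other positives in the batch act as free in-batch negatives, while the hard negatives add candidates that are deliberately close to the query, which is what makes the triples format useful.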

All training fits on a single Tesla T4 (15.6 GB VRAM); total training time ~2 hours.
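The 33M trainable-parameter figure quoted for the QLoRA adapter can be sanity-checked with a short calculation. The model dimensions below are assumptions taken from the publicly available Qwen3-4B configuration, and the seven module names are the usual attention and MLP projections; neither is stated in this README, so verify them against the actual config before relying on the result.

```python
# Assumed Qwen3-4B dimensions (not stated in this README; check the model config).
HIDDEN = 2560          # hidden size
N_LAYERS = 36          # transformer blocks
Q_DIM = 32 * 128       # 32 query heads x head_dim 128
KV_DIM = 8 * 128       # 8 key/value heads x head_dim 128
MLP_DIM = 9728         # feed-forward intermediate size
RANK = 16              # LoRA rank r, as stated above

# (in_features, out_features) of each adapted linear layer in one block.
modules = {
    "q_proj": (HIDDEN, Q_DIM),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (Q_DIM, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}

def lora_params(d_in, d_out, rank):
    """A LoRA adapter adds two matrices: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in modules.values())
total = per_layer * N_LAYERS
print(f"{per_layer:,} per layer, {total:,} total (~{total / 1e6:.0f}M)")
```

Under these assumed dimensions the adapter comes to roughly 33 million parameters, which is consistent with the number reported in the component list.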

Quick Start

# Install
pip install -r requirements.txt

# Index a corpus (use instructor data or your own JSONL)
python scripts/deai_cli.py index --corpus data/instructor/corpus.jsonl \
  --output-dir data/processed/faiss_index

# Ask a question (the Turkish example means "When does the right to severance pay arise?")
python scripts/deai_cli.py query "Kıdem tazminatı hakkı ne zaman doğar?"

# Run the full ablation evaluation
python scripts/deai_cli.py eval --benchmark data/instructor/gold_benchmark.json

# Launch the interactive Gradio demo
python scripts/demo.py

The CLI accepts custom corpora and custom benchmarks (rubric requirements 3 and 4); see docs/cli_usage.md.

Running with Your Own Data

The system loads all three fine-tuned models from the Hugging Face Hub, so no local weights are needed.

Install

pip install -r requirements.txt
pip install -r requirements-inference.txt   # torch, peft, accelerate (GPU recommended)

1. Index your own corpus

Corpus format (JSONL, one document per line):

{"id": "doc_1", "text": "...", "title": "...", "metadata": {}}

python scripts/deai_cli.py index \
  --corpus your_corpus.jsonl \
  --output-dir your_index \
  --embedding-model anilkaracay/bge-m3-legal-tr
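Before indexing a custom corpus, it can help to check that every line parses and carries the expected fields. The helper below is a hypothetical pre-flight check, not part of the DeAI CLI. It assumes that `id` and `text` are required non-empty strings and that `title` and `metadata` are optional, which the README does not state explicitly.

```python
import json

# Assumption: only these two fields are mandatory; title/metadata are optional.
REQUIRED_FIELDS = ("id", "text")

def validate_corpus_lines(lines):
    """Parse JSONL lines and return the documents, raising on malformed records."""
    docs, seen_ids = [], set()
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        doc = json.loads(line)
        for field in REQUIRED_FIELDS:
            if not isinstance(doc.get(field), str) or not doc[field]:
                raise ValueError(f"line {lineno}: missing or empty '{field}'")
        if doc["id"] in seen_ids:
            raise ValueError(f"line {lineno}: duplicate id '{doc['id']}'")
        seen_ids.add(doc["id"])
        docs.append(doc)
    return docs

# In-memory sample; for a real file, pass the open file handle instead.
sample = [
    json.dumps({"id": "doc_1", "text": "First chunk.", "title": "A", "metadata": {}}),
    "",
    json.dumps({"id": "doc_2", "text": "Second chunk.", "title": "B", "metadata": {}}),
]
docs = validate_corpus_lines(sample)
print(f"{len(docs)} documents OK")
```

For a corpus on disk, `validate_corpus_lines(open("your_corpus.jsonl", encoding="utf-8"))` performs the same check line by line.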

2. Query (with full generation)

python scripts/deai_cli.py query "your question" \
  --index-dir your_index --generate

3. Evaluate on your own benchmark

Benchmark format (JSON array). Supported schema:

[{"question_id": "q1", "question": "...", "verified_answer": "...",
  "gold_sources": [{"corpus_row_id": "doc_1"}]}]

python scripts/deai_cli.py eval \
  --benchmark your_benchmark.json \
  --index-dir your_index \
  --output-dir eval_results \
  --top-k 5 --use-judge

This runs the 5-config ablation and writes a JSON + Markdown report with Scenario 1 scores (R, A, G, final) per config.
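Retrieval metrics are only meaningful if every gold source in the benchmark points at a document that exists in the indexed corpus. The following is a hypothetical consistency check, not part of the DeAI CLI. It uses the four keys from the schema shown above and assumes `corpus_row_id` values match the corpus `id` field, as the example suggests.

```python
EXPECTED_KEYS = ("question_id", "question", "verified_answer", "gold_sources")

def check_benchmark(benchmark, corpus_ids):
    """Return the question_ids whose gold sources are empty or reference
    ids that are missing from the corpus."""
    problems = []
    for item in benchmark:
        for key in EXPECTED_KEYS:
            if key not in item:
                raise KeyError(f"benchmark item is missing '{key}': {item}")
        gold = {src["corpus_row_id"] for src in item["gold_sources"]}
        if not gold or not gold <= corpus_ids:
            problems.append(item["question_id"])
    return problems

# Toy benchmark: q2 points at a document that was never indexed.
benchmark = [
    {"question_id": "q1", "question": "...", "verified_answer": "...",
     "gold_sources": [{"corpus_row_id": "doc_1"}]},
    {"question_id": "q2", "question": "...", "verified_answer": "...",
     "gold_sources": [{"corpus_row_id": "doc_404"}]},
]
corpus_ids = {"doc_1", "doc_2"}

print(check_benchmark(benchmark, corpus_ids))
```

A question flagged here can never be retrieved correctly, so it would lower Recall and MRR for reasons unrelated to model quality.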

GPU note: the LLM and reranker stages require a CUDA GPU. On CPU-only machines, retrieval-only configs still produce metrics; LLM configs degrade gracefully.

Repository Structure

DeAI/
├── configs/                  # OmegaConf YAML configs for each training stage
│   ├── embedding_ft.yaml
│   ├── reranker_ft.yaml
│   └── llm_ft.yaml
├── data/
│   ├── instructor/           # Provided benchmark (corpus + train + gold)
│   └── gold_qa/              # Internal evaluation sets
├── docs/                     # All user-facing documentation
│   ├── cli_usage.md          # CLI reference
│   ├── evaluation_rubric.md  # Scenario 1/2/3 formulae and metrics
│   ├── training.md           # Embedding + reranker fine-tuning guide
│   ├── qlora_training.md     # LLM QLoRA workflow + memory estimates
│   └── kaggle_setup.md       # Reproducing the training runs on Kaggle
├── results/
│   └── ablation/             # All evaluation outputs (JSON + Markdown)
├── scripts/
│   ├── deai_cli.py           # Unified CLI (index/query/eval/demo)
│   ├── demo.py               # Gradio interactive demo
│   └── run_baseline.py       # Reference baseline run
├── src/
│   ├── data/                 # Loaders, chunking, Turkish-aware cleaning
│   ├── retrieval/            # FAISS + BM25 hybrid index
│   ├── training/             # Trainers for embedder/reranker/LLM
│   ├── evaluation/           # Metrics, LLM-as-judge, rubric, ablation runner
│   └── pipeline.py           # End-to-end RAG composition
└── tests/                    # 88 passing tests

Datasets

The project is built on the ORICON Turkish legal corpus, a curated dataset package provided by a third-party data provider. It bundles a pre-chunked document collection together with matching training and evaluation splits for every stage of the RAG pipeline.

The files live under data/instructor/ and are excluded from version control (see .gitignore); a regenerated statistics summary is checked in at data/instructor/instructor_data_report.md.

| File | Records | Purpose |
| --- | --- | --- |
| corpus.jsonl | ~7,579 chunks | Document collection (the indexed corpus) |
| embedding.jsonl | ~2,059 pairs | Query / positive / hard-negative triples for embedding contrastive fine-tuning |
| reranker.jsonl | ~6,752 triples | Query / candidate / label data for cross-encoder reranker fine-tuning |
| llm.jsonl | ~13,758 examples | Chat-format supervised fine-tuning data for the generator LLM |
| gold_benchmark.json | ~240 questions | Manually verified Q-A-relevant_docs gold benchmark |
| rag_eval.json | ~1,000 queries | Broader retrieval evaluation set with gold chunk ids and citation labels |

Run python scripts/explore_instructor_data.py to regenerate the statistics report and verify that the files load with the expected schema.

Custom corpus and benchmark inputs are supported via scripts/deai_cli.py: the pipeline can be pointed at any user-provided dataset (JSONL, JSON, CSV, or plain text). See docs/cli_usage.md for the full guide.

Training

Each component is trained independently. Hyperparameters and detailed walkthroughs are in docs/training.md (embedder and reranker) and docs/qlora_training.md (LLM QLoRA).

To reproduce on Kaggle, follow docs/kaggle_setup.md.

Evaluation

The instructor's rubric defines three scenarios depending on available gold data; with our (question, answer, relevant-docs) gold benchmark we score under Scenario 1:

F = 0.35·R + 0.4·(A·R) + 0.25·(G·R)

where R = (Recall@10 + MRR@10) / 2 is the retrieval score, A is the token-level F1 against the verified gold answer, and G is the LLM-as-judge faithfulness score (normalized to [0, 1]). A and G are multiplied by R, so a confident answer built on wrong evidence is penalized.
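As a worked illustration of the Scenario 1 formula, the sketch below computes each term for a single toy question. It is not the repository's evaluation code: the function names are hypothetical, Recall@k is assumed to be the fraction of gold documents found in the top k, and token F1 is assumed to use lowercase whitespace tokenization, which may differ from the project's tokenizer.

```python
from collections import Counter

def recall_at_k(ranked_ids, gold_ids, k=10):
    """Fraction of gold documents that appear in the top k results."""
    return len(set(ranked_ids[:k]) & set(gold_ids)) / len(gold_ids)

def mrr_at_k(ranked_ids, gold_ids, k=10):
    """Reciprocal rank of the first gold document within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in gold_ids:
            return 1.0 / rank
    return 0.0

def token_f1(prediction, reference):
    """Token-level F1 over lowercase whitespace tokens (an assumed tokenization)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def scenario1_score(r, a, g):
    """F = 0.35*R + 0.4*(A*R) + 0.25*(G*R), with all inputs in [0, 1]."""
    return 0.35 * r + 0.4 * (a * r) + 0.25 * (g * r)

# Toy example: one of two gold documents is retrieved, at rank 2.
ranked = ["d3", "d1", "d7"]
gold = {"d1", "d9"}
r = (recall_at_k(ranked, gold) + mrr_at_k(ranked, gold)) / 2

a = token_f1("the notice period is two weeks", "notice period is two weeks")
g = 0.8  # illustrative judge score, already normalized to [0, 1]

print(f"R={r:.3f} A={a:.3f} G={g:.3f} F={scenario1_score(r, a, g):.4f}")
```

Because every term contains R, the formula can be rewritten as F = R·(0.35 + 0.4·A + 0.25·G). A perfect answer therefore cannot score above R, and a retrieval miss (R = 0) zeroes the whole score.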

Full ablation across 5 cumulative configurations (baseline → +embed → +rerank → +base LLM → +QLoRA LLM):

→ results/ablation/ablation_report_final.md

Methodology details, rubric mathematics, and judge-model rationale: docs/evaluation_rubric.md.

Reproducibility

  • Hardware: NVIDIA Tesla T4 (15.6 GB VRAM), single GPU, Kaggle free tier
  • Software: PyTorch 2.10.0, Transformers 4.57.3, Unsloth 2025.12.10, sentence-transformers, bitsandbytes (4-bit NF4 throughout)
  • Random seed: 42 (data shuffle, train split, evaluation subset)
  • Total compute: ~2 hours training + ~1 hour evaluation
  • Frozen artifacts: trained adapter weights are available on request

All training was run on Kaggle public notebooks; configurations and step-by-step setup at docs/kaggle_setup.md.

Documentation

| Document | Purpose |
| --- | --- |
| docs/cli_usage.md | CLI reference: index, query, eval, demo |
| docs/evaluation_rubric.md | Rubric formulae, metrics, judge-model choice |
| docs/training.md | Embedder + reranker fine-tuning guide |
| docs/qlora_training.md | LLM QLoRA workflow + memory notes |
| docs/kaggle_setup.md | Reproducing the runs on Kaggle |
| results/ablation/ablation_report_final.md | Headline ablation results |

Citation

If you reference this work, please cite:

@misc{karacay2026deai,
  author = {Karaçay, Anıl},
  title  = {DeAI: Decision AI for Turkish Legal Question Answering},
  year   = {2026},
  url    = {https://github.com/anilkaracay/DeAI}
}

License

Released under an Academic Use license. The corpus is provided by a third-party data provider and is subject to their terms; do not redistribute the corpus separately. Code and trained adapter weights may be used for academic and research purposes with attribution. Base model weights remain under their original licenses (Qwen3-4B-Instruct: Apache 2.0; bge-m3: MIT; jina-reranker-v2-base-multilingual: CC BY-NC 4.0).

Acknowledgments

  • Dataset providers: For the gold benchmark, training data, and the rubric design that drove the ablation methodology.
  • Base models: BAAI (bge-m3), Jina AI (jina-reranker-v2), Alibaba Qwen Team (Qwen3-4B-Instruct).
  • Training infrastructure: Unsloth (memory-efficient QLoRA), sentence-transformers, Hugging Face Transformers, Kaggle (free T4 access).
