doc-reader

A Django-based RAG document Q&A system for large text corpora.

This project lets you upload long documents, index them into chunks, retrieve relevant context with vector search, and generate answers grounded in those retrieved sections. It includes a web UI, REST API, CLI, and an experimental semantic coherence layer that checks whether retrieved context and generated answers stay meaningfully aligned.

What it does

- Ingests PDF, DOCX, TXT, and Markdown documents
- Chunks and indexes large documents for retrieval
- Answers questions over indexed content through:
  - a Django web interface
  - REST API endpoints
  - a CLI
- Supports conversational querying
- Tracks semantic coherence across retrieval and generation
- Includes an experimental Azure-based pipeline alongside the standard local/OpenAI flow

Why this project exists

This repo was built around a practical long-document retrieval problem: asking useful questions over very large documents, including book-length text. The focus is less on “chat with a PDF” and more on building a system that handles long inputs and makes retrieval-quality problems and uncertainty explicit.

Core features

Document ingestion

Supports PDF, DOCX, TXT, and Markdown
Configurable chunk size and chunk overlap
Designed to handle very large documents, including book-length text
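
As a rough illustration of how chunk size and chunk overlap interact, here is a minimal character-based sketch. It is not the project's splitter: the function name `chunk_text` is hypothetical, and measuring chunks in characters (rather than tokens) is an assumption. The defaults mirror the `CHUNK_SIZE` and `CHUNK_OVERLAP` values listed under Configuration.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap (illustrative only)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With these defaults, each chunk begins with the last 200 characters of the previous one, so a passage shorter than the overlap that straddles a chunk boundary still appears whole in the following chunk.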

Query interfaces

Django web app
REST API
Command-line interface
Conversational mode for follow-up questions

Retrieval

FAISS-based vector retrieval
Configurable top-k retrieval
Optional local embeddings via sentence-transformers
Optional OpenAI embeddings
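
To make top-k retrieval concrete, the sketch below ranks chunk embeddings against a query embedding by cosine similarity. It is a brute-force stand-in written in plain Python, not FAISS and not the project's code: the names `cosine` and `top_k` are hypothetical, and the choice of cosine similarity as the metric is an assumption. The default `k=5` mirrors `TOP_K_RESULTS` under Configuration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, chunk_vecs, k=5):
    """Return (chunk_index, score) pairs for the k most similar chunks, best first."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A vector index such as FAISS serves the same purpose but avoids scanning every chunk in pure Python, which matters once a corpus reaches book length.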

Semantic coherence validation

After retrieval and answer generation, the system compares embeddings across:

query → retrieved chunks
retrieved chunks → generated answer
query → generated answer

If coherence drops, the system can:

increase retrieval depth
hedge the output language
flag low-confidence answers

This is intended to make failure modes more visible instead of silently returning overconfident answers.
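
The following sketch shows one way the three comparisons and the follow-up actions could fit together. Everything beyond the three pairs themselves is an assumption made for illustration: averaging the chunk embeddings into a centroid, using cosine similarity, taking the weakest of the three scores, and the way the two thresholds map to actions. The names `check_coherence`, `CoherenceReport`, and `boosted_k` are hypothetical. The default thresholds and multiplier mirror `COHERENCE_HIGH_THRESHOLD`, `COHERENCE_LOW_THRESHOLD`, and `BOOST_K_MULTIPLIER` under Configuration.

```python
import math
from dataclasses import dataclass


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one chunk vector")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


@dataclass
class CoherenceReport:
    query_to_chunks: float
    chunks_to_answer: float
    query_to_answer: float
    action: str  # "accept", "hedge", or "boost_and_flag" (hypothetical labels)


def check_coherence(query_vec, chunk_vecs, answer_vec, high=0.8, low=0.4):
    """Score the three pairs and pick an action from the weakest score (assumed policy)."""
    chunks = centroid(chunk_vecs)
    q_c = cosine(query_vec, chunks)
    c_a = cosine(chunks, answer_vec)
    q_a = cosine(query_vec, answer_vec)
    weakest = min(q_c, c_a, q_a)

    if weakest >= high:
        action = "accept"          # all three pairs agree strongly
    elif weakest >= low:
        action = "hedge"           # soften the answer's wording
    else:
        action = "boost_and_flag"  # retrieve more chunks and mark low confidence
    return CoherenceReport(q_c, c_a, q_a, action)


def boosted_k(k, multiplier=2.0):
    """Retrieval depth to use when coherence is low, rounded up to a whole number."""
    return math.ceil(k * multiplier)
```

Under these assumptions, a low score with the default top-k of 5 would trigger a second retrieval pass with 10 chunks.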

Architecture

Standard pipeline

1. Upload or add documents
2. Extract and chunk text
3. Generate embeddings
4. Store chunks in a vector index
5. Retrieve top-k chunks for a question
6. Generate an answer from retrieved context
7. Run semantic coherence validation on the result

Experimental Azure pipeline

The repo also includes an experimental Azure-native path using:

Azure OpenAI
Azure AI Search
Azure Document Intelligence
optional Azure Key Vault / Storage integration

This path is still experimental and should be treated as a separate integration track rather than the default setup.

Tech stack

Python
Django + Django REST Framework
LangChain
FAISS
OpenAI or local sentence-transformer embeddings
Optional Azure OpenAI / AI Search / Document Intelligence

Quick start

1. Clone the repo

git clone https://github.com/djleamen/doc-reader
cd doc-reader

2. Create a virtual environment

python -m venv venv
source venv/bin/activate
# Windows:
# venv\Scripts\activate

3. Install dependencies

pip install -r requirements.txt

4. Configure environment variables

cp .env.example .env

For the standard pipeline, set at least:

OPENAI_API_KEY=your_key_here

For local embeddings, enable:

USE_LOCAL_EMBEDDINGS=true

For the Azure pipeline, fill in the Azure-specific settings from .env.example.

5. Start the app

python main.py start

Then open:

http://localhost:8000

Usage

Web UI

Upload documents and ask questions through the browser.

API

curl -X POST "http://localhost:8000/api/upload-documents/" \
  -F "files=@document.pdf" \
  -F "index_name=default"

curl -X POST "http://localhost:8000/api/query/" \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the main topic?", "index_name": "default"}'
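
The query call above can also be made from Python using only the standard library. This is a sketch under assumptions: the helper names `build_query_request` and `ask` are hypothetical, and the response is assumed to be JSON since its exact shape is not documented here. The URL, method, header, and request body are taken from the curl example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def build_query_request(question, index_name="default", base_url=BASE_URL):
    """Build the same POST request as the curl example, without sending it."""
    payload = json.dumps({"question": question, "index_name": index_name})
    return urllib.request.Request(
        url=f"{base_url}/api/query/",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, index_name="default", timeout=60):
    """Send the query to a running server and decode the (assumed JSON) response."""
    request = build_query_request(question, index_name)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example, with the app running locally:
# print(ask("What is the main topic?"))
```

Separating request construction from sending keeps the payload easy to inspect or test without a live server.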

CLI

python main.py cli add document.pdf
python main.py cli query "What are the key findings?"
python main.py cli interactive --conversational

Configuration

Important settings include:

VECTOR_DB_TYPE=faiss
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
TOP_K_RESULTS=5
CHAT_MODEL=gpt-4-turbo-preview

ENABLE_COHERENCE_VALIDATION=True
COHERENCE_HIGH_THRESHOLD=0.8
COHERENCE_LOW_THRESHOLD=0.4
BOOST_K_MULTIPLIER=2.0
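
For reference, here is a minimal sketch of reading these settings from the environment with type conversion and basic sanity checks. It is not the project's settings module: `load_settings` and `env_bool` are hypothetical names, the fallback defaults simply repeat the values shown above, and the validation rules are assumptions. Boolean parsing is case-insensitive because this README shows both `true` and `True`.

```python
import os


def env_bool(name, default, env=os.environ):
    """Read a boolean flag, accepting true/True/1/yes/on (case-insensitive)."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env=os.environ):
    """Collect the settings listed above into a dict, falling back to the shown values."""
    settings = {
        "VECTOR_DB_TYPE": env.get("VECTOR_DB_TYPE", "faiss"),
        "CHUNK_SIZE": int(env.get("CHUNK_SIZE", "1000")),
        "CHUNK_OVERLAP": int(env.get("CHUNK_OVERLAP", "200")),
        "TOP_K_RESULTS": int(env.get("TOP_K_RESULTS", "5")),
        "CHAT_MODEL": env.get("CHAT_MODEL", "gpt-4-turbo-preview"),
        "ENABLE_COHERENCE_VALIDATION": env_bool("ENABLE_COHERENCE_VALIDATION", True, env),
        "COHERENCE_HIGH_THRESHOLD": float(env.get("COHERENCE_HIGH_THRESHOLD", "0.8")),
        "COHERENCE_LOW_THRESHOLD": float(env.get("COHERENCE_LOW_THRESHOLD", "0.4")),
        "BOOST_K_MULTIPLIER": float(env.get("BOOST_K_MULTIPLIER", "2.0")),
    }
    # Assumed sanity checks: overlap must leave room to advance, thresholds must be ordered.
    if settings["CHUNK_OVERLAP"] >= settings["CHUNK_SIZE"]:
        raise ValueError("CHUNK_OVERLAP must be smaller than CHUNK_SIZE")
    if settings["COHERENCE_LOW_THRESHOLD"] > settings["COHERENCE_HIGH_THRESHOLD"]:
        raise ValueError("COHERENCE_LOW_THRESHOLD must not exceed COHERENCE_HIGH_THRESHOLD")
    return settings
```

Passing a plain dict instead of `os.environ` makes it easy to try out a combination of values before writing it to `.env`.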

Project status

This is a working RAG application with multiple interfaces and an experimental retrieval-quality layer. The standard pipeline is the main path. The Azure pipeline is included as an experimental integration.

License

MIT

