This template provides a simple, local Retrieval-Augmented Generation (RAG) system for your Data Engineering documents. It ingests PDFs, Markdown, Text files, and Jupyter Notebooks into a local vector database (ChromaDB) for querying.
- Multi-format Support: Handles .pdf, .md, .txt, and .ipynb files.
- Local Embeddings: Uses sentence-transformers/all-MiniLM-L6-v2 (runs entirely on your CPU/GPU, no API keys required).
- Persistent Storage: Saves the vector database locally in ./chroma_db.
- Easy Querying: Simple CLI to ask questions against your document set.
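To make the multi-format support concrete, here is a minimal sketch of how supported files might be discovered and turned into text using only the standard library. It is not the template's actual ingest.py; the names `find_documents`, `notebook_to_text`, and `load_text` are hypothetical, and PDF parsing is deliberately left out because it needs a third-party library.

```python
import json
from pathlib import Path

# Extensions this README lists as supported.
SUPPORTED_EXTENSIONS = {".pdf", ".md", ".txt", ".ipynb"}


def find_documents(source_dir):
    """Return supported files under source_dir, sorted for stable ordering."""
    root = Path(source_dir)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )


def notebook_to_text(path):
    """Join the source of every markdown and code cell in a .ipynb file."""
    nb = json.loads(Path(path).read_text(encoding="utf-8"))
    parts = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") in ("markdown", "code"):
            src = cell.get("source", "")
            # Notebook JSON stores source as either a string or a list of lines.
            parts.append(src if isinstance(src, str) else "".join(src))
    return "\n\n".join(parts)


def load_text(path):
    """Read .md/.txt/.ipynb files; return None for formats not handled here."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix in (".md", ".txt"):
        return path.read_text(encoding="utf-8")
    if suffix == ".ipynb":
        return notebook_to_text(path)
    return None  # e.g. .pdf, which a real pipeline would hand to a PDF parser
```

Notebook output cells are ignored in this sketch, so only what the author wrote (prose and code) would reach the index.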
- Install Dependencies:
    pip install -r requirements.txt
- Prepare Your Documents: Place your documents in a folder (e.g., data/my_docs).
- Ingest Data: Run the ingestion script to process your documents and build the database.
    python ingest.py --source_dir /path/to/your/documents
- Query the Knowledge Base: Ask questions about your data.
    python query.py "What are the best practices for dbt macros?"
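The ingest-then-query flow above can be illustrated with a toy, dependency-free sketch. This is not how the template is implemented: the word-count `embed` function stands in for the all-MiniLM-L6-v2 model, and the hypothetical `ToyVectorStore` class stands in for ChromaDB. It only shows the idea of storing one vector per document and ranking documents by cosine similarity to the question.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in embedding: lowercase word counts (NOT the MiniLM model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """In-memory stand-in for a vector database (hypothetical, not ChromaDB)."""

    def __init__(self):
        self._items = []  # list of (doc_id, text, vector)

    def add(self, doc_id, text):
        # "Ingest": embed the text once and keep the vector alongside it.
        self._items.append((doc_id, text, embed(text)))

    def query(self, question, n_results=2):
        # "Query": embed the question and return the closest documents.
        q = embed(question)
        scored = [(cosine(q, vec), doc_id, text) for doc_id, text, vec in self._items]
        scored.sort(key=lambda item: item[0], reverse=True)
        return [(doc_id, text) for _, doc_id, text in scored[:n_results]]


if __name__ == "__main__":
    store = ToyVectorStore()
    store.add("dbt.md", "dbt macros should be small, documented, and tested.")
    store.add("airflow.md", "Airflow DAGs schedule batch pipelines.")
    print(store.query("What are the best practices for dbt macros?", n_results=1))
```

A real embedding model captures meaning rather than exact word overlap, so it can match a question to a passage that shares no words with it; the word-count stand-in cannot.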
See requirements.txt for the full list of dependencies.