A Contextual RAG Bot Framework
Contextual RAG (Retrieval-Augmented Generation) is an improved method, introduced by Anthropic, for enhancing AI models with external knowledge. According to Anthropic's post, it offers significant advantages over traditional RAG approaches:
- Contextual Embeddings: Instead of simply embedding chunks of text, Contextual RAG prepends chunk-specific explanatory context before embedding. This preserves important contextual information that would otherwise be lost (see the sketch after this overview).
- Contextual BM25: The same context is also applied to the BM25 index, improving lexical matching.
- Performance Improvements: Combining Contextual Embeddings and Contextual BM25 reduced the top-20-chunk retrieval failure rate by 49% (from 5.7% to 2.9%) compared to traditional RAG methods.
- Reranking: Adding a reranking step reduced the retrieval failure rate by 67% overall (from 5.7% to 1.9%).
- Preservation of Context: Unlike traditional RAG, which often loses context when splitting documents into chunks, Contextual RAG maintains important contextual information, leading to more accurate and relevant retrievals.
- Efficient Implementation: Anthropic's approach allows for cost-effective implementation using prompt caching, making it feasible for large-scale applications.
The article suggests that Contextual RAG is particularly effective for handling complex, multi-turn conversations and large contexts, where traditional RAG methods often struggle. This improvement in retrieval accuracy directly translates to better performance in downstream tasks, making it a significant advancement in the field of AI-powered information retrieval and generation.
Source: Anthropic - Introducing Contextual Retrieval
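To make the contextualization step concrete, here is a minimal sketch, assuming a local model served through the ollama Python client; the model name and prompt wording are illustrative rather than Anthropic's exact implementation, and this framework's own code may differ.

```python
# Minimal sketch: generate chunk-specific context with a local LLM and
# prepend it to the chunk before embedding / BM25 indexing.
# The model name and prompt wording are illustrative assumptions.
import ollama

CONTEXT_PROMPT = """<document>
{document}
</document>
Here is a chunk from that document:
<chunk>
{chunk}
</chunk>
Write a short, succinct context that situates this chunk within the overall
document, to improve search retrieval of the chunk. Answer with the context only."""

def contextualize_chunk(document: str, chunk: str, model: str = "llama3") -> str:
    """Return context + chunk, ready to be embedded and indexed."""
    response = ollama.generate(
        model=model,
        prompt=CONTEXT_PROMPT.format(document=document, chunk=chunk),
    )
    return f"{response['response'].strip()}\n\n{chunk}"
```

The diagram below shows how this step fits into the framework's overall retrieval flow.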
```mermaid
graph TD
    A[User Query] --> B[Contextual Retrieval]
    B --> C[Local Knowledge Base]
    B --> D[Web Search]
    C --> E[Contextual Embeddings]
    C --> F[Contextual BM25]
    D --> E
    D --> F
    E --> G[Combined Retrieval Results]
    F --> G
    G --> H[Neural Reranking]
    H --> I[Top K Most Relevant Chunks]
    I --> J[Answer Generation]
    A --> J
    J --> K[Contextual Response]
    style B fill:#f9f,stroke:#333,stroke-width:4px
    style E fill:#bbf,stroke:#333,stroke-width:2px
    style F fill:#bbf,stroke:#333,stroke-width:2px
    style H fill:#bfb,stroke:#333,stroke-width:2px
    style J fill:#fbf,stroke:#333,stroke-width:4px
```
Features

Local Knowledge Base
- Store and manage documents in a vector database (Chroma)
- Add, remove, and update documents in the knowledge base
- Efficient retrieval using similarity search

Web Search Integration
- Incorporate up-to-date information from the web
- Use DuckDuckGo for privacy-focused web searches

Contextual Embeddings
- Generate embeddings using Ollama's local LLM
- Consider both query and document context for improved relevance

Contextual BM25 Scoring
- Implement a custom BM25 algorithm that considers context
- Score both local documents and web search results

Neural Reranking
- Use Ollama to rerank results based on relevance to the query and context
- Combine vector similarity, BM25 scores, and neural reranking for optimal results

Answer Generation
- Generate comprehensive answers using Ollama
- Consider both local knowledge and web search results

Command-line Interface
- Interactive mode for querying and managing the knowledge base
- Option to list knowledge base contents
Implementation Details

Local Knowledge Base
- Uses Chroma as the underlying vector database
- Implements add, remove, update, and query operations
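A minimal sketch of these operations with the chromadb package; the collection name, ids, and metadata fields are illustrative assumptions, not the framework's actual schema.

```python
# Sketch of the knowledge-base operations on top of Chroma.
import chromadb

client = chromadb.PersistentClient(path="./kb")            # on-disk store
collection = client.get_or_create_collection("documents")  # name is an assumption

# Add documents
collection.add(
    ids=["doc-1", "doc-2"],
    documents=["Chroma is a vector database.", "BM25 is a lexical ranking function."],
    metadatas=[{"source": "notes"}, {"source": "notes"}],
)

# Update and remove
collection.update(ids=["doc-2"], documents=["BM25 ranks documents by lexical overlap."])
collection.delete(ids=["doc-1"])

# Similarity search for the closest match
results = collection.query(query_texts=["What is BM25?"], n_results=1)
print(results["documents"][0])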
Contextual Embeddings
- Utilizes Ollama to generate embeddings
- Considers both the text and its context
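A sketch of that step with the ollama client; nomic-embed-text is just one possible local embedding model, assumed here for illustration.

```python
# Sketch: embed a chunk together with its chunk-specific context so the
# vector carries document-level information the bare chunk would lose.
import ollama

def embed_with_context(chunk: str, context: str, model: str = "nomic-embed-text") -> list[float]:
    contextualized = f"{context}\n\n{chunk}"
    response = ollama.embeddings(model=model, prompt=contextualized)
    return response["embedding"]
```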
Contextual BM25
- Custom implementation of the BM25 algorithm
- Dynamically updates with new documents for each query
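A self-contained sketch of such a scorer; the tokenizer is deliberately simple, and k1 = 1.5, b = 0.75 are the conventional BM25 defaults rather than values taken from this project.

```python
# Sketch of a small BM25 scorer. The corpus can be rebuilt per query so that
# newly added documents and fresh web results are scored as well.
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / max(n, 1)
    # document frequency of each query term
    df = {term: sum(1 for t in tokenized if term in t) for term in set(tokenize(query))}
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term, d in df.items():
            if d == 0 or term not in tf:
                continue
            idf = math.log(1 + (n - d + 0.5) / (d + 0.5))
            freq = tf[term]
            score += idf * freq * (k1 + 1) / (freq + k1 * (1 - b + b * len(tokens) / avgdl))
        scores.append(score)
    return scores

# Contextual BM25: index the context-prefixed text, not the bare chunk.
print(bm25_scores("vector database", ["Chroma is a vector database.", "BM25 ranks documents."]))
```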
Web Search
- Integrates DuckDuckGo for web searches
- Returns relevant snippets and URLs
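A sketch using the duckduckgo_search package, which is one straightforward way to implement this; the result keys ("title", "href", "body") follow that library's text() output.

```python
# Sketch: fetch web results and keep the snippet and URL for downstream scoring.
from duckduckgo_search import DDGS

def web_search(query: str, max_results: int = 5) -> list[dict]:
    with DDGS() as ddgs:
        hits = ddgs.text(query, max_results=max_results)
    return [{"title": h["title"], "snippet": h["body"], "url": h["href"]} for h in hits]

print(web_search("contextual retrieval anthropic", max_results=3))
```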
Neural Reranking
- Uses Ollama to score the relevance of results
- Considers query, context, and document content
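A sketch of prompt-based reranking with a local model; the 0-10 scale, prompt wording, and model name are illustrative assumptions.

```python
# Sketch: ask a local LLM to score each candidate's relevance, then sort.
import ollama

def rerank(query: str, candidates: list[str], model: str = "llama3", top_k: int = 5) -> list[str]:
    scored = []
    for doc in candidates:
        prompt = (
            f"Query: {query}\n\nDocument:\n{doc}\n\n"
            "On a scale of 0 to 10, how relevant is this document to the query? "
            "Answer with a single number only."
        )
        reply = ollama.generate(model=model, prompt=prompt)["response"].strip()
        try:
            score = float(reply.split()[0])
        except (ValueError, IndexError):
            score = 0.0  # unparseable reply: treat as irrelevant
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]
```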
Answer Generation
- Leverages Ollama to generate comprehensive answers
- Incorporates information from top-ranked results
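A sketch of the generation step; prompt wording and model name are again assumptions.

```python
# Sketch: compose an answer from the top-ranked chunks with a local LLM.
import ollama

def generate_answer(query: str, top_chunks: list[str], model: str = "llama3") -> str:
    sources = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(top_chunks))
    prompt = (
        "Answer the question using only the sources below. "
        "If they are insufficient, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}\nAnswer:"
    )
    return ollama.generate(model=model, prompt=prompt)["response"]
```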
Pipeline Orchestration
- Orchestrates the entire retrieval and generation process
- Combines local knowledge, web search, and various scoring mechanisms
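The document does not spell out how vector-similarity and BM25 results are merged before reranking; Reciprocal Rank Fusion (RRF) is one common, scale-free option, sketched here as an assumption rather than the project's actual formula.

```python
# Sketch: fuse best-first rankings from the vector store and from BM25 with
# Reciprocal Rank Fusion; k = 60 is the conventional RRF damping constant.
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf_fuse([
    ["chunk-a", "chunk-b", "chunk-c"],   # order from contextual embeddings
    ["chunk-b", "chunk-a", "chunk-d"],   # order from contextual BM25
])
print(fused)  # fused candidates, best first, ready for neural reranking
```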
Usage

- Install dependencies:
  pip install -r requirements.txt
- Run the main script:
  python main.py
- Use command-line options:
  - List knowledge base contents:
    python main.py --list_kb
- In interactive mode:
  - Upload documents to the knowledge base
  - Query the system
  - List knowledge base contents
  - Exit the program
Future Improvements

- Implement document chunking for handling larger texts
- Add support for multiple vector stores and LLM providers
- Improve error handling and logging
- Develop a web interface for easier interaction
- Implement caching mechanisms for improved performance
License

This project is licensed under the MIT License.