Add temporality awareness to openrag responses and chunk creation #130
base: dev
- openrag/utils/temporal.py
  - TemporalQueryNormalizer class extracts temporal filters
  - Date pattern recognition in multiple languages
  - Relative time extraction
- openrag/components/indexer/chunker/chunker.py
  - Chunkers now add an indexed_at timestamp to documents
  - Indexed documents are expected to provide a created_at timestamp if available
- openrag/components/indexer/vectordb/vectordb.py
  - Milvus schema updated to include created_at and indexed_at fields
  - Added temporal filtering support in vector database queries
- openrag/components/retriever.py & pipeline.py
  - Added temporal_filter parameter to all retrievers
  - Automatic temporal extraction from queries via TemporalQueryNormalizer
  - Injects current UTC datetime into system prompt
- openrag/components/reranker.py
  - Reranker now combines relevance and temporal scores using a linear decay formula
  - RERANKER_TEMPORAL_WEIGHT (default 0.3)
  - RERANKER_TEMPORAL_DECAY_DAYS (default 365)
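For readers who want a concrete picture of the reranker change, here is a minimal sketch of a linear-decay blend. The PR description only names the two settings and their defaults; the exact formula, the function names `temporal_score` and `combined_score`, and the `(1 - weight)` weighting are assumptions for illustration, not the code in this PR.

```python
from datetime import datetime, timedelta, timezone

# Defaults mirror the PR description; everything else here is assumed.
RERANKER_TEMPORAL_WEIGHT = 0.3
RERANKER_TEMPORAL_DECAY_DAYS = 365


def temporal_score(doc_date, now, decay_days=RERANKER_TEMPORAL_DECAY_DAYS):
    """Linear decay: 1.0 for a document dated now, 0.0 at or beyond decay_days."""
    # Clamp future-dated documents to an age of zero.
    age_days = max((now - doc_date).total_seconds() / 86400.0, 0.0)
    return max(0.0, 1.0 - age_days / decay_days)


def combined_score(relevance, doc_date, now, weight=RERANKER_TEMPORAL_WEIGHT):
    """Blend the reranker relevance with recency (hypothetical weighting)."""
    return (1.0 - weight) * relevance + weight * temporal_score(doc_date, now)


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
fresh = combined_score(0.8, now, now)
stale = combined_score(0.8, now - timedelta(days=730), now)
```

Under these assumptions, a document older than the decay window keeps only the relevance share of its score, so recency can reorder close candidates without dominating clearly more relevant ones.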
- Added extraction of the "modified_at" field during indexing
- Added the "datetime" metadata field as the preferred field for date information
- Added formatted prompt logging in DEBUG mode
- Fixed database search with date filters to use OR logic between date fields
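The OR logic between date fields could look roughly like the sketch below, which builds a boolean filter string in the style Milvus accepts. The field list, the helper name `build_temporal_expr`, and the assumption that dates are stored as ISO-8601 strings (so string comparison matches chronological order) are all hypothetical; the actual schema types in this PR may differ.

```python
from datetime import datetime, timezone

# Hypothetical field names taken from the PR notes; order is illustrative.
DATE_FIELDS = ("datetime", "modified_at", "created_at", "indexed_at")


def build_temporal_expr(start, end, fields=DATE_FIELDS):
    """Return a filter matching documents where ANY date field is in [start, end]."""
    lo, hi = start.isoformat(), end.isoformat()
    # One range clause per field, joined with OR so a single match is enough.
    clauses = [f'({name} >= "{lo}" and {name} <= "{hi}")' for name in fields]
    return " or ".join(clauses)


expr = build_temporal_expr(
    datetime(2025, 1, 1, tzinfo=timezone.utc),
    datetime(2025, 1, 31, tzinfo=timezone.utc),
    fields=("created_at", "indexed_at"),
)
```

Joining with AND instead would drop any document that lacks one of the fields or whose fields disagree, which is presumably the behavior the fix addresses.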
4. Temporal Awareness
* Pay attention to the **temporal context** of both the query and the retrieved documents.
* Each document includes **creation_date** and **indexed_date** metadata indicating when it was created and indexed.
* When the user asks about **recent events**, **latest updates**, or uses temporal references (e.g., "last week", "yesterday", "this year"), prioritize documents with **more recent dates**.
Proposed rewording with less redundancy: When the query includes temporal references (e.g., "last week", "yesterday", "this year"), prioritize documents with **more recent dates**.
self.relative_number_pattern = r'(\d+)\s*\w+|\w+\s+(\d+)'

# English patterns for backward compatibility
self.english_patterns = {
The naming seems incorrect; shouldn't it be something more like common_languages_patterns?
    return self._get_last_n_days(number)
else:
    # Large number, likely days
    return self._get_last_n_days(number)
This heuristic seems a bit risky to me. What if the query is something like "5 years" or "12 months"? We'd fall through to days, right? Likewise, a query like "summarize the documents mentioning 7 eleven acquisition" would be treated as a time query for 7 days, right?
I feel like we could get a lot of false positives here.
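One possible way to tighten this, sketched under assumptions: only treat a number as a time window when it comes with an explicit unit and a cue word such as "last" or "past". The names `RELATIVE_RE`, `UNIT_DAYS`, and `extract_relative_window` are hypothetical, the pattern is English-only, and month and year lengths are approximated in days.

```python
import re
from datetime import timedelta

# Approximate day counts per unit; months and years are deliberately rough.
UNIT_DAYS = {"day": 1, "week": 7, "month": 30, "year": 365}

# Require a cue word AND an explicit unit, so a bare number never matches.
RELATIVE_RE = re.compile(
    r"\b(?:last|past|previous)\s+(\d+)\s+(day|week|month|year)s?\b",
    re.IGNORECASE,
)


def extract_relative_window(query):
    """Return a timedelta for phrases like 'last 5 years', else None."""
    match = RELATIVE_RE.search(query)
    if match is None:
        return None
    number = int(match.group(1))
    unit = match.group(2).lower()
    return timedelta(days=number * UNIT_DAYS[unit])


no_window = extract_relative_window(
    "summarize the documents mentioning 7 eleven acquisition"
)
five_years = extract_relative_window("what changed in the last 5 years")
```

The trade-off is recall: a bare "5 years" without a cue word is no longer picked up, so the cue list would need extending per supported language.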
API changes: