Archdiner/GraphRAG-Dodd-Frank

📜 Dodd-Frank Act Querying Using GraphRAG and Natural Language

This folder contains my project for complex querying of the Dodd-Frank Act using Microsoft's GraphRAG. This README outlines the project's purpose and goals, initial research, data ingestion, search implementation, UI, comparison with standard RAG, and next steps.

Purpose and Goals of the Project

The aim of this project was to implement an app that allows a user to ask questions about the Dodd-Frank Act using natural language and receive accurate and relevant answers.

Additionally, this project aimed to explore the capabilities of Microsoft's new GraphRAG technology and see how it compares to standard 'naive RAG' applications.

Given these aims, I outlined the following goals for this project:

Key Goals and Objectives

Goal | Description
--- | ---
1 | Create a comprehensive knowledge graph of the Dodd-Frank Act using GraphRAG.
2 | Implement global and local search functionalities.
3 | Develop a user-friendly UI.
4 | Ensure response accuracy and text grounding.
5 | Compare GraphRAG with naive RAG responses.

These goals clarified the scope of this project and outlined my next steps as I took this project on.


Initial Research and Analysis

Before deciding to implement my knowledge graph using GraphRAG, I conducted some research on different potential methods of implementing knowledge graphs. I found the two most promising and practical methods for this case were LangChain with Neo4j and Microsoft's GraphRAG.

Comparison of Neo4j + LangChain with GraphRAG 🔄

LangChain + Neo4j Benefits:

  • Customization: High control over schema design and data relationships.
  • Integration: Supports various backends, offering flexibility for hybrid setups.
  • Community: Strong support and documentation.

GraphRAG Benefits:

  • Automation: Uses LLMs for automated graph construction from large datasets.
  • Hierarchical Structure: Builds a structured graph with community summaries.
  • Integration: Seamless with Azure services, offering robust cloud support.
  • Ease of Use: Quick deployment with a solution accelerator, minimal setup.

For more information, visit the GraphRAG Documentation and the Neo4j Documentation.

Ultimately, however, I chose GraphRAG because of the time constraints I faced and because I lacked the Neo4j knowledge needed to take full advantage of its flexibility.

Required Packages

All the required packages for this project can be found in the requirements.txt file.
To install all these packages, use the command:

pip install -r requirements.txt

I recommend doing all this within a new virtual environment.

Ingesting the Data

In this stage we build the knowledge graph. Starting from the initial data, GraphRAG chunks the text and generates embeddings for each chunk. It then extracts key entities and the relationships between them from the data. From these entities and relationships, GraphRAG detects communities and generates community reports, where each community groups closely related entities and relationships, to form a hierarchical knowledge graph.

This was a very challenging part of the process. To get started, I initially followed the guide by Microsoft on getting started with GraphRAG, but made a few key changes.

  1. Custom Chunking Method: In the settings.yaml file, I overrode the default character text splitter and used a recursive character text splitter to obtain better chunks of text.

  2. JSON Decode Errors: During the final community report generation phase, I frequently encountered JSON Decode Errors. To solve this, I adjusted the prompts in the prompts folder and patched the prompt in the venvy folder in graphrag\index\graph\extractors\community_reports\prompts.py to ensure the JSONs used as examples were correctly formatted.

  3. Configuration: Changed values in the settings.yaml file to allow for parallelization whilst staying within my model's token limit, as well as configuring the file overall to suit my purposes.
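The recursive splitting idea from item 1 can be sketched in plain Python. This is a minimal illustration, not GraphRAG's or LangChain's implementation; the function name `recursive_split`, the separator list, and the chunk size are all assumptions chosen for the example.

```python
from typing import List

# Hypothetical separators, tried from coarsest to finest.
SEPARATORS = ["\n\n", "\n", ". ", " "]


def recursive_split(text: str, max_chars: int, separators: List[str] = SEPARATORS) -> List[str]:
    """Split text into chunks of at most max_chars, preferring coarse boundaries."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # The piece alone is too long: recurse with finer separators.
            chunks.extend(recursive_split(piece, max_chars, finer))
    if current:
        chunks.append(current)
    return chunks
```

Compared with a single fixed-size cut, this approach tends to keep paragraphs and sentences intact, which is likely why it produced better chunks for a long legal text.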

Generating the final knowledge graph with GraphRAG took 4 hours 11 minutes in total. Once I had resolved the configuration issues and errors I ran into along the way, the process was relatively straightforward.

The key outputs of this process are the parquet files in the output\20240721-094231\artifacts folder, which contain the knowledge graph data.
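A minimal sketch for inspecting these artifacts, assuming `pandas` with a parquet engine such as `pyarrow` is installed. The helper names are hypothetical, and because parquet file names can differ between GraphRAG versions, the code lists whatever files are present rather than assuming specific names.

```python
from pathlib import Path
from typing import Dict, List


def list_parquet_files(artifacts_dir: str) -> List[Path]:
    """Return the parquet files in the artifacts folder, sorted by name."""
    return sorted(Path(artifacts_dir).glob("*.parquet"))


def load_artifacts(artifacts_dir: str) -> Dict[str, "pandas.DataFrame"]:
    """Load every parquet file into a DataFrame keyed by file stem."""
    import pandas as pd  # imported lazily; needs pyarrow or fastparquet

    return {p.stem: pd.read_parquet(p) for p in list_parquet_files(artifacts_dir)}


# Example usage (path taken from this README; adjust to your own run folder):
# for name, frame in load_artifacts("output/20240721-094231/artifacts").items():
#     print(name, frame.shape)
```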


After generating the knowledge graph, I visualized it in a Jupyter notebook with yfiles-jupyter-graphs, as seen in visualize.ipynb, following this example. This is the visualization I got: [Image: knowledge graph visualization]

Performing GraphRAG

First, here is some key information about how GraphRAG search works.

Global vs Local Search

Global Search is used when the query requires a broad understanding and connections across various communities. It is useful for queries that are not highly specific and benefit from a wider context. Increasing the community level allows more granular connections to be identified.
Local Search focuses on a specific subgraph or individual communities within the knowledge graph. It provides more detail when searching for specifics and has more in common with standard RAG.

Implementing Global Search

To implement global search, as seen in my global_search.py file, I followed the Global Search Guide by Microsoft, adjusting the values and parameters to better suit my model and implementation.

The global search implementation uses a map-reduce methodology: [Image: global search map-reduce diagram]

  • Map Response: The system splits the dataset into chunks and generates intermediate responses for each chunk, highlighting key points with relevance scores derived from the related community reports.

  • Refine (Reduce) Response: It aggregates these intermediate responses, then filters and ranks them by importance to create a comprehensive final answer.

This approach allows GraphRAG to handle complex queries by synthesizing detailed responses from a large dataset.
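The map-reduce flow described above can be sketched as follows. This is a toy illustration, not GraphRAG's implementation: `score_report` is a hypothetical stand-in for the LLM call that would rate and summarize each community report, and here it simply counts keyword overlap with the query.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KeyPoint:
    text: str
    score: int  # relevance score; an LLM would assign this in practice


def score_report(query: str, report: str) -> KeyPoint:
    """Hypothetical stand-in for the LLM map call: score by shared words."""
    query_words = set(query.lower().split())
    overlap = sum(1 for w in set(report.lower().split()) if w in query_words)
    return KeyPoint(text=report, score=overlap)


def map_step(query: str, reports: List[str]) -> List[KeyPoint]:
    """Map: produce one scored intermediate response per community report."""
    return [score_report(query, r) for r in reports]


def reduce_step(points: List[KeyPoint], top_k: int = 2) -> str:
    """Reduce: drop irrelevant points, rank the rest, and merge the top ones."""
    relevant = [p for p in points if p.score > 0]
    ranked = sorted(relevant, key=lambda p: p.score, reverse=True)
    return "\n".join(p.text for p in ranked[:top_k])
```

In a real pipeline the reduce step would pass the ranked points back to the LLM to write the final answer; joining the text is only a placeholder for that call.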

Implementing Local Search

To implement local search as seen in my local_search.py file, I followed the process detailed in the Local Search Guide by Microsoft, adapting the parameters to fit my specific model and implementation.

The local search implementation uses a focused and streamlined approach:

[Image: local search diagram]

  • Search Response: The system combines structured data from the AI-extracted knowledge graph with relevant text chunks from the raw documents. It identifies a set of entities related to the user query, extracting associated details like connected entities, relationships, and community reports.

  • Summarize Response: The system synthesizes a concise response from the identified relevant documents and entities, ensuring that the final answer is accurate and directly related to the query.

Methodology

  1. Entity Extraction: The system identifies entities from the knowledge graph that are semantically related to the user query.
  2. Data Mapping: These entities are mapped to relevant text units, community reports, relationships, and covariates (if applicable) from the dataset.
  3. Prioritization and Filtering: The candidate data sources are ranked and filtered to fit within the context window, ensuring only the most relevant information is included.
  4. Response Generation: The filtered data is used to generate a comprehensive response to the user query.
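Step 3 above, ranking candidates and filtering them to fit the context window, can be sketched like this. It is a simplified illustration under stated assumptions: `count_tokens` approximates tokens by whitespace-separated words rather than using a real tokenizer, and the relevance scores are assumed to come from an earlier ranking step.

```python
from typing import List, Tuple


def count_tokens(text: str) -> int:
    """Rough token estimate by whitespace split (assumption, not a real tokenizer)."""
    return len(text.split())


def build_context(candidates: List[Tuple[float, str]], max_tokens: int) -> List[str]:
    """Keep the highest-ranked candidates that fit within the token budget."""
    selected: List[str] = []
    used = 0
    for _, text in sorted(candidates, key=lambda c: c[0], reverse=True):
        cost = count_tokens(text)
        if used + cost > max_tokens:
            continue  # skip items that would overflow the context window
        selected.append(text)
        used += cost
    return selected
```

For example, with a budget of 6 tokens and candidates scored 0.9 (4 tokens), 0.5 (2 tokens), and 0.2 (3 tokens), only the first two are kept.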

With both search methods set up and fully functional, I was ready to pair my GraphRAG setup with a UI.

Creating a UI

In this section, I outline the different approaches I tried for creating the UI. Note that in both cases I used the global search program. The goal was to implement a chatbot that allows natural language querying of the Dodd-Frank Act, lets the user view the entire Dodd-Frank Act in PDF form on the side of the screen, and lets the user click on and view sources from the generated output on the side of the screen.

Streamlit

Main app functionality: [Image: screenshot of the Streamlit app]

Of the methods I tested, Streamlit offers the most customizability and flexibility for creating a dashboard, while still being relatively easy to implement. My Streamlit implementation can be found in the streamlit_app folder. Ultimately, the disadvantages listed below led me not to choose Streamlit for my final implementation:

  • Advantages:

    • Highly customizable and flexible.
    • Easy to implement for dashboard creation.
  • Disadvantages:

    • Less responsive UI.
    • Frequent lags and slow performance during use.
    • Source citation feature was difficult to implement and left incomplete.
    • More challenging to implement PDF viewer and chatbot, resulting in less clean outputs.

Chainlit

Main app functionality: [Image: screenshot of the Chainlit app]

Source citation: [Image: screenshot of source citation in the Chainlit app]

The code for my Chainlit implementation can be found in the chainlit_app folder. Although Streamlit offers more flexibility and customizability, Chainlit specializes in chatbots and provides a much better and more practical UI for my use case. Here are the reasons why I decided to use Chainlit in the end:

  • Advantages:

    • Highly responsive and clean UI with a similar feel to ChatGPT.
    • No instances of lag or UI problems.
    • PDF viewer and chatbot were very easy to implement and looked great.
    • Source citation works as intended and was easier to implement.
  • Disadvantages:

    • Less customizable compared to Streamlit.

Conclusion

In conclusion, while Streamlit provided more flexibility and customization options, the performance issues and implementation difficulties led me to choose Chainlit. Chainlit's responsiveness, ease of implementation for the PDF viewer and chatbot, and effective source citation feature made it the ideal choice for my project.

Comparison with Standard RAG

To compare my GraphRAG setup with standard RAG, I first created a simple standard RAG application using LangChain and its built-in RetrievalQA chain, together with a FAISS vector store.
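The retrieve-then-prompt pattern behind a standard RAG application can be sketched in plain Python. This is not the LangChain RetrievalQA or FAISS code used in this project: word-count vectors and cosine similarity are swapped in for a real embedding model and vector store, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a real setup would use an embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query: str, chunks: List[str]) -> str:
    """Stuff the retrieved chunks into a prompt for the LLM."""
    context = "\n".join(retrieve(query, chunks))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Because retrieval here works on isolated chunks with no graph or community structure, it illustrates why standard RAG answers can be detailed but fragmented.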

I then tested a variety of questions and documented the responses I got from both GraphRAG and standard RAG. This research data, as well as my standard RAG implementation, can be found in the data_research folder; see it for more details about these responses.

Here is a brief summary of the key differences between the responses:

  • Standard RAG Responses:

    • Focus on detailed legal analysis and specific provisions
    • Dense and technical, with heavy use of legal jargon
    • Fragmented structure, often lacking a cohesive approach
    • Less practical and less applicable to real-world scenarios
    • Less contextual awareness
  • GraphRAG Responses:

    • Provide comprehensive overviews with key aspects
    • Well-organized with clear sections and headings
    • Highly practical and applicable to real-world scenarios
    • Include practical examples, case studies, and actionable steps

Conclusion:

GraphRAG responses are generally more user-friendly and practical, offering clear, actionable guidance and a cohesive narrative. Standard RAG responses, while thorough in legal detail, can be dense and less contextually aware, making them harder to apply in real-world scenarios.

Next Steps

Given the limited timeframe of the project, there are still features and ideas that I have not had time to explore but that are worth considering moving forward.

Professional Feedback

I was fortunate to have the chance to showcase my findings and project to an individual who helped craft the legislation over a decade ago, and ask for subject matter expertise on the Dodd-Frank Act. Based on the information I received from the meeting, here are some potential improvements:

  • Handle Slang Financial Terminology: Ensure the GraphRAG system can interpret and respond to 'slang' financial terminology in queries.
  • Direct Act References: Instead of linking community reports, provide relevant sections and titles from the Dodd-Frank Act in the responses.
  • Connect Business Lines and Act Sections: Enable the GraphRAG system to draw connections between business lines and relevant sections of the Dodd-Frank Act, along with specific and accurate compliance guidance.
  • Incorporate Relevant Regulations and Cases: Alongside the Dodd-Frank Act, provide relevant regulations and potentially related federal cases in the responses.
  • Update Knowledge Graph for Future Regulations: Develop a method to gradually update the knowledge graph to account for future regulations that may alter or invalidate parts of the Act.
  • Compare with International Regulations: Allow for comparisons with international regulatory frameworks such as MiFID, outlining key differences and similarities for various scenarios. A further step would be enabling queries about merging regulations and their impact on business lines.

Potential Technical Improvements

Based on my observations, here are some areas where I believe more can be done:

  • Improve Citations: Enable citations directly from the Dodd-Frank Act rather than community reports.
  • Adaptive Search Mechanism: Automatically use either local or global search depending on the query. Analyze the query and determine which would provide a better answer.
  • Enhance UI: Develop a more professional UI beyond Chainlit. Incorporate tabs to switch between the Dodd-Frank PDF on the side and sources.
  • Evaluation: Provide a means of evaluating the system's search responses, for example by outputting a score that indicates response quality.
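As a starting point for the adaptive search idea, a simple heuristic router might look like the sketch below. The cue words and the digit rule are assumptions made for illustration; this feature is not implemented in the project, and a more robust version would probably ask an LLM to classify the query.

```python
import re

# Hypothetical cue words that suggest a broad, dataset-wide question.
GLOBAL_CUES = {"overall", "summary", "summarize", "themes", "main", "broadly", "overview"}


def choose_search(query: str) -> str:
    """Return "global" for broad questions and "local" for specific ones."""
    # A section number or any other digit suggests a specific, local question.
    if re.search(r"\d", query):
        return "local"
    words = set(re.findall(r"[a-z]+", query.lower()))
    return "global" if words & GLOBAL_CUES else "local"
```

Defaulting to local search when no cue matches keeps the cheaper, more specific path as the fallback.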

In summary, while significant progress has been made, there are several areas for further enhancement that could greatly improve the usability and functionality of the system.
