This repo contains the project code for the course "Technologies for Big Data Management", held by Professor Massimo Callisto at the University of Camerino.
- Introduction
- Technologies
- Prerequisites
- Installation & Configuration
- Usage
- Results
- License
- Contact Information
The goal of this project is to dynamically scrape company reviews from Indeed and embed them into the context of a chatbot. The chatbot can then communicate with the user and answer questions about the scraped company. For example, a user might be interested in the salary level at a certain company but may not want to read thousands of reviews to find that out. This project aims to solve exactly that problem. The application scrapes review data from the web, then embeds it into a vector database so that it can be retrieved later when the user asks a question.
Docker is a platform designed to help developers build, package, and deploy applications in containers. Containers allow applications to be portable and consistent across different environments by encapsulating them with all their dependencies. Docker containers provide lightweight virtualization, improving development workflows and infrastructure consistency. It simplifies the setup and scaling of environments, especially in CI/CD pipelines and microservices architectures. Docker is widely used to manage the lifecycle of applications and improve deployment efficiency.
Apache Kafka is a publish-subscribe messaging solution designed for real-time data streaming and distributed pipelines. It excels at high-performance data integration, streaming analytics, and data feed replaying. Kafka servers store data streams as records within a cluster, ensuring durability and scalability. Kafka’s Streams API allows developers to process, filter, and aggregate real-time data streams to build sophisticated streaming applications. Its versatility has made it essential for building real-time applications in various industries.
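To make the publish-subscribe model concrete, here is a minimal Python sketch assuming the kafka-python client and a broker on localhost:9092; the "reviews" topic and its payload are made up for illustration and are not the project's actual code.

```python
# Minimal publish-subscribe sketch (assumption: the kafka-python client and a
# broker on localhost:9092; the "reviews" topic and payload are illustrative).
import json

from kafka import KafkaConsumer, KafkaProducer

# Publish one review as a JSON record.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("reviews", {"company": "ExampleCorp", "text": "Great salary and benefits."})
producer.flush()

# Read the topic back from the beginning as a subscriber.
consumer = KafkaConsumer(
    "reviews",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for record in consumer:
    print(record.value)
    break
```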
Apache Spark is a powerful distributed processing engine that handles large-scale data processing across clusters in real-time and batch modes. With its in-memory computing capabilities, Spark delivers high-speed processing of big data. It integrates well with various data storage solutions such as HDFS, Cassandra, and S3, and provides APIs for different programming languages, including Python, Java, and Scala. Apache Spark’s structured streaming and machine learning libraries make it a popular choice for processing big data in real time and advanced analytics.
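As an illustration of structured streaming, the sketch below reads a Kafka topic into a streaming DataFrame with PySpark; the broker address, topic name, and console sink are assumptions, and running it requires the spark-sql-kafka connector on the Spark classpath.

```python
# Structured streaming sketch (assumptions: a broker on localhost:9092, a "reviews"
# topic, and the spark-sql-kafka connector available to Spark).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("reviews-stream-sketch").getOrCreate()

# Read the Kafka topic as an unbounded streaming DataFrame.
stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "reviews")
    .load()
)

# Kafka delivers key/value as binary, so cast the value to a string and print it.
query = (
    stream.select(col("value").cast("string").alias("review_json"))
    .writeStream
    .format("console")
    .start()
)
query.awaitTermination()
```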
ChromaDB is a vector database designed to handle high-dimensional data such as embeddings used in AI applications. It supports operations such as searching, clustering, and organizing vectorized data efficiently. As machine learning models and LLMs often produce large volumes of embedded data, ChromaDB provides a scalable solution to store and query this data in real-time. Its efficient handling of embeddings makes it a core component of many AI-based workflows.
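The following minimal sketch shows the kind of add/query workflow ChromaDB supports, using its in-memory Python client; the collection name, documents, and metadata are made up, and this is not the project's actual embedding code.

```python
# Add/query sketch with the in-memory ChromaDB client (collection name, documents
# and metadata are made up; not the project's actual embedding code).
import chromadb

client = chromadb.Client()
collection = client.create_collection("company_reviews")

# Store a few reviews; Chroma embeds the documents with its default embedding function.
collection.add(
    documents=["Great salary and benefits.", "Long hours and little flexibility."],
    metadatas=[{"company": "ExampleCorp"}, {"company": "ExampleCorp"}],
    ids=["rev-1", "rev-2"],
)

# Retrieve the reviews most similar to a user question.
results = collection.query(query_texts=["How is the pay?"], n_results=2)
print(results["documents"])
```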
Elastic (Elasticsearch) is a distributed, RESTful search engine designed for large volumes of data, such as logs or metrics. It is commonly used for full-text search, analytics, and monitoring. Elasticsearch is part of the Elastic Stack, which includes Kibana for visualizing search results and managing queries. Together, Elastic and Kibana provide a comprehensive solution for real-time search and analytics across datasets, making them popular in log analysis, infrastructure monitoring, and business analytics use cases.
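A minimal sketch of indexing and full-text search with the official Elasticsearch Python client follows; the index name, document shape, and node address are assumptions.

```python
# Index-and-search sketch (assumptions: the official elasticsearch Python client
# and a node on http://localhost:9200; the "reviews" index is made up).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index one review document.
es.index(index="reviews", document={"company": "ExampleCorp", "text": "Great salary and benefits."})
es.indices.refresh(index="reviews")

# Full-text search over the indexed reviews.
response = es.search(index="reviews", query={"match": {"text": "salary"}})
for hit in response["hits"]["hits"]:
    print(hit["_source"])
```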
Large Language Models (LLMs) are AI models designed to understand and generate human language. With advancements in deep learning and transformer architectures, LLMs such as GPT-3, BERT, and others have demonstrated impressive capabilities in tasks like text generation, summarization, translation, and more. LLMs are integral to many NLP applications, helping businesses automate processes, build chatbots, enhance search engines, and create personalized user experiences through natural language understanding.
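In this project the LLM is used in a retrieval-augmented way: reviews stored in the vector database are used as context when answering the user's question. The sketch below illustrates the idea with a hypothetical prompt-building helper; the function name and prompt wording are illustrative, not the project's actual code.

```python
# Hypothetical prompt-building helper for retrieval-augmented answering; the
# function name and prompt wording are illustrative, not the project's code.
def build_prompt(question: str, retrieved_reviews: list[str]) -> str:
    # Put the retrieved reviews in front of the user's question so the LLM
    # answers from the reviews instead of guessing.
    context = "\n".join(f"- {review}" for review in retrieved_reviews)
    return (
        "Answer the question using only the following employee reviews:\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

print(build_prompt("How is the salary?", ["Great salary and benefits.", "Pay is below average."]))
```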
The prerequisites are:
- Python 3.12.4
- Docker and docker-compose
- Git (to clone this repo)
- JDK 17 (to run the Spark application)
To download Python 3.12.4 you can follow the guide in the Python 3.12.4 installation page and choose the correct version according to your system.
To download Docker you can follow the guide in the Docker installation page and choose the correct version according to your system.
To download Git you can follow the guide in the Git installation page and choose the correct version according to your system.
To download and install JDK 17 you can follow the guide in the JDK 17 installation page and choose the correct version according to your system.
Once the prerequisites are correctly installed, we can set up the environment to run the project.
First things first, clone the repo and move into the root folder:
```bash
git clone https://github.com/Meguazy/Review-scraper-chatbot-embedding.git
cd Review-scraper-chatbot-embedding/
```
We can now install the Python dependencies. We suggest using a virtual environment manager such as pipenv. To install the dependencies, run:
```bash
pipenv install -r requirements.txt
```
This allows you to install all of the project's Python dependencies with a single command.
Once the Python requirements are installed, you can create the Docker containers. First, position yourself in the root directory of the project, then run:
```bash
docker-compose up -d
```
This will build and start the pre-defined containerized environment. The environment handles every service on its own, without requiring the user to manually set anything up. Everything is already handled and pre-defined inside the Dockerfile, docker-compose.yaml, the .env file and the chroma_configs/ directory.
To use the program, first enter the virtual environment shell:
```bash
pipenv shell
```
This opens the virtual environment so that we can use all of the dependencies we previously installed.
Then we need to move inside the src/ folder and set the PYTHONPATH:
```bash
cd src/
export PYTHONPATH=$(pwd)
```
IMPORTANT: these first two steps must be repeated every time you open a new terminal.
Before taking a look at the scraper, we first need to start the two consumers. To do this, we must open two different terminals so that we can interact with both of them. In the first terminal, run:
```bash
chmod +x codeKafka/submit.sh
bash codeKafka/submit.sh
```
We can then visit the Spark main page to monitor the job's execution.
In the second terminal, run:
```bash
python codeKafka/consumer_elastic.py
```
The first terminal starts the consumer that writes the data to the vector database, while the second terminal runs the one that writes to the Elasticsearch indexes.
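For intuition, the following is a minimal, hypothetical sketch of what a Kafka-to-Elasticsearch consumer of this kind could look like; the topic name, index name, and field layout are assumptions, not the contents of the project's consumer_elastic.py.

```python
# Hypothetical Kafka-to-Elasticsearch consumer loop; topic, index and field layout
# are assumptions, not the contents of the project's consumer_elastic.py.
import json

from elasticsearch import Elasticsearch
from kafka import KafkaConsumer

es = Elasticsearch("http://localhost:9200")
consumer = KafkaConsumer(
    "reviews",  # assumed topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for record in consumer:
    # Each scraped review is indexed as-is into an assumed "reviews" index.
    es.index(index="reviews", document=record.value)
```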
To start the scraper and the chatbot app, we need to launch the Streamlit webapp in yet another terminal:
```bash
streamlit run app.py
```
By visiting the Streamlit main page, the app will present itself like this:
In the first section you can enter a link to scrape, like the example one shown. While scraping, the entire webapp is blocked and cannot be used until the application has completed the process. In the second section, we can ask the chatbot a question. We can do this even without scraping any site if the information is already present inside the vector database. To use the chatbot correctly, we have to keep in mind two things (a rough sketch of the interface follows this list):
- The name of the company must match the one present in the Indeed link. For example, if we want to scrape Poste Italiane and the company name in the link is "Poste-Italiane", we must put exactly that inside the field;
- The number of results should be chosen carefully. More results do not automatically mean a better chatbot response. For example, if a company has 20 reviews and we choose 10 results for the query, we will probably get reviews that have nothing in common with the question we asked, because the sample of reviews isn't big enough. On the other hand, if a company has thousands of reviews, choosing a number that is too low could exclude important context from the question. Also keep in mind that chatbots do not perform well with prompts that are too complex and lengthy.
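As promised above, here is a rough Streamlit sketch of the two-part interface just described; the widget labels and flow are illustrative assumptions, not the project's actual app.py.

```python
# Rough sketch of the two-part UI; widget labels and flow are illustrative,
# not taken from the project's app.py.
import streamlit as st

st.header("Scrape Indeed reviews")
url = st.text_input("Indeed reviews URL")
company = st.text_input("Company name (exactly as it appears in the link)")
if st.button("Scrape") and url:
    # In the real app the whole UI blocks here until scraping finishes.
    st.info(f"Scraping reviews for {company}...")

st.header("Ask the chatbot")
question = st.text_input("Your question")
n_results = st.number_input("Number of results", min_value=1, value=5)
if st.button("Ask") and question:
    st.write(f"Querying the vector database for the top {n_results} most similar reviews...")
```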
The following is an example of how to use the chatbot:
Finally, the user can access the indexes with Elastic and Kibana and create dashboards to analyze the data by accessing the Kibana main page.
The following is an example response from the chatbot, based on the prompt described above:
As we can see, we are able to visualize both the prompt and the answer.

We've built two dashboards in Kibana. The first one represents the sentiment analysis, which was computed using the Python package textblob. The first kind of plot is a pie chart containing the sentiment percentages, like the following:
The second plot is a heatmap that describes the polarities of the companies.

We also defined a word cloud so that we can see the most used words in the reviews and gain useful insights. The following is an example for Fastweb:
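As a small illustration of the sentiment computation mentioned above, the sketch below shows how a review's polarity can be computed with TextBlob; the thresholds that map polarity to a label are an assumption, not the project's exact rule.

```python
# Per-review sentiment sketch with TextBlob; the thresholds that map polarity
# to a label are an assumption, not the project's exact rule.
from textblob import TextBlob

def sentiment_label(text: str) -> tuple[float, str]:
    # Polarity ranges from -1.0 (most negative) to 1.0 (most positive).
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0.1:
        return polarity, "positive"
    if polarity < -0.1:
        return polarity, "negative"
    return polarity, "neutral"

print(sentiment_label("Great salary and a very supportive team."))
print(sentiment_label("Long hours and poor management."))
```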
Indeed scraping for chatbot interaction is available under the MIT license
| Contact | Email |
|---|---|
| Francesco Finucci | francesco.finucci@studenti.unicam.it |
| Andrea Palmieri | andrea03.palmieri@studenti.unicam.it |