Personalized Movie Recommendation System using PySpark & Collaborative Filtering

Project 1 of 6 | Pushed as part of my academic + real-world ML portfolio

Overview

In the ever-growing jungle of streaming content, users often get lost in endless scrolls and mediocre suggestions. Our project dives into solving this problem by building a personalized movie recommendation system powered by collaborative filtering and Apache Spark, capable of processing massive datasets and giving spot-on suggestions based on user behavior.

Key Features

Personalized suggestions based on user-item interaction
Built with PySpark on Apache Spark for large-scale performance
Evaluated using RMSE, precision, and recall
Scalable, fast, and adaptable to various streaming platforms
Acknowledges bias and privacy issues in recommender systems

Tech Stack

Language: Python
Frameworks: PySpark, Apache Hadoop (HDFS)
Tools: MLlib, Jupyter, VS Code
Algorithm: User-based Collaborative Filtering

Dataset

Contains over 8,000+ user interactions and movie ratings
Publicly sourced, includes diverse genres, languages, and release years
Preprocessing steps include handling nulls, normalization, and outlier removal

Dataset

This project uses the (https://www.kaggle.com/datasets/grouplens/movielens-20m-dataset) which contains millions of user-movie interactions, ratings, and metadata.

For quick testing, a sample dataset (netflix_titles.csv) is included in the /data folder.

To use the full dataset:

Sign in to Kaggle
Download the dataset from the link above
Place it in the root directory or update the path in the code accordingly

Results

Achieved RMSE = 3.7725 on our baseline implementation
Compared with benchmark paper achieving RMSE = 1.0742
Insights into how parameter tuning (lambda, iterations, rank) affects performance

🔍 Research & References

We’ve drawn inspiration and technical strategies from key works including:

_For the full IEEE-style paper, check the documenation folder in this repo :)

Authors & Credits

Built by a team of graduate students as part of our coursework under the guidance of our incredible supervisor (see acknowledgments in paper). Shoutout to all contributors and cited researchers!

Future Work

Incorporating hybrid models (content + collaborative)
Introducing privacy-preserving mechanisms
Deploying the system on a cloud platform for live inference

License

feel free to fork, star, and remix with credit!

📁 Project Structure

PySparkFlicks_MovieRecommender/

|--- code/                  → PySpark code and scripts
├──  notebooks/             → Jupyter Notebooks for exploration
├──  data/                  → Sample Netflix dataset
├──  documentation/         → IEEE paper, diagrams, and references
├──  .github/workflows/     → CI/CD workflows (Python)
├──  requirements.txt       → Python dependencies
├──  setup.py               → Installable package setup (optional)
├──  README.md              → This very file
└──  LICENSE                → Open-source license

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Personalized Movie Recommendation System using PySpark & Collaborative Filtering

Overview

Key Features

Tech Stack

Dataset

Dataset

Results

🔍 Research & References

Authors & Credits

Future Work

License

📁 Project Structure

About

Uh oh!

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
.github/workflows		.github/workflows
code		code
documentation		documentation
README.md		README.md

gnevercodes/PySparkFlicks_MovieRecommender

Folders and files

Latest commit

History

Repository files navigation

Personalized Movie Recommendation System using PySpark & Collaborative Filtering

Overview

Key Features

Tech Stack

Dataset

Dataset

Results

🔍 Research & References

Authors & Credits

Future Work

License

📁 Project Structure

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Uh oh!

Languages