A personalized movie recommendation system powered by PySpark, using collaborative filtering to deliver spot-on suggestions based on user behavior. Built for scale. Made for binge-watchers.

gnevercodes/PySparkFlicks_MovieRecommender

Personalized Movie Recommendation System using PySpark & Collaborative Filtering

Project 1 of 6 | Pushed as part of my academic + real-world ML portfolio

Overview

In the ever-growing jungle of streaming content, users often get lost in endless scrolls and mediocre suggestions. This project tackles that problem with a personalized movie recommendation system built on collaborative filtering and Apache Spark, designed to process large datasets and tailor suggestions to each user's behavior.

Key Features

  • Personalized suggestions based on user-item interaction
  • Built with PySpark on Apache Spark for large-scale performance
  • Evaluated using RMSE, precision, and recall
  • Scalable, fast, and adaptable to various streaming platforms
  • Acknowledges bias and privacy issues in recommender systems
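The three evaluation metrics listed above can be sketched in plain Python. This is an illustration only, not the repository's evaluation code: the README does not say how precision and recall are defined here, so the sketch assumes the common top-k form, where `relevant` is the set of movies a user actually liked (for example, rated above some threshold). The function names are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length lists of ratings."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and the same length")
    squared_errors = ((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(sum(squared_errors) / len(actual))


def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the first k recommended items.

    recommended: ranked list of movie ids (best first)
    relevant:    set of movie ids the user actually liked (assumed definition)
    """
    top_k = recommended[:k]
    hits = sum(1 for movie in top_k if movie in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

RMSE measures how far predicted ratings are from the true ones (lower is better), while precision and recall measure how well the ranked list matches what the user liked (higher is better).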

Tech Stack

  • Language: Python
  • Frameworks: PySpark, Apache Hadoop (HDFS)
  • Tools: MLlib, Jupyter, VS Code
  • Algorithm: User-based Collaborative Filtering
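To make the idea of user-based collaborative filtering concrete, here is a minimal pure-Python sketch. It is not the repository's PySpark implementation: it assumes cosine similarity over co-rated movies and a similarity-weighted average for prediction, which is one common formulation among several. The data layout (`ratings[user][movie]`) and function names are hypothetical. Note that Spark MLlib's built-in recommender is ALS, a matrix-factorization method, so a neighborhood approach like this one would have to be written by hand on top of Spark.

```python
import math


def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity between two users over the movies both have rated."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    dot = sum(ratings_a[m] * ratings_b[m] for m in common)
    norm_a = math.sqrt(sum(ratings_a[m] ** 2 for m in common))
    norm_b = math.sqrt(sum(ratings_b[m] ** 2 for m in common))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_rating(ratings, user, movie):
    """Predict a rating as the similarity-weighted average of other users' ratings.

    Returns None when no other user has rated the movie.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or movie not in other_ratings:
            continue
        similarity = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += similarity * other_ratings[movie]
        weight_total += abs(similarity)
    return weighted_sum / weight_total if weight_total else None
```

Users whose past ratings look like the target user's get more say in the prediction; a movie nobody else has rated cannot be scored at all, which is the cold-start limitation of this family of methods.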

Dataset

  • Contains more than 8,000 user interactions and movie ratings
  • Publicly sourced, includes diverse genres, languages, and release years
  • Preprocessing steps include handling nulls, normalization, and outlier removal
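The preprocessing steps above could look roughly like the sketch below. It is an assumption-laden illustration rather than the project's pipeline: rows are assumed to be `(user_id, movie_id, rating)` tuples, "outliers" are taken to mean ratings outside a valid range (0.5 to 5.0 here, a hypothetical default), and normalization is assumed to be min-max scaling to [0, 1].

```python
def preprocess(rows, min_rating=0.5, max_rating=5.0):
    """Drop incomplete rows, discard out-of-range ratings, and min-max scale.

    rows: iterable of (user_id, movie_id, rating) tuples; any field may be None.
    Returns a list of (user_id, movie_id, normalized_rating) tuples.
    """
    cleaned = []
    for user_id, movie_id, rating in rows:
        # Handle nulls: skip rows with any missing field.
        if user_id is None or movie_id is None or rating is None:
            continue
        # Outlier removal: skip ratings outside the assumed valid range.
        if not (min_rating <= rating <= max_rating):
            continue
        # Normalization: map [min_rating, max_rating] onto [0, 1].
        normalized = (rating - min_rating) / (max_rating - min_rating)
        cleaned.append((user_id, movie_id, normalized))
    return cleaned
```

On a Spark DataFrame the same three steps would typically be expressed with `dropna`, `filter`, and a column expression instead of a Python loop.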

Dataset

This project uses the MovieLens 20M dataset (https://www.kaggle.com/datasets/grouplens/movielens-20m-dataset), which contains millions of user-movie interactions, ratings, and metadata.

For quick testing, a sample dataset (netflix_titles.csv) is included in the /data folder.

To use the full dataset:

  1. Sign in to Kaggle
  2. Download the dataset from the link above
  3. Place it in the root directory or update the path in the code accordingly
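Once the files are in place, loading the ratings into Spark might look like the sketch below. The file name `rating.csv` and the column names `userId`, `movieId`, `rating`, and `timestamp` are assumptions about the Kaggle download, and `load_ratings` is a hypothetical helper, so check them against the files you actually get and adjust the path as step 3 describes.

```python
# Explicit schema as a DDL string, so Spark does not have to infer types
# by scanning the file. Column names are assumed, not confirmed by the README.
RATINGS_SCHEMA = "userId INT, movieId INT, rating FLOAT, timestamp STRING"


def load_ratings(spark, path="rating.csv"):
    """Read the ratings CSV into a Spark DataFrame and keep the three core columns.

    spark: an active pyspark.sql.SparkSession
    path:  location of the ratings file (assumed name; update to match your setup)
    """
    ratings = spark.read.csv(path, header=True, schema=RATINGS_SCHEMA)
    return ratings.select("userId", "movieId", "rating").dropna()


# Example usage (requires PySpark to be installed):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("PySparkFlicks").getOrCreate()
#   ratings = load_ratings(spark, "rating.csv")
#   ratings.show(5)
```

Passing the schema up front avoids a second pass over a multi-million-row file, which is what `inferSchema=True` would otherwise cost.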

Results

  • Achieved RMSE = 3.7725 on our baseline implementation
  • The benchmark paper we compared against reports RMSE = 1.0742 (lower is better)
  • Offers insights into how tuning lambda (regularization), the number of iterations, and rank affects performance
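The three tuning knobs named above match the `regParam`, `maxIter`, and `rank` parameters of Spark MLlib's ALS estimator, so one plausible way to explore them is a small grid search scored by RMSE. This is a sketch under that assumption: the README does not confirm that ALS is the model used, and the grid values, column names, and helper names below are hypothetical.

```python
from itertools import product


def build_param_grid(ranks=(10, 20), reg_params=(0.01, 0.1), max_iters=(5, 10)):
    """Return every combination of rank, lambda (regParam), and iterations (maxIter)."""
    return [
        {"rank": rank, "regParam": reg, "maxIter": iters}
        for rank, reg, iters in product(ranks, reg_params, max_iters)
    ]


def tune_als(train, validation, grid):
    """Fit one ALS model per grid entry and return (best_rmse, best_params).

    train, validation: Spark DataFrames with userId, movieId, rating columns (assumed).
    """
    # Imported lazily so build_param_grid works without Spark installed.
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.recommendation import ALS

    evaluator = RegressionEvaluator(
        metricName="rmse", labelCol="rating", predictionCol="prediction"
    )
    results = []
    for params in grid:
        als = ALS(
            userCol="userId",
            itemCol="movieId",
            ratingCol="rating",
            coldStartStrategy="drop",  # drop NaN predictions for unseen users/movies
            **params,
        )
        model = als.fit(train)
        score = evaluator.evaluate(model.transform(validation))
        results.append((score, params))
    return min(results, key=lambda pair: pair[0])
```

Setting `coldStartStrategy="drop"` matters for the metric: without it, users or movies that appear only in the validation split produce NaN predictions, and the reported RMSE becomes NaN as well.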

Research & References

We've drawn inspiration and technical strategies from key works including:

For the full IEEE-style paper, check the documentation folder in this repo :)

Authors & Credits

Built by a team of graduate students as part of our coursework under the guidance of our incredible supervisor (see acknowledgments in paper). Shoutout to all contributors and cited researchers!

Future Work

  • Incorporating hybrid models (content + collaborative)
  • Introducing privacy-preserving mechanisms
  • Deploying the system on a cloud platform for live inference

License

Feel free to fork, star, and remix with credit!

Project Structure

PySparkFlicks_MovieRecommender/
├── code/                  → PySpark code and scripts
├── notebooks/             → Jupyter Notebooks for exploration
├── data/                  → Sample Netflix dataset
├── documentation/         → IEEE paper, diagrams, and references
├── .github/workflows/     → CI/CD workflows (Python)
├── requirements.txt       → Python dependencies
├── setup.py               → Installable package setup (optional)
├── README.md              → This very file
└── LICENSE                → Open-source license
