udacity-data-engineering-projects

A few sample projects from the Udacity Data Engineer Program, covering data modeling in Postgres & Apache Cassandra, setting up a cloud data warehouse, creating a data lake with Spark, and building a data pipeline with Apache Airflow.

Project 1: Data Modeling with Postgres

In this project, we apply data modeling with Postgres and build an ETL pipeline using Python. A startup wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. Currently, the data is collected in JSON format, and the analytics team is particularly interested in understanding which songs users are listening to.
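As a rough illustration of the pipeline shape, the sketch below reads one song JSON file and inserts it into a songs dimension table with psycopg2. The database name, credentials, file path, and column names here are assumptions made for the example, not the project's exact values.

```python
import json
import psycopg2

# Hypothetical connection settings; the actual project uses its own local database.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# Parameterized insert for an assumed 'songs' dimension table.
song_table_insert = """
    INSERT INTO songs (song_id, title, artist_id, year, duration)
    VALUES (%s, %s, %s, %s, %s)
    ON CONFLICT (song_id) DO NOTHING;
"""

# Each song file holds one JSON record; pull out the columns we model.
with open("data/song_data/example_song.json") as f:
    record = json.load(f)

cur.execute(song_table_insert, (
    record["song_id"],
    record["title"],
    record["artist_id"],
    record["year"],
    record["duration"],
))
conn.commit()
conn.close()
```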

Link: Data_Modeling_with_Postgres

Project 2: Data Modeling with Cassandra

In this project, we apply data modeling with Apache Cassandra and build an ETL pipeline using Python. The data model is designed around the queries we want to answer; a minimal sketch follows the list. For our use case, we want to answer the following:

  • Get the details of a song that was heard on the music app history during a particular session.
  • Get the songs played by a user during a particular session on the music app.
  • Get all users from the music app history who listened to a particular song.
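For example, the first query can be served by a table partitioned on session_id and clustered on item_in_session, so the whole answer lives in one partition. The sketch below uses the DataStax cassandra-driver; the keyspace, table, and column names are assumptions made for illustration.

```python
from cassandra.cluster import Cluster

# Connect to a local Cassandra instance and create an assumed keyspace.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS sparkify
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace("sparkify")

# Table modeled around query 1: partition by session_id, cluster by item_in_session,
# so one partition read answers "what was heard during this session?".
session.execute("""
    CREATE TABLE IF NOT EXISTS song_in_session (
        session_id int,
        item_in_session int,
        artist text,
        song_title text,
        song_length float,
        PRIMARY KEY (session_id, item_in_session)
    )
""")

# Query 1: details of the song heard during a particular session.
rows = session.execute(
    "SELECT artist, song_title, song_length FROM song_in_session "
    "WHERE session_id = %s AND item_in_session = %s",
    (338, 4),
)
for row in rows:
    print(row.artist, row.song_title, row.song_length)

cluster.shutdown()
```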

Link: Data_Modeling_with_Apache_Cassandra

Project 3: Data Warehouse with Amazon Redshift

In this project, we apply what we've learned about data warehouses and AWS to build an ETL pipeline for a database hosted on Amazon Redshift. For our use case, we need to:

  • Create table structures based on a star schema, with a fact table songplays and dimension tables user_dim, songs_dim, artists_dim, and time_dim (a brief sketch follows the list)
  • Build ETL pipelines to load these fact & dimension tables
  • Document the purpose of the database and its dimensional model.
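The sketch below shows how the pipeline might issue the fact-table DDL and a staging COPY from Python with psycopg2. The cluster endpoint, credentials, IAM role, bucket paths, column layout, and the staging table (whose DDL is omitted) are assumptions for the example; the real project reads these values from a config file.

```python
import psycopg2

# Assumed Redshift endpoint and credentials; in the project these come from a config file.
conn = psycopg2.connect(
    "host=my-cluster.abc123.us-west-2.redshift.amazonaws.com port=5439 "
    "dbname=dwh user=dwhuser password=change_me"
)
cur = conn.cursor()

# Fact table of the star schema; the dimension tables follow the same pattern.
cur.execute("""
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id BIGINT IDENTITY(0,1) PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL,
        user_id     INT NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR,
        artist_id   VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
""")

# Stage raw JSON events from S3 with COPY, then transform them into the star schema.
cur.execute("""
    COPY staging_events FROM 's3://udacity-dend/log_data'
    IAM_ROLE 'arn:aws:iam::123456789012:role/dwhRole'
    FORMAT AS JSON 's3://udacity-dend/log_json_path.json'
    REGION 'us-west-2';
""")
conn.commit()
conn.close()
```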

Link: Data_Warehouse with Amazon Redshift

Project 4: Data Lake with Spark

In this project, we apply what we've learned about Spark and data lakes to build an ETL pipeline for a data lake hosted on S3. To complete the project, we load data from S3, process it into analytics tables using Spark, and load the results back into S3. The Spark process is deployed on a cluster using AWS. For our use case, we need to:

  • Load data from an S3 bucket
  • Process the data into analytics tables using Spark (sketched below)
  • Write the tables back to another S3 bucket.
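The sketch below shows the shape of one such step with PySpark: read the raw song JSON from S3, build a songs table, and write it back as partitioned parquet. The bucket names and the hadoop-aws package version are assumptions for illustration.

```python
from pyspark.sql import SparkSession

# Assumed buckets: read from a source bucket, write analytics tables to our own lake bucket.
INPUT = "s3a://udacity-dend/"
OUTPUT = "s3a://my-sparkify-lake/"

spark = (
    SparkSession.builder
    .appName("sparkify-data-lake")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0")
    .getOrCreate()
)

# Load the raw song JSON from S3.
song_df = spark.read.json(INPUT + "song_data/*/*/*/*.json")

# Build the songs analytics table and write it back to S3 as partitioned parquet.
songs_table = (
    song_df.select("song_id", "title", "artist_id", "year", "duration")
    .dropDuplicates(["song_id"])
)
(songs_table.write
    .mode("overwrite")
    .partitionBy("year", "artist_id")
    .parquet(OUTPUT + "songs/"))

spark.stop()
```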

Link: Data_Lake with Spark & AWS S3

Project 5: Data Pipelines with Apache Airflow

In this project, we create our own custom Airflow operators to perform tasks such as staging the data, filling the data warehouse, and running checks on the data as a final step.

We need to fill in the following four template operators with the required steps:

  • stage_redshift.py - Steps for loading data from an S3 bucket into Amazon Redshift
  • load_fact.py - Steps for loading the fact table songplays_fact from the staging table stage_songs
  • load_dimension.py - Steps for loading the four dimension tables dim_users, dim_time, dim_songs, and dim_artists
  • data_quality.py - Steps for checking the data quality of the loaded tables

A helpers class containing all the SQL transformations is provided, so we don't need to write the ETL SQL ourselves; we only need to execute it from our custom operators. A sketch of one such operator is shown below.
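As an example of what one of these custom operators might look like, here is a minimal data_quality.py-style operator that fails the task when a table comes back empty. It follows the Airflow 1.x style used in the course (BaseOperator plus apply_defaults); the connection ID, the specific check, and the class details are assumptions for illustration rather than the project's exact code.

```python
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class DataQualityOperator(BaseOperator):
    """Fails the task if any of the given tables has no rows (a simple, assumed check)."""

    ui_color = "#89DA59"

    @apply_defaults
    def __init__(self, redshift_conn_id="redshift", tables=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.tables = tables or []

    def execute(self, context):
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        for table in self.tables:
            records = redshift.get_records(f"SELECT COUNT(*) FROM {table}")
            if not records or not records[0] or records[0][0] < 1:
                raise ValueError(f"Data quality check failed: {table} returned no rows")
            self.log.info("Data quality check passed: %s has %s rows", table, records[0][0])
```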

Link: Data_Pipelines with Apache Airflow
