AI-Powered Educational Rap Video Generation Pipeline

About

Automated pipeline that turns an educational topic into a short-form rap video for platforms such as Instagram Reels and TikTok. The system combines lyric generation, voice synthesis, music generation, talking head video synthesis, subtitle alignment, and final video rendering into one workflow.

Demo: Instagram Post
Hackathon: AI in Consumer

Goal: make knowledge content more engaging, memorable, and easier to consume through short, entertaining videos.

Overview

This project explores how multimodal AI can be used to generate educational media automatically. Given a topic and an instructor persona, the pipeline produces:

AI-generated educational rap lyrics
synthesized vocal audio
generated background music
a talking head video with lip-synced animation
word-level animated subtitles
a final edited short-form video

The heavy compute parts of the pipeline are designed to run on Modal, using GPU-backed infrastructure for model inference and video generation.

Demo

You can view a sample output here: Instagram Demo

Project visuals (idea.png and workflow.png) are stored in docs/images/.

⭐ My Contributions

This repository was developed as a team project. My main contributions focused on the Python pipeline and inference workflow.

My work included:

building and maintaining core Python scripts for the generation pipeline
integrating Modal for remote GPU inference
adapting and running inference based on Real3D-Portrait
connecting the steps across lyrics, audio, talking head generation, subtitles, and final rendering
helping structure the end-to-end workflow so the system could produce short educational rap videos from prompts

Pipeline

The workflow consists of several stages:

1. Lyric Generation

An LLM generates short, catchy educational rap lyrics from a user-provided topic.
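The README does not show the prompt or name the LLM, so the following is only a minimal sketch of how a topic and an instructor persona might be combined into a prompt. The function name `build_lyric_prompt`, its parameters, and the prompt wording are all hypothetical.

```python
def build_lyric_prompt(topic: str, persona: str, max_lines: int = 16) -> str:
    """Build a prompt asking an LLM for short educational rap lyrics.

    Hypothetical helper: the prompt actually used by the project is not shown.
    """
    topic = topic.strip()
    persona = persona.strip()
    if not topic:
        raise ValueError("topic must not be empty")
    return (
        f"You are {persona or 'a friendly teacher'}.\n"
        f"Write a short, catchy educational rap about: {topic}.\n"
        f"Use at most {max_lines} lines, keep every fact accurate, "
        "and return only the lyrics."
    )
```

The returned string would then be sent to whichever LLM client the pipeline uses; limiting the line count is one way to keep the vocals within a short-form video length.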

2. Voice Synthesis

The lyrics are converted into vocal audio using a text-to-speech system.

3. Music Generation

A background beat is generated to match the style and pacing of the vocals.

4. Talking Head Synthesis

A talking portrait video is generated using Real3D-Portrait, with lip movements synchronized to the generated audio.

5. Subtitle Alignment

The audio is transcribed with whisper-timestamped to obtain word-level timestamps for subtitle animation.
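As an illustration of how word-level timestamps can drive subtitle animation, the sketch below groups words into short on-screen chunks. It assumes the transcription result is a dict with a "segments" list whose entries carry a "words" list of {"text", "start", "end"} items, which is the shape whisper-timestamped returns. The grouping rule (`max_words`, `max_gap`) and the function names are assumptions, not the project's actual logic.

```python
from typing import Dict, List


def _to_chunk(words: List[Dict]) -> Dict:
    """Merge consecutive words into one subtitle chunk."""
    return {
        "text": " ".join(w["text"].strip() for w in words),
        "start": words[0]["start"],
        "end": words[-1]["end"],
    }


def group_words(result: Dict, max_words: int = 3, max_gap: float = 0.6) -> List[Dict]:
    """Group word-level timestamps into short subtitle chunks.

    A new chunk starts when the current one already holds `max_words` words
    or when the pause before the next word is longer than `max_gap` seconds.
    """
    words = [w for seg in result.get("segments", []) for w in seg.get("words", [])]
    chunks: List[Dict] = []
    current: List[Dict] = []
    for word in words:
        too_long = len(current) >= max_words
        long_pause = bool(current) and word["start"] - current[-1]["end"] > max_gap
        if current and (too_long or long_pause):
            chunks.append(_to_chunk(current))
            current = []
        current.append(word)
    if current:
        chunks.append(_to_chunk(current))
    return chunks
```

Short chunks with a pause threshold tend to suit rap vocals, where words arrive quickly and a long line of text would lag behind the audio.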

6. Final Rendering

The final short-form video is rendered with subtitles, audio, and generated visuals using FFmpeg and MoviePy.
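Before rendering, subtitle timings have to be handed to the renderer in some form. The repository has an output/subtitles/ directory, but the file format is not stated; the sketch below assumes plain SRT purely for illustration. The chunk shape (dicts with "text", "start", and "end" in seconds) and both function names are hypothetical.

```python
from typing import Dict, Iterable


def format_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(chunks: Iterable[Dict]) -> str:
    """Render subtitle chunks as SRT text, numbering cues from 1."""
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        start = format_timestamp(chunk["start"])
        end = format_timestamp(chunk["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{chunk['text']}\n")
    # Cues are separated by one blank line.
    return "\n".join(blocks)
```

An SRT file is static text, so animated word-by-word effects would still need to be drawn by the video assembly step itself.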

Tech Stack

This project integrates several tools and models:

Python
Modal for remote GPU inference
Real3D-Portrait for talking head generation
whisper-timestamped for word-level subtitle timing
FFmpeg for media processing
MoviePy for video assembly and rendering
Docker for reproducible environment setup

Model and Infrastructure

Talking Head Generation

This project uses the model from the ICLR 2024 Spotlight paper "Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis" (Paper Link).

Cloud Compute

The main inference workflow runs on Modal, which allows the pipeline to use containerized GPU environments on demand.

Reproducibility

A Dockerfile is included to make the software environment more reproducible across systems.

Repository Structure

Knowunity_project/
├── Dockerfile
├── README.md
├── .gitignore
├── docs/
│   └── images/
│       ├── idea.png
│       └── workflow.png
├── data/
│   ├── input/
│   │   ├── audio/
│   │   │   ├── examples/
│   │   │   ├── reference/
│   │   │   ├── rap_nomusic_66s.wav
│   │   │   └── rap_with_music_66s.mp3
│   │   ├── images/
│   │   │   └── examples/
│   │   └── video/
│   │       ├── examples/
│   │       └── kendrick14s.mp4
│   ├── processed/
│   │   ├── audio/
│   │   │   └── rap_with_music_66s_16khz.wav
│   │   └── video/
│   │       └── kendrick14s_512x512.mp4
│   └── sample_output/
├── notebooks/
│   ├── add_subtitles.ipynb
│   └── knowunity-project_Poyen.ipynb
├── output/
│   ├── audio/
│   ├── text/
│   ├── subtitles/
│   └── video/
├── logs/
├── src/
│   ├── helpers/
│   ├── add_subtitles_modal.py
│   ├── generate_content.py
│   ├── post_to_instagram.py
│   ├── preprocess_data.py
│   └── run_modal.py
└── tests/
    ├── test_repo_structure.py
    └── test_readme.py

Setup

Clone the repository (or your fork of it):

git clone https://github.com/isthatgopro/Knowunity_project.git
cd Knowunity_project

Install Modal:

pip install modal

Authenticate Modal:

modal token new

Make sure your Modal account has the required secrets configured, such as a Hugging Face token if needed by the model pipeline.

How to Run

This project is designed so that local scripts orchestrate the workflow while the heavier inference steps run remotely on Modal.

Step 1: Preprocess input data

Prepare the input audio and video files:

python src/preprocess_data.py \
  --input-dir data/input \
  --output-dir data/processed \
  --audio-file your_audio.mp3 \
  --video-file your_video.mp4

This step converts raw input files into formats suitable for downstream inference.
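The processed file names in the repository (rap_with_music_66s_16khz.wav, kendrick14s_512x512.mp4) suggest that audio is resampled to 16 kHz and video is scaled to 512x512. A minimal sketch of equivalent FFmpeg invocations follows; the contents of src/preprocess_data.py are not shown, so the function names, the mono downmix, and the exact flags chosen are assumptions.

```python
from pathlib import Path
from typing import List


def audio_command(src: Path, out_dir: Path, sample_rate: int = 16000) -> List[str]:
    """Build ffmpeg arguments that resample audio to a WAV at `sample_rate` Hz."""
    dst = out_dir / "audio" / f"{src.stem}_{sample_rate // 1000}khz.wav"
    # -ar sets the sample rate; -ac 1 downmixes to mono (an assumption here).
    return ["ffmpeg", "-y", "-i", str(src), "-ar", str(sample_rate), "-ac", "1", str(dst)]


def video_command(src: Path, out_dir: Path, size: int = 512) -> List[str]:
    """Build ffmpeg arguments that scale video to `size` x `size` pixels."""
    dst = out_dir / "video" / f"{src.stem}_{size}x{size}.mp4"
    # scale=W:H stretches non-square input; crop first to keep the aspect ratio.
    return ["ffmpeg", "-y", "-i", str(src), "-vf", f"scale={size}:{size}", str(dst)]
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the output directories exist.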

Step 2: Generate the talking head video

Run the main Modal pipeline:

python src/run_modal.py \
  --src-img data/input/images/your_source_image.png \
  --drv-aud data/processed/audio/your_audio_16khz.wav \
  --drv-pose data/processed/video/your_video_512x512.mp4 \
  --bg-img data/input/images/your_background.png \
  --out-name my_video.mp4

This step generates the main video output and saves it under output/ (output/video/my_video.mp4 in this example, which is the path Step 3 reads).

Step 3: Add dynamic subtitles

Generate word-level subtitles and render the final shareable video:

python src/add_subtitles_modal.py \
  --input-video output/video/my_video.mp4 \
  --output-video output/video/my_video_with_subs.mp4 \
  --gpu H100 \
  --model medium

The final video is written to the path given by --output-video (output/video/my_video_with_subs.mp4 in this example).
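The three steps can also be chained from a single script. The sketch below only assembles the command lines shown in this section and runs them in order; the helper names are hypothetical, and the derived file names (`_16khz.wav`, `_512x512.mp4`, `output/video/`) are assumptions based on the example paths above.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def build_steps(audio: str, video: str, src_img: str, bg_img: str,
                out_name: str) -> List[List[str]]:
    """Return the three pipeline commands as argument lists."""
    py = sys.executable
    # Assumed naming of the preprocessed files, following the examples above.
    audio_16k = f"data/processed/audio/{Path(audio).stem}_16khz.wav"
    video_512 = f"data/processed/video/{Path(video).stem}_512x512.mp4"
    raw_video = f"output/video/{out_name}"
    final_video = f"output/video/{Path(out_name).stem}_with_subs.mp4"
    return [
        [py, "src/preprocess_data.py", "--input-dir", "data/input",
         "--output-dir", "data/processed",
         "--audio-file", audio, "--video-file", video],
        [py, "src/run_modal.py", "--src-img", src_img, "--drv-aud", audio_16k,
         "--drv-pose", video_512, "--bg-img", bg_img, "--out-name", out_name],
        [py, "src/add_subtitles_modal.py", "--input-video", raw_video,
         "--output-video", final_video, "--gpu", "H100", "--model", "medium"],
    ]


def run_all(steps: List[List[str]]) -> None:
    """Run each step in order and stop at the first failure."""
    for cmd in steps:
        subprocess.run(cmd, check=True)
```

Calling `run_all(build_steps("your_audio.mp3", "your_video.mp4", "data/input/images/your_source_image.png", "data/input/images/your_background.png", "my_video.mp4"))` from the repository root would reproduce the manual sequence.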

Tests

This repository includes lightweight checks for project structure and documentation.

Run tests with:

pip install pytest
pytest -q
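The contents of tests/test_repo_structure.py are not reproduced here. As a rough idea of what such a lightweight structure check could look like, the sketch below reports which expected paths are missing under a repository root. The helper name and the list of expected paths are assumptions drawn from the repository structure listed earlier.

```python
from pathlib import Path
from typing import Iterable, List

# Hypothetical selection of files a structure check might require.
EXPECTED_PATHS = (
    "README.md",
    "Dockerfile",
    "src/preprocess_data.py",
    "src/run_modal.py",
    "src/add_subtitles_modal.py",
)


def missing_paths(root: Path, expected: Iterable[str] = EXPECTED_PATHS) -> List[str]:
    """Return the expected relative paths that do not exist under `root`."""
    return [rel for rel in expected if not (root / rel).exists()]
```

Inside a pytest test, asserting `missing_paths(repo_root) == []` produces a failure message that lists exactly which files are absent.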

Example Use Case

A user provides:

a subject, such as a math concept
an instructor persona or character style
supporting media assets if needed

The system then generates a short educational rap video that explains the topic through lyrics, audio, and a talking head presentation.

Limitations

This project is currently a prototype and has several limitations:

output quality depends on prompt quality and input assets
subtitle timing and lip sync quality may vary
generated media may still require manual review
the workflow is built primarily for experimentation and demos rather than production deployment
some steps depend on external APIs, model availability, or cloud setup

What I Learned

Through this project, I gained hands-on experience with:

building and debugging a multimodal generation pipeline
orchestrating remote GPU inference with Modal
integrating research models into a working application pipeline
handling practical issues in audio, video, and subtitle processing
designing a project that connects machine learning outputs to a user-facing demo

Acknowledgements

The talking head generation component is based on the work of the Real3D-Portrait authors.
Cloud GPU infrastructure was provided through Modal.
This repo reflects a collaborative team project, with my main contributions centered on Python pipeline development and inference integration.
