Automated pipeline that turns an educational topic into a short-form rap video for platforms such as Instagram Reels and TikTok. The system combines lyric generation, voice synthesis, music generation, talking head video synthesis, subtitle alignment, and final video rendering into one workflow.
Demo: Instagram Post
Hackathon: AI in Consumer
Goal: make knowledge content more engaging, memorable, and easier to consume through short, entertaining videos.
This project explores how multimodal AI can be used to generate educational media automatically. Given a topic and an instructor persona, the pipeline produces:
- AI-generated educational rap lyrics
- synthesized vocal audio
- generated background music
- a talking head video with lip-synced animation
- word-level animated subtitles
- a final edited short-form video
The compute-heavy parts of the pipeline are designed to run on Modal, using GPU-backed infrastructure for model inference and video generation.
You can view a sample output here: Instagram Demo
Project visuals are stored in docs/images/ (idea.png and workflow.png).
This repository was developed as a team project. My main contributions focused on the Python pipeline and inference workflow.
My work included:
- building and maintaining core Python scripts for the generation pipeline
- integrating Modal for remote GPU inference
- adapting and running inference based on Real3D-Portrait
- connecting the steps across lyrics, audio, talking head generation, subtitles, and final rendering
- helping structure the end-to-end workflow so the system could produce short educational rap videos from prompts
The workflow consists of six stages:
1. An LLM generates short, catchy educational rap lyrics from a user-provided topic.
2. The lyrics are converted into vocal audio using a text-to-speech system.
3. A background beat is generated to match the style and pacing of the vocals.
4. A talking portrait video is generated using Real3D-Portrait, with lip movements synchronized to the generated audio.
5. The audio is transcribed with whisper-timestamped to obtain word-level timestamps for subtitle animation.
6. The final short-form video is rendered with subtitles, audio, and generated visuals using video processing tools.
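The stage order described above can be sketched as a list of functions that each record one artifact. This is a minimal illustration under stated assumptions, not the repository's actual code: `PipelineState`, `make_stage`, `run_pipeline`, the stage names, and the artifact file names are all hypothetical, and each stage body is a placeholder for the real model call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PipelineState:
    """Inputs plus the artifact paths produced so far (hypothetical structure)."""
    topic: str
    persona: str
    artifacts: dict = field(default_factory=dict)


def make_stage(output_path: str, requires: tuple = ()) -> Callable[[PipelineState], str]:
    """Build a placeholder stage that checks its upstream artifacts, then returns its output path."""
    def stage(state: PipelineState) -> str:
        missing = [name for name in requires if name not in state.artifacts]
        if missing:
            raise RuntimeError(f"missing upstream artifacts: {missing}")
        # A real stage would call an LLM, a TTS system, or a video model here.
        return output_path
    return stage


# Stage order mirrors the workflow description; file names are illustrative only.
STAGES = [
    ("lyrics", make_stage("output/text/lyrics.txt")),
    ("vocals", make_stage("output/audio/vocals.wav", requires=("lyrics",))),
    ("beat", make_stage("output/audio/beat.wav", requires=("vocals",))),
    ("talking_head", make_stage("output/video/talking_head.mp4", requires=("vocals",))),
    ("subtitles", make_stage("output/subtitles/words.json", requires=("vocals",))),
    ("final_video", make_stage("output/video/final.mp4",
                               requires=("beat", "talking_head", "subtitles"))),
]


def run_pipeline(state: PipelineState, stages=STAGES) -> PipelineState:
    """Run each stage in order and store its output path under the stage name."""
    for name, stage in stages:
        state.artifacts[name] = stage(state)
    return state
```

Making each stage declare what it requires means an out-of-order or skipped step fails early with a clear message instead of producing a broken video later.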
This project integrates several tools and models:
- Python
- Modal for remote GPU inference
- Real3D-Portrait for talking head generation
- whisper-timestamped for word-level subtitle timing
- FFmpeg for media processing
- MoviePy for video assembly and rendering
- Docker for reproducible environment setup
This project uses the model from the ICLR 2024 Spotlight paper "Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis".
The main inference workflow runs on Modal, which allows the pipeline to use containerized GPU environments on demand.
A Dockerfile is included to make the software environment more reproducible across systems.
Knowunity_project/
├── Dockerfile
├── README.md
├── .gitignore
├── docs/
│   └── images/
│       ├── idea.png
│       └── workflow.png
├── data/
│   ├── input/
│   │   ├── audio/
│   │   │   ├── examples/
│   │   │   ├── reference/
│   │   │   ├── rap_nomusic_66s.wav
│   │   │   └── rap_with_music_66s.mp3
│   │   ├── images/
│   │   │   └── examples/
│   │   └── video/
│   │       ├── examples/
│   │       └── kendrick14s.mp4
│   ├── processed/
│   │   ├── audio/
│   │   │   └── rap_with_music_66s_16khz.wav
│   │   └── video/
│   │       └── kendrick14s_512x512.mp4
│   └── sample_output/
├── notebooks/
│   ├── add_subtitles.ipynb
│   └── knowunity-project_Poyen.ipynb
├── output/
│   ├── audio/
│   ├── text/
│   ├── subtitles/
│   └── video/
├── logs/
├── src/
│   ├── helpers/
│   ├── add_subtitles_modal.py
│   ├── generate_content.py
│   ├── post_to_instagram.py
│   ├── preprocess_data.py
│   └── run_modal.py
└── tests/
    ├── test_repo_structure.py
    └── test_readme.py
Clone your fork of the repository:
git clone https://github.com/isthatgopro/Knowunity_project.git
cd Knowunity_project
Install Modal:
pip install modal
Authenticate Modal:
modal token new
Make sure your Modal account has the required secrets configured, such as a Hugging Face token if needed by the model pipeline.
This project is designed so that local scripts orchestrate the workflow while the heavier inference steps run remotely on Modal.
Prepare the input audio and video files:
python src/preprocess_data.py \
  --input-dir data/input \
  --output-dir data/processed \
  --audio-file your_audio.mp3 \
  --video-file your_video.mp4
This step converts raw input files into formats suitable for downstream inference.
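The file names under data/processed/ (`_16khz.wav`, `_512x512.mp4`) suggest that preprocessing resamples audio to 16 kHz and resizes video to 512x512. The sketch below shows roughly equivalent FFmpeg commands built in Python. It is an assumption about what the step does, not the contents of `preprocess_data.py`: the function names are hypothetical, and the mono downmix (`-ac 1`) is a guess.

```python
from pathlib import Path


def audio_to_16khz_cmd(src: Path, dst_dir: Path) -> list:
    """Build an FFmpeg command that resamples audio to 16 kHz WAV."""
    dst = dst_dir / f"{src.stem}_16khz.wav"
    # -ar sets the sample rate; -ac 1 (mono) is an assumption, not confirmed by the repo.
    return ["ffmpeg", "-y", "-i", str(src), "-ar", "16000", "-ac", "1", str(dst)]


def video_to_512_cmd(src: Path, dst_dir: Path) -> list:
    """Build an FFmpeg command that resizes video to 512x512."""
    dst = dst_dir / f"{src.stem}_512x512.mp4"
    # scale=512:512 stretches to a square and does not preserve the aspect ratio.
    return ["ffmpeg", "-y", "-i", str(src), "-vf", "scale=512:512", str(dst)]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` on a machine with FFmpeg installed. Non-square footage may need cropping before scaling to avoid distortion.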
Run the main Modal pipeline:
python src/run_modal.py \
  --src-img data/input/images/your_source_image.png \
  --drv-aud data/processed/audio/your_audio_16khz.wav \
  --drv-pose data/processed/video/your_video_512x512.mp4 \
  --bg-img data/input/images/your_background.png \
  --out-name my_video.mp4
This step generates the main video output and saves it to output/.
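When this step is driven from another script, it can help to check the inputs before starting a remote GPU job. The sketch below builds the same command line as above and reports missing input files. Only the flag names come from this README; the helper names `build_run_modal_cmd` and `missing_inputs` are hypothetical.

```python
import sys
from pathlib import Path


def build_run_modal_cmd(src_img, drv_aud, drv_pose, bg_img, out_name,
                        script="src/run_modal.py"):
    """Return the argv list for the run_modal.py invocation documented above."""
    return [
        sys.executable, script,
        "--src-img", str(src_img),
        "--drv-aud", str(drv_aud),
        "--drv-pose", str(drv_pose),
        "--bg-img", str(bg_img),
        "--out-name", out_name,
    ]


def missing_inputs(*paths):
    """Return the input paths that do not exist as files, in the order given."""
    return [str(p) for p in paths if not Path(p).is_file()]
```

A caller would typically run `missing_inputs(...)` on the four input paths first and only pass the command to `subprocess.run(cmd, check=True)` when the returned list is empty.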
Generate word-level subtitles and render the final shareable video:
python src/add_subtitles_modal.py \
  --input-video output/video/my_video.mp4 \
  --output-video output/video/my_video_with_subs.mp4 \
  --gpu H100 \
  --model medium
The final output will be saved in the output/ directory.
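Word-level timestamps are usually grouped into short on-screen cues before rendering. The sketch below shows one way to do that. It assumes the transcription result follows the whisper-timestamped layout, where each segment carries a `words` list with `text`, `start`, and `end` fields. The function names and the `max_words` and `max_gap` parameters are hypothetical and are not taken from `add_subtitles_modal.py`.

```python
def words_from_result(result: dict) -> list:
    """Flatten the per-segment word lists into one list (assumed result layout)."""
    return [w for seg in result.get("segments", []) for w in seg.get("words", [])]


def _make_cue(words: list) -> dict:
    """Merge consecutive words into one subtitle cue."""
    return {
        "start": words[0]["start"],
        "end": words[-1]["end"],
        "text": " ".join(w["text"].strip() for w in words),
    }


def group_words(words: list, max_words: int = 3, max_gap: float = 0.6) -> list:
    """Start a new cue after max_words words or after a pause longer than max_gap seconds."""
    cues, current = [], []
    for word in words:
        too_long = len(current) >= max_words
        paused = bool(current) and word["start"] - current[-1]["end"] > max_gap
        if current and (too_long or paused):
            cues.append(_make_cue(current))
            current = []
        current.append(word)
    if current:
        cues.append(_make_cue(current))
    return cues
```

Short cues of two or three words tend to suit fast rap delivery on vertical video. Both thresholds are tuning knobs, not values from the project.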
This repository includes lightweight checks for project structure and documentation.
Run tests with:
pip install pytest
pytest -q
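As an illustration of what a lightweight structure check can look like, the sketch below reports which expected paths are missing from a repository root. It is not the code in `tests/test_repo_structure.py`: the helper name `missing_paths` and the choice of expected paths (taken from the directory tree in this README) are assumptions.

```python
from pathlib import Path

# Paths taken from the directory tree in this README; the selection is illustrative.
EXPECTED_PATHS = [
    "README.md",
    "Dockerfile",
    "src/preprocess_data.py",
    "src/run_modal.py",
    "src/add_subtitles_modal.py",
]


def missing_paths(repo_root, expected=EXPECTED_PATHS) -> list:
    """Return the expected relative paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in expected if not (root / rel).exists()]
```

A pytest test built on this helper would assert that `missing_paths(repo_root)` returns an empty list.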
A user provides:
- a subject, such as a math concept
- an instructor persona or character style
- supporting media assets if needed
The system then generates a short educational rap video that explains the topic through lyrics, audio, and a talking head presentation.
This project is currently a prototype and has several limitations:
- output quality depends on prompt quality and input assets
- subtitle timing and lip sync quality may vary
- generated media may still require manual review
- the workflow is built primarily for experimentation and demos rather than production deployment
- some steps depend on external APIs, model availability, or cloud setup
Through this project, I gained hands-on experience with:
- building and debugging a multimodal generation pipeline
- orchestrating remote GPU inference with Modal
- integrating research models into a working application pipeline
- handling practical issues in audio, video, and subtitle processing
- designing a project that connects machine learning outputs to a user-facing demo
- The talking head generation component is based on the work of the Real3D-Portrait authors
- Cloud GPU infrastructure was provided through Modal
- This repo reflects a collaborative team project, with my main contributions centered on Python pipeline development and inference integration

