
@zxdxjtu zxdxjtu commented Jul 8, 2025

Add Google Gemini TTS Integration

Summary

This PR adds support for the Google Gemini Text-to-Speech (TTS) API, giving users 15 high-quality voice options for video
narration. Users who only have a Gemini API key can use Google's TTS without signing up for any additional
API service.

What's Changed

✨ New Features

  • Google Gemini TTS Integration: Added gemini_tts() function in app/services/voice.py with proper Linear PCM audio handling
  • 15 Voice Options: Support for diverse voices including:
    • Female voices: Zephyr, Kore, Fennel, Aoede
    • Male voices: Puck, Charon, Krypton, Orion, Pegasus
    • Each voice has unique characteristics suitable for different content styles
  • WebUI Integration: Added "Google Gemini TTS" option in audio settings dropdown

🐛 Bug Fixes

  • Audio Format Issue: Fixed a critical bug where the Gemini API response was incorrectly decoded as base64, producing
    0.07-second audio files
  • PCM Audio Handling: Properly handle the Linear PCM format (24 kHz, mono, 16-bit) returned by the Gemini API
  • Dependency Management: Made faster_whisper optional to improve compatibility
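
The optional-dependency change can be pictured roughly as follows. This is a minimal sketch, not the code from this PR: the helper names `optional_import` and `whisper_available` are hypothetical, and the only assumption taken from the description is that a missing faster_whisper should produce a warning rather than an import error.

```python
import importlib
import logging
from types import ModuleType
from typing import Optional

logger = logging.getLogger(__name__)


def optional_import(module_name: str) -> Optional[ModuleType]:
    """Return the imported module, or None (with a warning) if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError, so it is covered too.
        logger.warning(
            "%s is not installed; features that depend on it are disabled",
            module_name,
        )
        return None


# Hypothetical module-level handle: None when faster_whisper is absent.
faster_whisper = optional_import("faster_whisper")


def whisper_available() -> bool:
    """Callers can check this before taking a Whisper-based subtitle path."""
    return faster_whisper is not None
```

Code that needs the library can then branch on `whisper_available()` instead of failing at import time.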

🔧 Technical Details

  • The Gemini API returns raw PCM audio bytes, not base64-encoded data
  • Audio is converted to MP3 format for compatibility with the video processing pipeline
  • Integrated with existing SubMaker for proper subtitle synchronization
  • Maintains compatibility with existing TTS providers (Edge TTS, Azure, etc.)
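
To illustrate the PCM handling described above, here is a small sketch that uses only the Python standard library. It assumes the 24 kHz, mono, 16-bit format stated in this PR; the function names are hypothetical and this is not the `gemini_tts()` implementation. It wraps headerless PCM bytes in a WAV container and computes the clip duration, which is a quick way to spot a truncated result such as the 0.07-second files mentioned earlier.

```python
import io
import wave

# Format stated in the PR description (assumed to hold for every response).
SAMPLE_RATE_HZ = 24_000
SAMPLE_WIDTH_BYTES = 2  # 16-bit samples
CHANNELS = 1            # mono


def pcm_duration_seconds(pcm: bytes) -> float:
    """Duration implied by the byte count of raw Linear PCM data."""
    return len(pcm) / (SAMPLE_RATE_HZ * SAMPLE_WIDTH_BYTES * CHANNELS)


def pcm_to_wav_bytes(pcm: bytes) -> bytes:
    """Wrap headerless PCM bytes in a WAV container held in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(CHANNELS)
        wav_file.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav_file.setframerate(SAMPLE_RATE_HZ)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


if __name__ == "__main__":
    one_second_of_silence = bytes(SAMPLE_RATE_HZ * SAMPLE_WIDTH_BYTES * CHANNELS)
    print(pcm_duration_seconds(one_second_of_silence))
    print(len(pcm_to_wav_bytes(one_second_of_silence)))
```

The resulting WAV data can then be handed to an encoder for the MP3 step, for example pydub's `AudioSegment.from_wav(...).export(path, format="mp3")`, which requires ffmpeg to be installed.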

Testing

  • ✅ Tested with various text lengths (5-55 characters)
  • ✅ Verified both English and Chinese text synthesis
  • ✅ Confirmed proper audio duration (1-12 seconds for test cases)
  • ✅ Full video generation pipeline tested successfully

How to Use

  1. Add your Gemini API key to config.toml:
    gemini_api_key = "your-api-key-here"
  2. Select "Google Gemini TTS" in WebUI audio settings
  3. Choose from 15 available voices (e.g., "gemini:Kore-Female")
  4. Generate videos with high-quality Gemini narration
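
Voice identifiers such as "gemini:Kore-Female" in step 3 appear to follow a `provider:Voice-Gender` pattern. The sketch below shows one way such a string could be split; the pattern is inferred from that single example, and `VoiceSpec` and `parse_voice_name` are hypothetical names rather than functions from this PR.

```python
from typing import NamedTuple


class VoiceSpec(NamedTuple):
    provider: str
    voice: str
    gender: str


def parse_voice_name(identifier: str) -> VoiceSpec:
    """Split 'provider:Voice-Gender' into its three parts."""
    provider, colon, rest = identifier.partition(":")
    if not colon or not provider or not rest:
        raise ValueError(f"expected 'provider:Voice-Gender', got {identifier!r}")
    # Split on the last hyphen so a voice name containing '-' stays intact.
    voice, hyphen, gender = rest.rpartition("-")
    if not hyphen or not voice or not gender:
        raise ValueError(f"expected 'provider:Voice-Gender', got {identifier!r}")
    return VoiceSpec(provider, voice, gender)


if __name__ == "__main__":
    print(parse_voice_name("gemini:Kore-Female"))
```

Only the `voice` part (here "Kore") would be passed to the TTS request; the provider prefix selects which TTS backend handles it.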

Dependencies

  • google-generativeai (for Gemini API)
  • pydub (for audio processing)
  • No other new dependencies required

Related Documentation

zxdxjtu added 2 commits July 8, 2025 10:39
- Add gemini_tts() function with proper PCM audio handling
- Support 15 Gemini voices (Zephyr, Puck, Kore, etc.)
- Fix audio data format issue preventing video generation
- Add Gemini TTS option to WebUI settings
- Update .gitignore to exclude debug files
- Add try/except import for faster_whisper
- Gracefully handle missing dependency with warning
- Prevents import errors on systems without faster_whisper
@zxdxjtu zxdxjtu marked this pull request as draft July 8, 2025 02:51
@zxdxjtu zxdxjtu marked this pull request as ready for review July 8, 2025 02:52
@alperktt

generating audio

2025-10-27 14:29:22 | INFO | "./app\services\voice.py:1472": gemini_tts - start, voice name: Zephyr, try: 1
2025-10-27 14:29:22 | ERROR | "./app\services\voice.py:1557": gemini_tts - Gemini TTS failed, error: Protocol message GenerationConfig has no "response_modalities" field.
2025-10-27 14:29:22 | ERROR | "./app\services\task.py:84": generate_audio - failed to generate audio:

  1. check if the language of the voice matches the language of the video script.
  2. check if the network is available. If you are in China, it is recommended to use a VPN and enable the global traffic mode.
2025-10-27 14:29:22 | ERROR | "./webui\Main.py:971": - Video Generation Failed
2025-10-27 14:29:45 | INFO | "./app\services\voice.py:1472": gemini_tts - start, voice name: Kore, try: 1
2025-10-27 14:29:45 | ERROR | "./app\services\voice.py:1557": gemini_tts - Gemini TTS failed, error: Protocol message GenerationConfig has no "response_modalities" field.
2025-10-27 14:29:45 | INFO | "./app\services\voice.py:1472": gemini_tts - start, voice name: Kore, try: 1
2025-10-27 14:29:45 | ERROR | "./app\services\voice.py:1557": gemini_tts - Gemini TTS failed, error: Protocol message GenerationConfig has no "response_modalities" field.

I'm not in China, though?
