skamalad/testing_gemini_live

Gemini Live API Testing Suite

A comprehensive collection of demos, tools, and experiments for Google's Gemini Live API, featuring real-time audio/video interaction, function calling, and advanced conversation management.

🚀 Overview

This repository contains multiple implementations and testing approaches for the Gemini Live API, ranging from simple proofs of concept to production-ready voice assistants, with advanced features such as:

  • Real-time Audio/Video Processing: Bidirectional audio streaming with camera and screen capture
  • Function Calling: Smart home control, weather queries, and custom tool integration
  • Advanced Conversation Management: Context optimization, session handling, and instruction management
  • Multiple Interface Options: CLI, web-based, and programmatic interfaces
  • Agent Development Kit (ADK) Integration: Professional-grade audio processing capabilities

📁 Project Structure

testing_live_gemini_tool/
├── Core Live API Demos
│   ├── gemini_live_01.py              # Basic audio processing with file I/O
│   ├── gemini_live_02.py              # Simple text-based live session
│   ├── gemini_live_tool_call.py       # Enhanced multimodal demo with tools
│   └── live_simple_cli.py             # Official DeepMind CLI implementation
│
├── Function Calling & Tools
│   ├── function_calling.py            # Smart home voice assistant
│   └── name_correction.py             # Booking management utilities
│
├── Advanced Features
│   ├── adk_audio_to_audio.py          # ADK-based audio processing
│   ├── instruction_optimization.py    # Context management system
│   └── test_minimal.py               # API connection testing
│
├── Web Interface
│   ├── index.tsx                      # TypeScript web component
│   └── simple_websocket_server.py    # WebSocket server
│
├── Configuration & Setup
│   ├── setup.py                      # Automated setup script
│   ├── requirements.txt              # Python dependencies
│   ├── env.txt                       # Environment configuration reference
│   └── .gitignore                    # Git ignore rules
│
├── Assets
│   ├── audio.wav                     # Sample audio output
│   └── sample.wav                    # Sample audio input
│
└── Virtual Environment
    └── testing_live_function_calling/ # Isolated environment for testing

🛠 Setup & Installation

Prerequisites

  • Python 3.9+ (required by the google-genai SDK)
  • Google AI Studio API Key
  • Audio devices (microphone/speakers or headphones)
  • Webcam (optional, for video demos)

Quick Setup

  1. Clone and navigate to the repository

    cd testing_live_gemini_tool
  2. Run the automated setup

    python setup.py
  3. Configure your API key

    • Edit the .env file created by setup
    • Add your Gemini API key from Google AI Studio
    GEMINI_API_KEY=your_api_key_here
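A script can fail early with a clear message when the key is missing. The sketch below is illustrative rather than the repository's own loader: `load_api_key` is a hypothetical helper, and it assumes the `.env` values have already been loaded into the process environment (for example by a dotenv loader or by exporting them in the shell).

```python
import os


def load_api_key(env=None, name="GEMINI_API_KEY"):
    """Return the API key from the environment or raise a clear error.

    `env` defaults to os.environ; passing a dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    # Treat the placeholder from the setup instructions as "not configured".
    if not key or key == "your_api_key_here":
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return key
```

Calling `load_api_key()` at startup surfaces a configuration problem before any audio device or network connection is opened.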
    

Manual Setup

If you prefer manual installation:

# Install core dependencies
# (asyncio is part of the Python standard library and must not be installed from PyPI)
pip install google-genai opencv-python pyaudio pillow mss websockets

# Install additional dependencies for advanced features
pip install google-adk daily-python pipecat-ai

# Install web dependencies (if using TypeScript components)
npm install @google/genai lit

🎯 Key Features & Demos

1. Basic Audio Processing (gemini_live_01.py)

Simple audio-to-audio processing using the Gemini Live API:

python gemini_live_01.py

Features:

  • Loads audio from sample.wav
  • Processes through Gemini Live API
  • Outputs response to audio.wav
  • Demonstrates basic audio format conversion

2. Interactive Text Session (gemini_live_02.py)

Minimal text-based live session:

python gemini_live_02.py

Features:

  • Simple "Hello" message processing
  • Text-only response modality
  • Demonstrates basic session management

3. Enhanced Multimodal Demo (gemini_live_tool_call.py)

Full-featured demo with audio, video, and function calling:

python gemini_live_tool_call.py --mode camera
# or
python gemini_live_tool_call.py --mode screen

Features:

  • Real-time camera or screen capture
  • Bidirectional audio streaming
  • Function calling capabilities
  • Enhanced logging and conversation management
  • Session statistics and conversation saving
  • Multiple input modes (text + audio + video)

Commands:

  • help - Show available commands
  • stats - Display session statistics
  • save [filename] - Save conversation log
  • clear - Clear screen
  • q or quit - Exit
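One way to separate these console commands from ordinary chat text is a small parser like the one below. It is a sketch under assumptions, not the demo's actual implementation: `parse_command` is a hypothetical name, and treating any unrecognized line as a message to send to the model is an assumed behavior.

```python
def parse_command(line):
    """Split a console line into (command, argument).

    Anything that is not a recognized command is returned as a plain
    message, so text such as "help me dim the lights" still reaches the model.
    """
    text = line.strip()
    parts = text.split(maxsplit=1)
    if not parts:
        return ("empty", None)
    word = parts[0].lower()
    arg = parts[1] if len(parts) > 1 else None
    if word == "save":
        return ("save", arg)  # the filename is optional
    if arg is None and word in ("q", "quit"):
        return ("quit", None)
    if arg is None and word in ("help", "stats", "clear"):
        return (word, None)
    return ("message", text)
```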

4. Smart Home Voice Assistant (function_calling.py)

Production-ready voice assistant with smart home integration:

python function_calling.py

Capabilities:

  • Control smart lights (on/off/dim)
  • Get weather information
  • Adjust thermostat settings
  • Natural language processing
  • Audio and text input/output

Example Commands:

  • "Turn on the living room lights"
  • "What's the weather in San Francisco?"
  • "Set the temperature to 72 degrees"
  • "Dim the bedroom lights to 30%"
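When the model decides to call a function, the assistant has to map the function name and arguments onto local Python code. The sketch below shows one common dispatch pattern. The handler names (`set_light`, `set_thermostat`), their parameters, and the result shape are all hypothetical and are not taken from function_calling.py.

```python
def set_light(room, on=True, brightness=100):
    """Hypothetical handler: pretend to switch or dim a light."""
    return {"room": room, "on": on, "brightness": brightness}


def set_thermostat(temperature_f):
    """Hypothetical handler: pretend to set the thermostat."""
    return {"temperature_f": temperature_f}


# Registry of callable tools, keyed by the name the model would use.
HANDLERS = {"set_light": set_light, "set_thermostat": set_thermostat}


def dispatch(name, args):
    """Run the handler for a function call and wrap the outcome."""
    handler = HANDLERS.get(name)
    if handler is None:
        return {"error": f"unknown function: {name}"}
    try:
        return {"result": handler(**args)}
    except TypeError as exc:  # missing or unexpected arguments
        return {"error": str(exc)}
```

A request such as "Dim the bedroom lights to 30%" would then arrive as something like `dispatch("set_light", {"room": "bedroom", "brightness": 30})`, and the returned dictionary is what gets sent back to the model.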

5. Agent Development Kit Integration (adk_audio_to_audio.py)

Professional-grade audio processing using Google's Agent Development Kit (ADK):

python adk_audio_to_audio.py

Features:

  • Advanced audio processing pipeline
  • Name correction utilities integration
  • Session management with persistent context
  • Order status checking functionality
  • Professional audio quality optimization

6. Web Interface (index.tsx)

TypeScript/Lit-based web component for browser integration:

Features:

  • Browser-based audio recording
  • Real-time audio playback
  • Visual audio indicators
  • WebRTC audio processing
  • Responsive web interface

7. API Testing Suite (test_minimal.py)

Comprehensive API connection testing:

python test_minimal.py

Tests:

  • Multiple Gemini model compatibility
  • API key validation
  • Connection stability
  • Error handling verification

🔧 Advanced Features

Context Optimization (instruction_optimization.py)

Intelligent conversation context management:

Features:

  • Dynamic prompt assembly
  • Session-based context injection
  • Conversation phase tracking
  • Memory-efficient context segmentation

Usage:

from instruction_optimization import SessionMetadata, ConversationPhase

# Create session metadata
session = SessionMetadata(
    user_id="user123",
    session_id="session456",
    current_phase=ConversationPhase.GREETING
)
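The idea of dynamic prompt assembly can be illustrated independently of the module. The sketch below is not the implementation in instruction_optimization.py: `Phase`, `SEGMENTS`, and `assemble_prompt` are hypothetical stand-ins, and the instruction texts are invented for the example. Only the segment for the current phase is included, which keeps the system instruction short.

```python
from enum import Enum


class Phase(Enum):
    """Hypothetical conversation phases."""
    GREETING = "greeting"
    TASK = "task"
    CLOSING = "closing"


BASE = "You are a helpful voice assistant."

# One instruction segment per phase; only the active one is sent.
SEGMENTS = {
    Phase.GREETING: "Greet the user briefly and ask how you can help.",
    Phase.TASK: "Focus on completing the user's request; call tools when needed.",
    Phase.CLOSING: "Summarize what was done and say goodbye.",
}


def assemble_prompt(phase, user_id=None):
    """Build the system instruction for the current phase."""
    parts = [BASE, SEGMENTS[phase]]
    if user_id:
        parts.append(f"Current user: {user_id}.")
    return "\n".join(parts)
```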

Name Correction Utilities (name_correction.py)

Booking management and name correction system:

Correction Types:

  • Spelling corrections
  • Name swaps
  • Gender corrections
  • Maiden name changes
  • Title removals

Environment Configuration

Check your environment against env.txt, which lists everything needed for a complete setup:

  • Google Cloud credentials
  • API keys and authentication
  • Audio processing libraries
  • Development dependencies

🌐 API Models Supported

  • gemini-2.0-flash-exp - Latest experimental model
  • gemini-2.0-flash - Production flash model
  • gemini-1.5-flash - Fast processing model
  • gemini-1.5-pro - Advanced reasoning model
  • gemini-2.5-flash-preview-native-audio-dialog - Native audio dialog
  • gemini-live-2.5-flash-preview - Live preview model

🎵 Audio Configuration

Input Audio

  • Format: PCM 16-bit
  • Sample Rate: 16,000 Hz
  • Channels: Mono
  • Chunk Size: 1024 bytes

Output Audio

  • Format: PCM 16-bit
  • Sample Rate: 24,000 Hz
  • Channels: Mono
  • Voice Options: Zephyr, Kore
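These settings determine how much data flows per second and how much audio each chunk holds. The arithmetic is sketched below. The helper names are illustrative, and the chunk size is read as bytes, as stated above.

```python
BYTES_PER_SAMPLE = 2  # 16-bit PCM
CHANNELS = 1          # mono


def bytes_per_second(rate_hz):
    """Raw PCM data rate for 16-bit mono audio."""
    return rate_hz * BYTES_PER_SAMPLE * CHANNELS


def chunk_duration_ms(chunk_bytes, rate_hz):
    """How many milliseconds of audio a chunk of the given size holds."""
    return 1000 * chunk_bytes / bytes_per_second(rate_hz)


print(bytes_per_second(16_000))         # 32000 bytes/s of microphone input
print(bytes_per_second(24_000))         # 48000 bytes/s of model output
print(chunk_duration_ms(1024, 16_000))  # 32.0 ms per 1024-byte input chunk
```

If the chunk size is instead counted in frames, as PyAudio's `frames_per_buffer` is, a 1024-frame buffer at 16 kHz holds 64 ms of audio.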

🔒 Security Best Practices

  1. API Key Protection

    • Store API keys in .env files
    • Never commit API keys to version control
    • Use environment variables in production
  2. Audio Privacy

    • Use headphones to prevent audio feedback
    • Be aware of microphone permissions
    • Consider audio data handling policies
  3. Function Calling Security

    • Validate all function parameters
    • Implement proper error handling
    • Use permission-based access controls
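Parameter validation for a function call can look like the sketch below. It assumes a hypothetical light-control function with `room` and `brightness` arguments, and the allowlist of rooms is invented for the example. The point is that values coming from the model are checked before they reach any device.

```python
ALLOWED_ROOMS = {"living room", "bedroom", "kitchen"}  # hypothetical allowlist


def validate_light_args(args):
    """Return cleaned arguments or raise ValueError with a clear reason."""
    room = str(args.get("room", "")).strip().lower()
    if room not in ALLOWED_ROOMS:
        raise ValueError(f"unknown room: {room!r}")
    brightness = args.get("brightness", 100)
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(brightness, bool) or not isinstance(brightness, (int, float)):
        raise ValueError("brightness must be a number")
    if not 0 <= brightness <= 100:
        raise ValueError("brightness must be between 0 and 100")
    return {"room": room, "brightness": int(brightness)}
```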

🐛 Troubleshooting

Common Issues

  1. Audio Feedback

    • Solution: Use headphones instead of speakers
    • Cause: Microphone picks up speaker output
  2. API Connection Errors

    • Solution: Verify API key in .env file
    • Check: Internet connection and firewall settings
  3. Module Import Errors

    • Solution: Install missing dependencies with pip install -r requirements.txt
    • Check: Virtual environment activation
  4. Camera/Screen Capture Issues

    • Solution: Grant necessary permissions to terminal/application
    • macOS: System Settings → Privacy & Security → Camera/Screen Recording (System Preferences → Security & Privacy on macOS 12 and earlier)

Error Messages

  • API Key not found: Check .env file configuration
  • Connection refused: Verify network connectivity
  • Audio device error: Check microphone/speaker connections
  • Permission denied: Grant required system permissions

📈 Performance Optimization

Audio Processing

  • Use appropriate chunk sizes (1024-2048 bytes)
  • Implement proper buffering strategies
  • Consider audio compression for network efficiency

Video Processing

  • Limit frame rate to 1 FPS for efficiency
  • Resize images to max 1024x1024
  • Use JPEG compression for bandwidth optimization
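The resize rule amounts to scaling the longer side down to 1024 pixels while keeping the aspect ratio. The sketch below only computes the target size. `fit_within` is a hypothetical helper, and the actual resizing and JPEG encoding would be done by an imaging library such as Pillow.

```python
def fit_within(width, height, max_side=1024):
    """Return (width, height) scaled to fit a max_side x max_side box.

    Images that already fit are left unchanged (never upscaled).
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(1920, 1080))  # (1024, 576)
print(fit_within(640, 480))    # (640, 480)
```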

Function Calling

  • Implement async function execution
  • Use proper error handling and timeouts
  • Cache frequently used data
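The three points above can be combined in a few lines of asyncio. This is a sketch under assumptions rather than code from the repository: `call_with_timeout` and `cached_weather` are hypothetical helpers, and `fetch` stands for any coroutine function that looks up weather for a city.

```python
import asyncio

_weather_cache = {}  # hypothetical in-memory cache keyed by city


async def call_with_timeout(func, *args, timeout=5.0):
    """Await an async tool handler, turning a timeout into an error payload."""
    try:
        return {"result": await asyncio.wait_for(func(*args), timeout)}
    except asyncio.TimeoutError:
        return {"error": f"{func.__name__} timed out after {timeout}s"}


async def cached_weather(city, fetch):
    """Return cached weather for a city, calling fetch(city) only on a miss."""
    if city not in _weather_cache:
        _weather_cache[city] = await fetch(city)
    return _weather_cache[city]
```

Returning an error payload instead of raising lets the session report the failure to the model and keep the audio stream running. A real cache would also need an expiry time, which is omitted here.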

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

📞 Support

For issues and questions:

  1. Check the troubleshooting section above
  2. Review the official Gemini documentation
  3. Open an issue in this repository
  4. Join the Google AI developer community

Note: This is an experimental project for testing and learning purposes. Use responsibly and in accordance with Google's terms of service.
