This project is a Python-based, real-time conversational AI agent system built on top of LiveKit and Rime.ai. It enables hyper-realistic, character-driven voice agents that can join LiveKit audio rooms, respond to users in natural language, and speak with expressive, customizable voices. The system leverages advanced TTS (text-to-speech) models, OpenAI LLMs (large language models), and a modular plugin architecture for extensibility.
- Project Summary
- Folder Structure
- Key Components
- Core Technologies
- Setup & Installation
- Environment Variables & API Keys
- Running the Agent
- Customization & Prompt Engineering
- Technical Notes
- Demo/Deployment Tips
- References
rime-livekit-agents/
│
├── .env # Environment variables (API keys, URLs)
├── agent_configs.py # Voice/personality configs and prompt engineering
├── rime_agent.py # Main agent logic and entrypoint
├── requirements.txt # Python dependencies
├── text_utils.py # Custom sentence tokenizer for TTS
├── TECHNICAL_OVERVIEW.md # This technical documentation
├── README.md # Basic project info
├── KMS/ # (Optional) Key Management Service or logs
│ └── logs/
└── __pycache__/ # Python bytecode cache
rime_agent.py:
- Main entry point for the agent.
- Handles LiveKit room connection, session management, plugin integration, and event loop.
- Integrates TTS, LLM, STT, noise cancellation, and turn detection.
agent_configs.py:
- Defines agent personalities, TTS settings, and prompt engineering.
- Example: "celeste" persona with a clingy, playful, flirty university girlfriend style.
- Each persona can have unique TTS speed, model, and prompt.
text_utils.py:
- Implements custom sentence tokenization for advanced TTS models (e.g., Arcana).
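To illustrate what sentence tokenization for TTS involves, here is a minimal sketch that splits text on terminal punctuation followed by whitespace. It is not the implementation in text_utils.py; the function name split_sentences and the splitting rule are assumptions made for illustration only.

```python
import re

# Split wherever '.', '!' or '?' is followed by whitespace.
# Assumption: this simple rule is enough for short conversational replies.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Return the non-empty sentences found in text."""
    parts = _SENTENCE_BOUNDARY.split(text.strip())
    return [part for part in parts if part]
```

Feeding a TTS model one sentence at a time lets audio start before the LLM has finished its full reply. A production tokenizer would also need to handle abbreviations and decimal numbers, which this sketch does not.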
- LiveKit: Real-time audio/video infrastructure for scalable, multi-user rooms.
- Rime.ai: Hyper-realistic TTS models ("arcana", "mistv2").
- OpenAI: LLMs (e.g., GPT-4o-mini) for generating conversational responses.
- Python 3.11+: All orchestration and logic.
- LiveKit Plugins: For noise cancellation, turn detection, and TTS enhancements.
git clone https://github.com/uw-datasci/AI-GF.git
Windows:
python -m venv .venv
.venv\Scripts\activate
macOS/Linux:
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python rime_agent.py download-files
This downloads Hugging Face models used when an agent has provider: "huggingface" for TTS or LLM (e.g. Léa). Models are cached locally so the agent can run TTS and LLM via the transformers library in-process. STT is Silero (local) or OpenAI; no Hugging Face STT. Requires transformers and torch (see requirements.txt).
Create a .env file in the project root with the following keys:
LIVEKIT_URL=wss://<your-livekit-server>.livekit.cloud
LIVEKIT_API_KEY=<your-livekit-api-key>
LIVEKIT_API_SECRET=<your-livekit-api-secret>
OPENAI_API_KEY=<your-openai-api-key>
RIME_API_KEY=<your-rime-api-key>
ELEVEN_API_KEY=<your-elevenlabs-api-key>
SMALLEST_API_KEY=<your-smallest-ai-api-key>
# Optional: Tavus avatar integration
TAVUS_API_KEY=<your-tavus-api-key>
TAVUS_REPLICA_ID=<your-tavus-replica-id>
Required API Keys:
- LiveKit: For connecting to your LiveKit Cloud or self-hosted server.
- OpenAI: For LLM responses (ensure your key has quota).
- Rime.ai: For TTS (Arcana, Mistv2, etc.).
- ElevenLabs (optional): For ElevenLabs TTS voices.
- Smallest AI (optional): For Waves TTS and Pulse STT. Get a key at console.smallest.ai.
- Tavus (optional): For avatar video integration.
Note:
- Do not surround values with quotes unless the value contains spaces.
- If you see quota errors, check your OpenAI or Rime.ai usage and billing.
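A quick pre-flight check can catch a missing key before the agent starts. The sketch below is an assumption about how such a check could look, not part of the project: the name missing_keys is hypothetical, and the required set simply follows the keys above that are not marked optional.

```python
from typing import Mapping

# Keys listed above without an "(optional)" marker.
# Assumption: fully local setups may need fewer of these.
REQUIRED_KEYS = (
    "LIVEKIT_URL",
    "LIVEKIT_API_KEY",
    "LIVEKIT_API_SECRET",
    "OPENAI_API_KEY",
    "RIME_API_KEY",
)


def missing_keys(env: Mapping[str, str], required=REQUIRED_KEYS) -> list[str]:
    """Return the required keys that are absent or blank in env."""
    return [key for key in required if not env.get(key, "").strip()]
```

Calling missing_keys(os.environ) after the .env file has been loaded into the process environment returns an empty list when everything is set.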
Run the agent in console mode for local testing:
python rime_agent.py console
Connects the agent to a LiveKit room:
python rime_agent.py dev
- Ensure all required environment variables are set in .env.
- The agent will join the specified LiveKit room and respond to participants.
- Press Ctrl + C in the terminal to stop the agent at any time.
The system prompt can be a plain string or an object with type and content:
- String: "personality_prompt": "You are Katerina..."
- URL: "personality_prompt": { "type": "URL", "content": "https://example.com/prompt.txt" }
- File path: "personality_prompt": { "type": "File Path", "content": "prompts/katerina.txt" } (relative to project root)
Use content or Content; type is one of: String, URL, File Path.
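The three prompt forms could be resolved to plain text roughly as follows. This is a sketch under stated assumptions, not the project's loader: resolve_prompt is a hypothetical name, type matching is made case-insensitive for convenience, and the URL is fetched with the standard library.

```python
from pathlib import Path
from urllib.request import urlopen


def resolve_prompt(value, project_root: Path = Path(".")) -> str:
    """Turn a personality_prompt value (string or object) into prompt text."""
    if isinstance(value, str):
        return value  # plain string form
    # Accept either spelling of the content key, as described above.
    content = value.get("content", value.get("Content"))
    if content is None:
        raise ValueError("prompt object needs a content (or Content) key")
    kind = str(value.get("type", "String")).strip().lower()
    if kind == "string":
        return content
    if kind == "url":
        with urlopen(content, timeout=10) as response:
            return response.read().decode("utf-8")
    if kind == "file path":
        # File paths are relative to the project root.
        return (project_root / content).read_text(encoding="utf-8")
    raise ValueError(f"unknown prompt type: {value.get('type')!r}")
```

Resolving once at startup keeps network or disk errors out of the conversation loop.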
tts has top-level provider, model, url, and a nested voice_options object for provider-specific options. stt uses the same top-level shape.
"tts": {
"provider": "elevenlabs",
"model": "eleven_multilingual_v2",
"url": null,
"voice_options": { "voice_id": "...", "optimize_streaming_latency": 3 }
},
"stt": { "provider": "openai", "model": "gpt-4o-mini-transcribe", "url": null }
- provider / model / url: same as before.
- tts.voice_options: ElevenLabs voice_id, model_id, optimize_streaming_latency; Kokoro voice, speed, base_url; Rime speaker, speed_alpha, reduce_latency, max_tokens; Smallest AI voice_id, speed, sample_rate.
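Because voice_options differ per provider, a typo in a key is easy to miss. The sketch below flags keys that are not in the lists above. The helper name unknown_voice_options is hypothetical, and the lowercase provider identifiers "rime" and "kokoro" are assumptions (only "elevenlabs" and "smallestai" appear verbatim in this document).

```python
# Option names per provider, copied from the list above.
KNOWN_VOICE_OPTIONS = {
    "elevenlabs": {"voice_id", "model_id", "optimize_streaming_latency"},
    "kokoro": {"voice", "speed", "base_url"},
    "rime": {"speaker", "speed_alpha", "reduce_latency", "max_tokens"},
    "smallestai": {"voice_id", "speed", "sample_rate"},
}


def unknown_voice_options(tts: dict) -> list[str]:
    """Return voice_options keys that are not listed for tts['provider']."""
    allowed = KNOWN_VOICE_OPTIONS.get(tts.get("provider", ""))
    if allowed is None:
        return []  # provider not covered by this table; nothing to check
    options = tts.get("voice_options") or {}
    return sorted(set(options) - allowed)
```

An empty result means every key is recognized; it does not validate the values themselves.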
Local / embedded models (no API): For Silero, TTS and STT run locally inside the agent process (torch.hub). For Hugging Face, TTS and LLM can run in-process (transformers). STT is Silero (local) or OpenAI (cloud); Hugging Face STT was removed.
- Chrystèle (Silero TTS/STT, local LLM): "tts": { "provider": "silero", "voice_options": { "language": "en", "speaker": "lj_16khz" } }, "stt": { "provider": "silero", "language": "en" }, "vad": { "provider": "silero", "model": "silero_vad" }. TTS and STT use snakers4/silero-models (torch.hub) in-process. LLM can be LM Studio (OpenAI-compatible URL) or another local server.
- Léa (Hugging Face TTS/LLM, local): "tts" and "llm" use "provider": "huggingface" with Hugging Face Hub model IDs (see plugins/hf_tts.py, plugins/hf_llm.py). STT uses OpenAI (or set "stt": { "provider": "silero" } for local). Run python rime_agent.py download-files once to cache HF models. No API for TTS/LLM; models run locally in the agent.
- Smallest AI (cloud): "tts": { "provider": "smallestai", "model": "lightning", "voice_options": { "voice_id": "emily", "speed": 1.0 } }, "stt": { "provider": "smallestai" }. TTS uses Waves API; STT uses Pulse API. Requires SMALLEST_API_KEY in .env. Get a key at console.smallest.ai.
Alternative (OpenAI-compatible servers): You can use local servers (e.g. Ollama, Whisper API, Kokoro) and "provider": "openai" with "url": "http://localhost:..." in the agent JSON. Those are still local but run in a separate process; Silero and Hugging Face run embedded in the agent with no separate server.
The table below shows which services each provider offers:
| Provider | TTS | STT | VOD (Voice on Demand / Live Voice) | Type | API Key Env Var |
|---|---|---|---|---|---|
| Rime | ✅ Arcana, Mistv2 | ❌ | ✅ LiveKit real-time | Cloud | RIME_API_KEY |
| ElevenLabs | ✅ v2, v3, Turbo | ❌ | ✅ LiveKit real-time | Cloud | ELEVEN_API_KEY |
| OpenAI | ✅ (via compatible API) | ✅ Whisper, gpt-4o-mini-transcribe | ✅ LiveKit real-time | Cloud | OPENAI_API_KEY |
| Smallest AI | ✅ Waves (lightning, lightning-large) | ✅ Pulse (32+ languages) | ✅ LiveKit real-time | Cloud | SMALLEST_API_KEY |
| Kokoro | ✅ OpenAI-compatible local server | ❌ | ✅ LiveKit real-time | Local (server) | None (self-hosted) |
| Silero | ✅ silero_tts (torch.hub) | ✅ silero_stt (torch.hub) | ✅ LiveKit real-time | Local (in-process) | None |
| Hugging Face | ✅ SpeechT5 and others | ❌ (removed; use Silero or OpenAI) | ✅ LiveKit real-time | Local (in-process) | HF_TOKEN (optional) |
| Whisper (local server) | ❌ | ✅ via OpenAI-compatible API | ✅ LiveKit real-time | Local (server) | None (self-hosted) |
| DeepSeek | ❌ | ❌ | ❌ (LLM only) | Cloud | DEEPSEEK_API_KEY |
| Google | ❌ | ❌ | ❌ (LLM only) | Cloud | GOOGLE_API_KEY |
| Anthropic | ❌ | ❌ | ❌ (LLM only) | Cloud | ANTHROPIC_API_KEY |
VOD = Voice-on-Demand / live voice interaction via LiveKit. All TTS providers support real-time voice since audio is streamed through LiveKit rooms. Providers marked ❌ for VOD are LLM-only and do not provide speech services.
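The speech rows of the table can also be expressed as data, which makes questions like "which providers offer STT?" easy to answer in code. This is an illustrative sketch only: SPEECH_PROVIDERS and providers_with are hypothetical names, and LLM-only providers are left out.

```python
# Speech providers from the table above: (TTS, STT, deployment type).
SPEECH_PROVIDERS = {
    "Rime": (True, False, "Cloud"),
    "ElevenLabs": (True, False, "Cloud"),
    "OpenAI": (True, True, "Cloud"),
    "Smallest AI": (True, True, "Cloud"),
    "Kokoro": (True, False, "Local (server)"),
    "Silero": (True, True, "Local (in-process)"),
    "Hugging Face": (True, False, "Local (in-process)"),
    "Whisper (local server)": (False, True, "Local (server)"),
}


def providers_with(service: str, local_only: bool = False) -> list[str]:
    """List providers offering 'tts' or 'stt', optionally local ones only."""
    column = {"tts": 0, "stt": 1}[service]
    return sorted(
        name
        for name, row in SPEECH_PROVIDERS.items()
        if row[column] and (not local_only or row[2].startswith("Local"))
    )
```

For example, providers_with("stt", local_only=True) lists the options that need no cloud API key.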
vad configures Voice Activity Detection (when the user is speaking). It supports provider and model; optionally onnx_file_path for a custom ONNX file when using Silero.
"vad": { "provider": "silero", "model": "silero_vad" }
- provider: "silero" (default) or "huggingface". Silero is used for all agents today; when provider is "huggingface", the config is in place for a future HF VAD plugin.
- model: Identifier for the VAD model. For Silero, use "silero_vad", which is the bundled ONNX model (silero_vad.onnx) from livekit-plugins-silero (snakers4/silero-vad). If omitted, the same default is used.
- onnx_file_path (optional): Path to a custom Silero VAD ONNX file; if set, this file is loaded instead of the bundled model.
Chrystèle uses "vad": { "provider": "silero", "model": "silero_vad" }. Léa can use "provider": "huggingface" for future HF VAD alignment.
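The defaults described for vad could be applied along these lines. This is a sketch, not the project's config loader: normalize_vad is a hypothetical name, and rejecting unknown providers is an assumption.

```python
from typing import Optional


def normalize_vad(vad: Optional[dict] = None) -> dict:
    """Fill in the documented vad defaults without mutating the input."""
    config = dict(vad or {})
    config.setdefault("provider", "silero")  # documented default provider
    if config["provider"] not in {"silero", "huggingface"}:
        raise ValueError(f"unsupported vad provider: {config['provider']!r}")
    if config["provider"] == "silero":
        # Omitting model means the bundled silero_vad model is used.
        config.setdefault("model", "silero_vad")
    return config
```

An optional onnx_file_path passes through unchanged, so the caller can decide whether to load the custom file or the bundled model.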
To make agents sound livelier, use expressive tags in personality_prompt and intro_phrase. The full tag list is injected into the prompt passed to the model based on tts.provider: Rime/Silero/Kokoro use angle brackets (<laugh>, <sigh>, <whis>...</whis>); ElevenLabs v3 uses square brackets ([laughs], [sighs], [whispers]). See docs/LIVEKIT_TTS_TAGS.md for the full list and usage.
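As a rough illustration of provider-dependent tag injection, the sketch below appends a tag hint to the prompt. It uses only the example tags named above, not the full list from docs/LIVEKIT_TTS_TAGS.md. The names tag_hint and build_prompt are hypothetical, the lowercase provider identifiers are assumptions, and the sketch keys on the provider alone rather than checking for an ElevenLabs v3 model.

```python
# Example tags from the paragraph above (not the full list).
ANGLE_TAGS = ["<laugh>", "<sigh>", "<whis>...</whis>"]
SQUARE_TAGS = ["[laughs]", "[sighs]", "[whispers]"]
ANGLE_PROVIDERS = {"rime", "silero", "kokoro"}


def tag_hint(provider: str) -> str:
    """Return a one-line hint listing the tags this provider understands."""
    if provider in ANGLE_PROVIDERS:
        tags = ANGLE_TAGS
    elif provider == "elevenlabs":
        tags = SQUARE_TAGS
    else:
        return ""  # no expressive tags known for this provider
    return "Expressive tags you may use: " + ", ".join(tags)


def build_prompt(personality_prompt: str, provider: str) -> str:
    """Append the provider-specific tag hint to the personality prompt."""
    hint = tag_hint(provider)
    return f"{personality_prompt}\n\n{hint}" if hint else personality_prompt
```

Keeping the tag syntax out of the persona text itself means the same persona can be moved between TTS providers without rewriting its prompt.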
Edit agent_configs.py to:
- Add new personas (copy the "celeste" config and modify).
- Change TTS speed ("speed_alpha"), model, or speaker.
- Update the llm_prompt for different conversational styles.
- Adjust intro_phrase for custom greetings.
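The edits listed above can be sketched as a small helper that copies an existing persona and overrides selected fields. The keys follow this document's "celeste" example; AGENT_CONFIGS, derive_persona, and the persona name "nova" are hypothetical and may not match the names used in agent_configs.py.

```python
import copy

# Stand-in for the persona table; the real one lives in agent_configs.py.
AGENT_CONFIGS = {
    "celeste": {
        "tts_options": {"model": "arcana", "speaker": "celeste", "speed_alpha": 1.0},
        "llm_prompt": "...",
        "intro_phrase": "hey cutie... <laugh> I was just thinking about you. what are you up to?",
    }
}


def derive_persona(configs, base, name, *, speed_alpha=None, llm_prompt=None, intro_phrase=None):
    """Register a deep copy of configs[base] under name with optional overrides."""
    persona = copy.deepcopy(configs[base])  # deep copy so nested tts_options is not shared
    if speed_alpha is not None:
        persona["tts_options"]["speed_alpha"] = speed_alpha
    if llm_prompt is not None:
        persona["llm_prompt"] = llm_prompt
    if intro_phrase is not None:
        persona["intro_phrase"] = intro_phrase
    configs[name] = persona
    return persona
```

For example, derive_persona(AGENT_CONFIGS, "celeste", "nova", speed_alpha=0.9, intro_phrase="hi there") adds a second persona and leaves "celeste" untouched.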
Example:
"celeste": {
"tts_options": {
"model": "arcana",
"speaker": "celeste",
"speed_alpha": 1.0, # 1.0 = normal speed
...
},
"llm_prompt": "...",
"intro_phrase": "hey cutie... <laugh> I was just thinking about you. what are you up to?",
}
- Lower "speed_alpha" if TTS is too fast for avatar sync.
- Dependencies:
  - Uses a forked version of livekit-plugins-rime for improved Arcana support (see requirements.txt).
  - All audio processing, TTS, and LLM calls are asynchronous for low latency.
- Plugins:
  - Noise cancellation (livekit-plugins-noise-cancellation)
  - Turn detection (livekit-plugins-turn-detector)
- Extensibility:
  - Add new plugins, voices, or logic by extending the agent/session classes.
- Microphone Selection:
  - By default, uses the system default input device.
  - To change, modify the code to set the desired device index using sounddevice or another relevant library.
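One possible way to select a microphone with sounddevice is sketched below. The helper names are hypothetical, and whether the agent's audio path honors sounddevice's default device depends on how the project opens its input stream, which this document does not specify.

```python
from typing import Optional, Sequence


def pick_input_device(devices: Sequence[dict], name_fragment: str) -> Optional[int]:
    """Return the index of the first input-capable device whose name contains name_fragment."""
    wanted = name_fragment.lower()
    for index, device in enumerate(devices):
        if device.get("max_input_channels", 0) > 0 and wanted in device["name"].lower():
            return index
    return None


def use_microphone(name_fragment: str) -> int:
    """Make the matching device sounddevice's default input device."""
    import sounddevice as sd  # third-party; imported here so the helper above stays dependency-free

    index = pick_input_device(sd.query_devices(), name_fragment)
    if index is None:
        raise LookupError(f"no input device matching {name_fragment!r}")
    # sd.default.device is an (input, output) pair; keep the current output device.
    sd.default.device = (index, sd.default.device[1])
    return index
```

Matching by name rather than hard-coding an index is more robust, since device indices can change when hardware is plugged in or removed.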
- For Demos:
  - Highlight real-time, character-driven voice interaction.
  - Show expressive TTS and persona switching.
  - Demonstrate easy customization via agent_configs.py.
  - Explain integration with LiveKit for scalable audio rooms.
- For Production:
  - Deploy on a cloud VM or service (e.g., Render, AWS, Azure).
  - Use secure storage for API keys.
  - Monitor usage and quotas for OpenAI and Rime.ai.
  - Optionally, connect a web or mobile frontend via LiveKit APIs.
- Quota Errors:
  - If you see insufficient_quota or 429 errors, check your OpenAI or Rime.ai account usage and billing.
- Audio Sync Issues:
  - If TTS audio is faster than the avatar, lower "speed_alpha" in agent_configs.py.
- Missing Dependencies:
  - Re-run pip install -r requirements.txt in your activated virtual environment.
- Microphone Issues:
  - Ensure your preferred input device is set as default, or modify the code to select a specific device.
- LiveKit Agents Documentation
- Rime.ai
- LiveKit Cloud
- OpenAI Platform
- ElevenLabs
- Smallest AI — Python SDK
- Tavus (if using avatar video)
This document provides a comprehensive technical overview and setup guide for the Rime LiveKit Agents project. For further details, see the codebase and README.