Voice AI Companions Explained
Voice transforms an AI companion from a texting partner into someone you can actually talk to. Here's how voice AI companions work and what separates good implementations from bad ones.
The two components of voice AI
Voice AI companions require two technologies working together: speech recognition (turning your spoken words into text) and text-to-speech (turning the companion's text reply into spoken audio).
Speech recognition
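Most platforms use either the Web Speech API, which Chrome and Safari expose (typically under the prefixed name webkitSpeechRecognition), or a speech-to-text model such as OpenAI's Whisper running on a server. The Web Speech API needs no extra infrastructure or per-request fee on the platform's side, although some browsers, Chrome included, send the audio to a server-based recognition engine rather than transcribing it on the device. Whisper often offers higher accuracy but requires server processing. AI Companion Studio uses the Web Speech API for low-latency mic input.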
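As a rough illustration of the Web Speech API route, here is a minimal TypeScript sketch. It is not AI Companion Studio's code. The helper names (findRecognizer, finalTranscript, listen) and the onMessage and onError callbacks are hypothetical, and the recognizer types are declared locally because DOM typings for speech recognition vary between TypeScript versions.

```typescript
// Minimal local typings for the parts of the Web Speech API used below.
interface RecognitionAlternative { transcript: string; }
interface RecognitionResult { isFinal: boolean; 0: RecognitionAlternative; }
interface RecognitionEvent { resultIndex: number; results: ArrayLike<RecognitionResult>; }
interface Recognizer {
  lang: string;
  continuous: boolean;
  interimResults: boolean;
  onresult: ((event: RecognitionEvent) => void) | null;
  onerror: ((event: { error: string }) => void) | null;
  start(): void;
  stop(): void;
}
type RecognizerCtor = new () => Recognizer;

// Look for the standard constructor first, then the webkit-prefixed one.
function findRecognizer(scope: Record<string, unknown>): RecognizerCtor | null {
  const ctor = scope["SpeechRecognition"] ?? scope["webkitSpeechRecognition"];
  return typeof ctor === "function" ? (ctor as RecognizerCtor) : null;
}

// Join the finished (final) transcripts from one result event into a message.
function finalTranscript(event: RecognitionEvent): string {
  const parts: string[] = [];
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const result = event.results[i];
    if (result && result.isFinal) parts.push(result[0].transcript.trim());
  }
  return parts.join(" ");
}

// Start listening; each finished utterance is handed to onMessage as text.
// Returns null when the browser has no speech recognition support.
function listen(
  scope: Record<string, unknown>,
  onMessage: (text: string) => void,
  onError: (code: string) => void = () => {},
): Recognizer | null {
  const Ctor = findRecognizer(scope);
  if (Ctor === null) return null; // caller can fall back to typed input
  const recognizer = new Ctor();
  recognizer.lang = "en-US";          // assumed language
  recognizer.continuous = false;      // one utterance per button press
  recognizer.interimResults = false;  // only deliver finished transcripts
  recognizer.onresult = (event) => {
    const text = finalTranscript(event);
    if (text.length > 0) onMessage(text);
  };
  recognizer.onerror = (event) => onError(event.error);
  recognizer.start();
  return recognizer;
}
```

In a browser, the scope argument would be the window object (cast to a plain record), and onMessage would send the transcript through the same path as a typed chat message.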
Text-to-speech
Basic built-in TTS voices tend to sound flat and robotic. Premium platforms use services like ElevenLabs, which produces expressive, natural-sounding voices. AI Companion Studio assigns each of its 20 companions a unique ElevenLabs voice ID, so Ava sounds different from Mei, who sounds different from Ren, and each voice stays consistent with that companion's personality.
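The per-companion voice mapping could look something like the sketch below. It is an illustration, not the platform's actual code: the voice IDs are placeholders, the VOICE_IDS table and buildTtsRequest helper are hypothetical, and the request targets ElevenLabs' text-to-speech endpoint with only the text field set.

```typescript
// Placeholder voice IDs; real IDs come from the ElevenLabs voice library.
const VOICE_IDS: Record<string, string> = {
  Ava: "VOICE_ID_AVA",
  Mei: "VOICE_ID_MEI",
  Ren: "VOICE_ID_REN",
};

interface TtsRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Build (but do not send) a text-to-speech request for one companion's voice.
function buildTtsRequest(companion: string, text: string, apiKey: string): TtsRequest {
  const voiceId = VOICE_IDS[companion];
  if (voiceId === undefined) {
    throw new Error(`No voice configured for companion "${companion}"`);
  }
  return {
    url: `https://api.elevenlabs.io/v1/text-to-speech/${encodeURIComponent(voiceId)}`,
    method: "POST",
    headers: { "xi-api-key": apiKey, "Content-Type": "application/json" },
    body: JSON.stringify({ text }),
  };
}
```

A server would pass the result to fetch and stream the returned audio to the client. Building the request on the server keeps the API key out of the browser.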
Two voice modes: mic input vs full calling
Mic input is the simpler mode: you press a microphone button, speak, your words are transcribed and sent as a text message, and the companion replies. It's conversational but still text-based under the hood.
Full calling is immersive: a full-screen call UI opens, both sides take turns speaking, and the companion's voice replies play through your speaker. AI Companion Studio's calling mode includes a live transcript and a voice activity indicator so you always know whose turn it is to speak.
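The turn-taking in a call can be modelled as a small state machine. The sketch below is one possible shape, under assumed state and event names (none of them come from AI Companion Studio), and it shows how a voice activity indicator could derive its label from the current state.

```typescript
type CallState = "listening" | "thinking" | "speaking" | "ended";
type CallEvent = "userFinishedSpeaking" | "replyReady" | "playbackFinished" | "hangUp";

// Advance the call by one event; events that do not apply are ignored.
function nextState(state: CallState, event: CallEvent): CallState {
  if (event === "hangUp") return "ended";
  switch (state) {
    case "listening":
      return event === "userFinishedSpeaking" ? "thinking" : state;
    case "thinking":
      return event === "replyReady" ? "speaking" : state;
    case "speaking":
      return event === "playbackFinished" ? "listening" : state;
    case "ended":
      return state;
  }
}

// Text for the voice activity indicator, so the user knows whose turn it is.
function indicatorLabel(state: CallState, companion: string): string {
  switch (state) {
    case "listening":
      return "Your turn: speak now";
    case "thinking":
      return `${companion} is thinking`;
    case "speaking":
      return `${companion} is speaking`;
    case "ended":
      return "Call ended";
  }
}
```

Keeping the microphone closed while the state is "speaking" is one simple way to stop the companion's own audio from being transcribed as user input.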
Mobile considerations
Safari on iOS is strict about microphone access. Platforms need a Permissions-Policy header that allows the microphone for their own origin (microphone=(self)), and they need to handle the pagehide lifecycle event to stop the microphone when the user leaves the page. Otherwise the mic can stay active in the background. AI Companion Studio handles both.
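Both measures are small in code. The following sketch assumes a hypothetical helper named releaseMicOnPageHide and uses narrow local interfaces in place of the browser's own types; it is meant to show the idea rather than any platform's real implementation.

```typescript
// Response header (name and value) the server would send with the page.
const MICROPHONE_POLICY: [string, string] = ["Permissions-Policy", "microphone=(self)"];

// Narrow stand-ins for the browser types (MediaStreamTrack, MediaStream, Window).
interface StoppableTrack { stop(): void; }
interface TrackSource { getTracks(): StoppableTrack[]; }
interface PageEvents { addEventListener(type: string, listener: () => void): void; }

// Stop every track of the microphone stream when the page is hidden or
// unloaded, so the mic does not stay active in the background.
function releaseMicOnPageHide(target: PageEvents, stream: TrackSource): void {
  target.addEventListener("pagehide", () => {
    for (const track of stream.getTracks()) {
      track.stop();
    }
  });
}
```

In a browser, target would be window and stream would be the MediaStream returned by navigator.mediaDevices.getUserMedia({ audio: true }).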
Stripping action text for TTS
AI companions sometimes include action descriptions in asterisks, such as *smiles softly*. Well-implemented platforms strip these before sending text to TTS, so the companion doesn't literally say "asterisk smiles softly asterisk" aloud. They also render these actions as styled italic text in the chat interface rather than showing raw asterisks.
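A minimal sketch of both steps, assuming actions are always wrapped in single asterisks on one line (bold markers such as double asterisks are not handled). The function names stripActionsForSpeech and renderActionsAsItalics are hypothetical.

```typescript
// Matches one action description wrapped in single asterisks, e.g. *smiles softly*.
const ACTION = /\*([^*\n]+)\*/g;

// Remove action descriptions so only the spoken dialogue is sent to TTS.
function stripActionsForSpeech(reply: string): string {
  return reply
    .replace(ACTION, " ")               // drop each *action*
    .replace(/\s+/g, " ")               // collapse the gaps left behind
    .replace(/\s+([.,!?;:])/g, "$1")    // no space before punctuation
    .trim();
}

// Escape HTML-special characters before inserting markup.
function escapeHtml(text: string): string {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Render action descriptions as italic text for the chat interface.
function renderActionsAsItalics(reply: string): string {
  return escapeHtml(reply).replace(ACTION, "<em>$1</em>");
}

// stripActionsForSpeech("*smiles softly* Hello there!")   returns "Hello there!"
// renderActionsAsItalics("*smiles softly* Hello there!")  returns "<em>smiles softly</em> Hello there!"
```

The same reply string feeds both functions, so the chat view and the spoken audio stay in sync while each gets the form it needs.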