Earlier this week, we hosted our YC Voice Agents event with an incredible lineup of panelists: Hassaan Raza @ Tavus (YC 21), Anthony Krivonos @ Toma (YC W24), Arkadiy Telegin @ Leaping AI (YC W25), Max Child @ Volley (YC 18).
YC companies were among Sonic’s earliest design partners, and it’s inspiring to see how far real-time voice applications have come — from avatars to call centers to gaming. Here are some highlights: 👇
💡 Advancements in Voice Infrastructure
From 2017 to the early 2020s, the biggest hurdles were hallucinations and accuracy. More recently, latency has been the main blocker. With both accuracy and latency becoming table stakes, the next frontier is all about the user experience:
Naturalness → Not just improving TTS (text-to-speech) and STT (speech-to-text) models themselves, but also the middleware that chains them together for smoother, more human conversation.
Emotional Intelligence → Going beyond transcription to understand true intent.
Personalization → Controllable, brand-differentiated voices that help customers feel recognized.
🎧 What Customers Actually Want
Call Centers → The north star is task completion (e.g. scheduling a sales call or resolving a ticket). Inbound use cases prize accuracy and control, while outbound calls lean more on emotional intelligence and personalization. Reliability at scale and function calling (letting the model trigger actions like booking an appointment; see the sketch after this list) are must-haves.
Avatars → Expressiveness drives realism, especially for digital twins of public figures or creators. Without it, the visual rendering falls flat. Latency is especially critical here, since avatar pipelines chain an even larger number of cascaded models.
Gaming → Entertainment comes first. Players value improvisation, dramatic emotions, and controllability — even hallucinations can feel like part of the fun.
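A quick aside on the function calling mentioned above, since it kept coming up as a must-have: the reasoning model is handed a schema of actions it's allowed to trigger, and when it decides to act it returns a structured call instead of prose. A minimal sketch, assuming an OpenAI-style chat API; the book_appointment tool and its fields are hypothetical, not any panelist's stack:

```python
# Minimal function-calling sketch. OpenAI-style API assumed;
# the book_appointment tool is hypothetical.
from openai import OpenAI

client = OpenAI()

# Declare the action the model may trigger. The schema, not the model,
# defines what "booking an appointment" means to the rest of the system.
tools = [{
    "type": "function",
    "function": {
        "name": "book_appointment",
        "description": "Book an appointment for the caller.",
        "parameters": {
            "type": "object",
            "properties": {
                "date": {"type": "string", "description": "ISO 8601 date"},
                "time": {"type": "string", "description": "24-hour time, e.g. 14:30"},
            },
            "required": ["date", "time"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",  # any tool-capable chat model works here
    messages=[{"role": "user", "content": "Book me in for Friday at 2pm."}],
    tools=tools,
)

# If the model chose to act, it emits structured calls rather than text;
# the application executes them and feeds results back on the next turn.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```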
🧩 Cascading Architectures vs. Speech-to-Speech
Cascading Architectures → Chain together speech-to-text, a reasoning model, and text-to-speech (sketched in code below). This approach offers precision, controllability, and reliable function calling, making it the backbone for enterprise-grade systems.
Speech-to-Speech → Map input speech directly to output speech. These models shine at emotionality and expressiveness, but remain weaker at function calling (often <10% success), harder to control, and more expensive to operate. They're starting to see real business adoption in use cases with simple reasoning, like outbound sales and entertainment.
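To make the contrast concrete, here's a bare-bones cascade in code: a turn-based sketch that uses OpenAI's STT, chat, and TTS endpoints purely as stand-ins (model and voice names are assumptions). Production systems stream audio through every stage and overlap them to hit real-time latency, which this simplified version skips:

```python
# Cascading pipeline sketch: STT -> reasoning model -> TTS.
# Vendors here are illustrative; a real stack streams each stage.
from openai import OpenAI

client = OpenAI()

def voice_turn(audio_path: str) -> bytes:
    # 1) Speech-to-text: transcribe the caller's audio.
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

    # 2) Reasoning: a text LLM decides what to say. Because this stage is
    #    plain text, it keeps the reliable function calling cascades are known for.
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a concise phone agent."},
            {"role": "user", "content": transcript.text},
        ],
    )

    # 3) Text-to-speech: synthesize the reply for playback.
    speech = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=reply.choices[0].message.content,
    )
    return speech.read()  # raw audio bytes (MP3 by default)
```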
👉 Thanks again to our panelists and everyone who joined us! Excited to see what the next generation of builders will create.
🚀 YC founders building your Voice AI roadmap: we’re offering 4 free months on our Scale plan to accelerate your stack → https://lnkd.in/gwKcX2pR