Nikita Torgashov’s Post

PhD Student @ KTH | Conversational AI & Speech Generation

2mo

VoXtream is now open-sourced! VoXtream is a full-stream zero-shot TTS model for real-time use that begins speaking from the first word.

Key features:
- Streaming: Supports a full-stream scenario, where the full sentence is not known in advance. The model takes a text stream arriving word by word as input and outputs an audio stream in 80 ms chunks.
- Speed: Runs 5x faster than real time and achieves 102 ms first-packet latency on GPU.
- Quality and efficiency: With only 9k hours of training data, it matches or surpasses the quality and intelligibility of larger models or models trained on large datasets.

The work was done under the guidance and supervision of Gustav Eje Henter and Gabriel Skantze.

🔗 Paper: https://lnkd.in/d5u_hfcr
🔗 Demo: https://lnkd.in/d7zj2mgc
🔗 Code: https://lnkd.in/dYv5ceib
🔗 Model: https://lnkd.in/dmkq--8K

#tts #texttospeech #streaming
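To put the speed figures in perspective, here is a minimal back-of-the-envelope sketch in Python. It assumes that "5x faster than real time" corresponds to a real-time factor (RTF) of 1/5 in steady state; the function names are illustrative and are not part of the VoXtream code.

```python
CHUNK_MS = 80.0          # audio chunk duration stated in the post
SPEEDUP = 5.0            # "5x faster than real time", as stated in the post
FIRST_PACKET_MS = 102.0  # first-packet latency on GPU, as stated in the post


def real_time_factor(speedup: float) -> float:
    """RTF = compute time / audio duration; below 1.0 means faster than real time."""
    return 1.0 / speedup


def compute_ms_per_chunk(chunk_ms: float, speedup: float) -> float:
    """Approximate compute time needed to produce one audio chunk (assumed steady state)."""
    return chunk_ms * real_time_factor(speedup)


def headroom_ms_per_chunk(chunk_ms: float, speedup: float) -> float:
    """Slack left before the next chunk is due for playback."""
    return chunk_ms - compute_ms_per_chunk(chunk_ms, speedup)


if __name__ == "__main__":
    print(f"RTF: {real_time_factor(SPEEDUP):.2f}")
    print(f"compute per chunk: {compute_ms_per_chunk(CHUNK_MS, SPEEDUP):.1f} ms")
    print(f"headroom per chunk: {headroom_ms_per_chunk(CHUNK_MS, SPEEDUP):.1f} ms")
```

Under these assumptions, each 80 ms chunk takes roughly a fifth of its own duration to generate, which is what leaves room for the audio stream to keep up with playback after the first packet.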

27 Comments

Hasan Shoaib

Co-Founder & CTO @Q9labs | Voice AI | AI Agents

2mo

Amazing Nikita Torgashov! Is there a hosted version of this somewhere?

1 Reaction

Anton Pimenov

Principal Data Scientist: einsum(‘domain,task->solution’, [Voice, Face], [Biometric, AntiSpoof, Generate])

2mo

Awesome! Could it be adapted for real-time voice conversion?

1 Reaction

Christopher Shulby

Machine Learning Engineering Leader

2mo

Really cool. Roughly what is the VRAM footprint? Quality looks good 🙌. Great work!

1 Reaction

Anton Okhotnikov

AI Researcher | SotA Speaker Recognition | UK Global Talent

2mo

Amazing work, Nikita! Will certainly give it a shot soon!

3 Reactions

Muhammad Adil Abid

PhD Candidate @Malmö University | Deep Learning, Data Analyst & Optimization | Pre-hospital Stroke Care | Ambulance Travel Time Estimation

2mo

Nicely done! What tool did you use to create this figure? Looks really good.

Julio Cesar Cavalcanti

Postdoctoral Researcher

2mo

Congratulations! Very good job 👏🏼👏🏼👏🏼

1 Reaction

Juan Pablo Montoya

AI @ Google | Prev @ Microsoft, Cisco

1mo

This is so cool!

1 Reaction

Women in AI & Robotics

2mo

Very interesting! Thanks for sharing Nikita Torgashov .

1 Reaction

Shivam Mehta

Research Scientist @ Netflix. PhD, Ex-Intern @Meta and @Microsoft Research. Working with generative probabilistic models. WASP PhD @ KTH Royal Institute of Technology

2mo

Amazing work Nikita Torgashov !!

1 Reaction

Kaylo Littlejohn

Senior Machine Learning Engineer Roblox | PhD Berkeley AI Research

1mo

nice, looks super useful :) !! thanks for making this

1 Reaction


More Relevant Posts

Paul Kratz

Customer Solutions Architect @ Vonage | AI, API, Strategic Thinking
1mo
🎬 VP9 SVC enables our Media Router to deliver different video resolutions from a single encoded stream, providing many benefits like better video quality on low-bandwidth connections, or improved network efficiency with reduced latency. 💡This technology is more efficient than simulcast and works seamlessly across Chrome, Safari, and Edge. 👀Ready to upgrade your video experience? Read our full blog post for implementation details and technical insights -> https://gag.gl/v8jQTo #mycompany

VP9 SVC in Vonage Video API Released for General Availability developer.vonage.com
Alliance for Open Media (AOMedia)

1,794 followers
1mo
“During testing, we’ve deployed AV1-encoded video to randomly selected groups of users. The feedback is significantly higher for AV1 versus other codecs.” ~ Meta Research Scientist and Technical Leader Ioannis Katsavounidis
Read more in Meta's AV1 Adoption Story 📚 https://brnw.ch/21wWkgL
MockMe.ai

21 followers
1mo
🎧 System Design Interview Question (Google)
Q: Design a denoising system for audio that supports both real-time and batch applications, balancing latency, quality, and deployability.
A (sneak peek):
- Hybrid DSP + ML approach
- Real-time path: RNNoise-style low-latency models
- Batch path: Demucs / Conv-TasNet for high-quality post-processing
- gRPC streaming APIs for live audio
- GPU-backed inference fleets and autoscaling
- Quality metrics: SNR, PESQ, and human MOS
Full breakdown 👉 https://lnkd.in/ec3i_2W9
Gladia

6,833 followers
1mo
How fast is your speech-to-text really? Latency isn’t one number; it’s a timeline. In our new deep dive, we share a reproducible benchmark and practical SLOs for real-time apps.
What's inside:
• A practical overview of key latency metrics and definitions: TTFB, partials vs. finals, endpointing latency, RTF
• Streaming tips: frame size, network/jitter profiling, and balancing trade-offs (speed vs. accuracy)
• Comprehensive guidelines and concrete SLOs you can apply when evaluating STT latency for your application
👉 Read the article: https://lnkd.in/enW22Kri

Gladia - How to measure latency in speech-to-text (TTFB, Partials, Finals, RTF): A deep dive gladia.io
PWV (Preston-Werner Ventures)

539 followers
1mo Edited
Sometimes the next billion-dollar company starts as a doodle (not the dog btw). At PWV, we believe some of the biggest ideas don’t begin with a pitch deck: they begin with curiosity, creativity, and most likely some experimental lines of code. That’s how we saw tldraw’s playful “Draw Fast” demo evolve into fal's generative media infrastructure, now a $1.5B company powering over 1M developers. Our latest post, from GP David Thyresson, explores how small experiments can reveal massive potential, and why “seeing the big picture” often starts with seeing what others overlook.
2 Comments
🚀 Navicstein Chinemerem

Agentic AI | Livekit | Pipecat | Telephony | Python, Node, Go
1mo
Everyone said connecting Asterisk to Pipecat couldn’t be done. I proved them wrong using Go.

The problem? There’s no defined or reliable way to transmit real-time voice from Asterisk’s AudioSocket into Pipecat. Even after resampling audio to the proper rate and adding custom serializers, service validation, and transport adjustments, it simply didn’t work. No audio at all. Just silence and endless debugging.

So I built a Go-based AudioSocket implementation that finally made them talk. It handles real-time audio, DTMF events, and hangup (SLIN), plus custom Pipecat serializers, all using the existing WebSocket transport. No need for a custom layer in Pipecat itself, and since it’s Go, it can scale to hundreds of concurrent connections without breaking a sweat.

I’ll be releasing a full video breakdown soon. If your team is wrestling with fragile real-time audio or AI voice integrations, that’s exactly the kind of rescue and production hardening we specialize in: turning messy prototypes into reliable, scalable systems.

#Asterisk #Golang #Pipecat #VoiceAI #Telephony #RealTime #WebRTC #Concurrency #OpenSource
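For readers unfamiliar with AudioSocket, the framing such a bridge has to handle can be sketched in a few lines of Python. This is not the author's Go implementation. It assumes the commonly described AudioSocket layout of a 1-byte message kind, a 2-byte big-endian payload length, and then the payload; the kind codes below are assumptions that should be checked against the Asterisk documentation before use.

```python
import struct

# Message kinds as assumed here; verify against the Asterisk AudioSocket docs.
KIND_HANGUP = 0x00
KIND_UUID = 0x01
KIND_DTMF = 0x03
KIND_AUDIO = 0x10  # assumed: signed linear 16-bit PCM (SLIN)
KIND_ERROR = 0xFF

HEADER = struct.Struct(">BH")  # 1-byte kind, 2-byte big-endian payload length


def encode_frame(kind: int, payload: bytes = b"") -> bytes:
    """Prefix a payload with the assumed 3-byte AudioSocket header."""
    return HEADER.pack(kind, len(payload)) + payload


def decode_frames(buffer: bytes):
    """Split a byte buffer into (kind, payload) frames.

    Returns the complete frames plus any leftover bytes of an incomplete frame,
    so the caller can prepend them to the next read from the socket.
    """
    frames = []
    offset = 0
    while len(buffer) - offset >= HEADER.size:
        kind, length = HEADER.unpack_from(buffer, offset)
        end = offset + HEADER.size + length
        if end > len(buffer):
            break  # incomplete frame; wait for more bytes
        frames.append((kind, buffer[offset + HEADER.size:end]))
        offset = end
    return frames, buffer[offset:]


# Example: one audio frame, one DTMF digit, and a truncated third frame.
audio = b"\x00\x01" * 160
stream = encode_frame(KIND_AUDIO, audio) + encode_frame(KIND_DTMF, b"5") + b"\x10\x01"
frames, leftover = decode_frames(stream)
```

Returning the leftover bytes keeps the parser usable on a TCP stream, where a single read can end in the middle of a frame.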
3 Comments
Nicola Mastascusa

Freelance VFX Producer, VG Asset Producer
1mo
For anyone interested in open-source video generation: best read up. The Wan 2.5 preview is behind closed doors for now and might not be released in its current form, but my sources indicate that Wan 2.5 will eventually be released as open source. Given the community's ability to quantize models, once Wan 2.5 is released, it will quickly get quantized and be able to run on fairly modest consumer GPUs. https://lnkd.in/g3g_37Kk

Qwen qwen.ai
Anmol Gupta

PhD (RuG × IIT Roorkee) | CEO @MythyaVerse | Neuroscience → AI → AGI | LLMs, alignment, cognitive architectures
1mo
Sora 2 can do things that are exceptionally difficult for prior video generation models. It's more physically accurate and realistic than prior systems and comes with synchronized audio. It's impressive to see how OpenAI, which had been lagging behind in video generation, has finally managed to solve the physics of gymnasts where other models still struggle. Let's see how long it will take open source to catch up this time!
Daniel Ince

Applied AI Engineer @ AssemblyAI
1mo
the ttfb (time to first byte) metric for measuring STT latency for voice agents is inaccurate. here's why you should be using ttct (time to complete transcript) instead:

ttfb only measures: silence -> start of speech -> first transcript chunk

it completely ignores the more critical path, which is ttct: end of speech -> silence -> final complete transcript (to be sent to the LLM)

this can lead to some interesting conclusions:
- streaming models that emit fast but inaccurate partials get rewarded (easy to game)
- async models run in a low-latency fashion look like they perform terribly with ttfb due to no mid-speech partials (even though they can work well in a voice agent)

but the thing is: you're not using partials to generate the LLM response, you're using finals, so that is what you should measure

models also have different configs for emitting partials vs finals: with partials you want to optimise for speed, with finals you want to optimise for accuracy (without affecting latency too much), so you will get different speeds

at AssemblyAI we're laser focused on optimising for TTCT while maintaining industry-leading accuracy, and we're shipping a new update today to our streaming model to return transcripts faster than any other provider via a new field called "Utterance", which emits finals on any 160ms silence so you can generate your LLM response as quickly as possible. give your voice agents a noticeable speed-up today by testing it out ;)
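the difference between the two metrics is easy to see in a small Python sketch. the event type, function names, and timestamps below are hypothetical and only illustrate the definitions given in the post; they are not AssemblyAI's API or measured data.

```python
from dataclasses import dataclass


@dataclass
class TranscriptEvent:
    t_ms: float     # time the transcript event arrived
    is_final: bool  # True for a final transcript, False for a partial


def ttfb_ms(speech_start_ms: float, events: list[TranscriptEvent]) -> float:
    """Start of speech -> first transcript event of any kind."""
    return min(e.t_ms for e in events) - speech_start_ms


def ttct_ms(speech_end_ms: float, events: list[TranscriptEvent]) -> float:
    """End of speech -> first final transcript arriving after it.

    Raises ValueError if no final arrives after the end of speech.
    """
    finals = [e.t_ms for e in events if e.is_final and e.t_ms >= speech_end_ms]
    return min(finals) - speech_end_ms


# Hypothetical utterance: speech from 0 ms to 2000 ms.
SPEECH_START, SPEECH_END = 0.0, 2000.0

# Streaming-style model: early partials, late final.
streaming = [
    TranscriptEvent(150.0, False),
    TranscriptEvent(600.0, False),
    TranscriptEvent(2700.0, True),
]

# Async-style model: no partials, but an earlier final.
async_style = [TranscriptEvent(2300.0, True)]

streaming_scores = (ttfb_ms(SPEECH_START, streaming), ttct_ms(SPEECH_END, streaming))
async_scores = (ttfb_ms(SPEECH_START, async_style), ttct_ms(SPEECH_END, async_style))
```

with these made-up timestamps the streaming-style model wins on ttfb (150 ms vs 2300 ms) but loses on ttct (700 ms vs 300 ms), which is the mismatch the post describes.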

6 Comments
