VoXtream is now open-sourced!

VoXtream is a full-stream zero-shot TTS model for real-time use that begins speaking from the first word.

Key features:
- Streaming: Supports a full-stream scenario, where the full sentence is not known in advance. The model takes a text stream arriving word by word as input and outputs an audio stream in 80 ms chunks.
- Speed: Runs 5x faster than real time and achieves 102 ms first-packet latency on GPU.
- Quality and efficiency: With only 9k hours of training data, it matches or surpasses the quality and intelligibility of larger models or models trained on large datasets.

The work was done under the guidance and supervision of Gustav Eje Henter and Gabriel Skantze.

🔗 Paper: https://lnkd.in/d5u_hfcr
🔗 Demo: https://lnkd.in/d7zj2mgc
🔗 Code: https://lnkd.in/dYv5ceib
🔗 Model: https://lnkd.in/dmkq--8K

#tts #texttospeech #streaming
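To put the numbers in the post in perspective, here is a rough back-of-the-envelope sketch in Python. It only uses the figures quoted above (80 ms chunks, 5x faster than real time, 102 ms first-packet latency) and assumes the 5x figure is an average throughput. The function names are made up for illustration and are not part of the VoXtream code.

```python
# Figures quoted in the post; everything else here is illustrative.
CHUNK_MS = 80.0          # duration of one output audio chunk
SPEEDUP = 5.0            # "5x faster than real time" (assumed to be an average)
FIRST_PACKET_MS = 102.0  # reported first-packet latency on GPU


def compute_budget_ms(chunk_ms: float, speedup: float) -> float:
    """Average wall-clock time spent generating one chunk at the given speedup."""
    return chunk_ms / speedup


def real_time_factor(speedup: float) -> float:
    """Real-time factor (compute time / audio time); below 1.0 means faster than real time."""
    return 1.0 / speedup


def synthesis_time_s(audio_s: float, speedup: float) -> float:
    """Approximate compute time needed to produce `audio_s` seconds of audio."""
    return audio_s / speedup


if __name__ == "__main__":
    print(f"Per-chunk compute: {compute_budget_ms(CHUNK_MS, SPEEDUP)} ms")
    print(f"Real-time factor:  {real_time_factor(SPEEDUP)}")
    print(f"10 s of audio in:  {synthesis_time_s(10.0, SPEEDUP)} s")
```

Under these assumptions, each 80 ms chunk takes about 16 ms to generate on average, which leaves headroom for playback to keep up with a text stream that arrives word by word.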
Awesome, could it be adapted for real-time voice conversion?
Really cool. Roughly what is the VRAM footprint? Quality looks good 🙌. Great work!
Amazing work, Nikita! Will certainly give it a shot soon!
Nicely done! What tool did you use to create this figure? Looks really good.
Congratulations! Very good job 👏🏼👏🏼👏🏼
This is so cool!
Very interesting! Thanks for sharing, Nikita Torgashov.
Amazing work, Nikita Torgashov!!
Nice, looks super useful :) Thanks for making this!
Amazing, Nikita Torgashov! Is there a hosted version of this somewhere?